# Regularization for Unsupervised Deep Neural Nets

Baiyang Wang, Diego Klabjan
Department of Industrial Engineering and Management Sciences, Northwestern University, 2145 Sheridan Road, C210, Evanston, Illinois 60208

Unsupervised neural networks, such as restricted Boltzmann machines (RBMs) and deep belief networks (DBNs), are powerful tools for feature selection and pattern recognition tasks. We demonstrate that overfitting occurs in such models just as in deep feedforward neural networks, and discuss possible regularization methods to reduce overfitting. We also propose a partial approach that improves the efficiency of Dropout/Drop Connect in this scenario, and discuss the theoretical justification of these methods in terms of model convergence and likelihood bounds. Finally, we compare the performance of these methods based on their likelihood and classification error rates on various pattern recognition data sets.

1 Introduction

Unsupervised neural networks assume unlabeled data to be generated from a neural network structure, and have been applied extensively to pattern analysis and recognition. The most basic one is the restricted Boltzmann machine (RBM) (Salakhutdinov, Mnih, and Hinton 2007), an energy-based model with a layer of hidden nodes and a layer of visible nodes. With such a basic structure, we can stack multiple layers of RBMs to create an unsupervised deep neural network, such as the deep belief network (DBN) or the deep Boltzmann machine (DBM) (Hinton, Osindero, and Teh 2006; Salakhutdinov and Hinton 2009a). These models can be calibrated with a combination of stochastic gradient descent and the contrastive divergence (CD) algorithm or the PCD algorithm (Salakhutdinov, Mnih, and Hinton 2007; Tieleman 2008). Once we learn the parameters of a model, we can retrieve the values of the hidden nodes from the visible nodes, thus applying unsupervised neural networks to feature selection. Alternatively, we may use the parameters obtained from an unsupervised deep neural network to initialize a deep feedforward neural network (FFNN), thus improving supervised learning.

One essential question for such models is how to adjust for the high dimensionality of their parameters and avoid overfitting. In FFNNs, the simplest regularization is arguably early stopping, which halts the gradient descent algorithm before the validation error rate goes up. The weight decay method, or $L_s$ regularization, is also commonly used (Witten, Frank, and Hall 2011). More recently, Dropout was proposed, which optimizes the parameters over an average of exponentially many models, each defined on a subset of all nodes (Srivastava et al. 2014). It has been shown to outperform weight decay regularization in many situations.

For regularizing unsupervised neural networks, sparse RBM-type models encourage a smaller proportion of 1-valued hidden nodes (Cho, Ilin, and Raiko 2012; Lee, Ekanadham, and Ng 2007). DBNs are regularized in Goh et al. (2013) with outcome labels. While these works tend to be goal-specific, we consider regularization for unsupervised neural networks in a more general setting.
Our work and contributions are as follows: (1) we extend common regularization methods to unsupervised deep neural networks and explain their underlying mechanisms; (2) we propose partial Dropout/Drop Connect, which can improve the performance of Dropout/Drop Connect; (3) we compare the performance of different regularization methods on real data sets, thus providing suggestions for regularizing unsupervised neural networks. We note that this is the first study illustrating the mechanisms of various regularization methods for unsupervised neural nets through model convergence and likelihood bounds, including the effective newly proposed partial Dropout/Drop Connect.

Section 2 reviews recent work on regularizing neural networks, and Section 3 exhibits RBM regularization as a basis for regularizing deeper networks. Section 4 discusses the model convergence of each regularization method. Section 5 extends regularization to unsupervised deep neural nets. Section 6 presents a numerical comparison of different regularization methods on the RBM, DBN, DBM, RSM (Salakhutdinov and Hinton 2009b), and Gaussian RBM (Salakhutdinov, Mnih, and Hinton 2007). Section 7 discusses potential future research and concludes the paper.

2 Related Works

To begin with, we consider a simple FFNN with a single layer of inputs $\imath = (\imath_1, \ldots, \imath_I)^T$ and a single layer of outputs $o = (o_1, \ldots, o_J)^T \in \{0,1\}^J$. The weight matrix $W$ is of size $J \times I$. We assume the relation

$$E(o) = a(W \imath), \qquad (1)$$

where $a(\cdot)$ is the activation function, such as the sigmoid function $\sigma(x) = 1/(1 + e^{-x})$ applied element-wise. Equation (1) takes the following modified form in Srivastava et al. (2014),

$$E(o|m) = a(m \odot (W \imath)), \quad m = (m_1, \ldots, m_J)^T \overset{iid}{\sim} \mathrm{Ber}(p), \qquad (2)$$

where $\odot$ denotes element-wise multiplication and $\mathrm{Ber}(\cdot)$ denotes the Bernoulli distribution, thereby achieving the Dropout (DO) regularization for neural networks. In Dropout, we minimize the objective function

$$\ell_{DO}(W) = -\sum_{n=1}^{N} E_m[\log p(o^{(n)}|\imath^{(n)}, W, m)], \qquad (3)$$

which can be achieved by a stochastic gradient descent algorithm, sampling a different mask $m$ per data example $(o^{(n)}, \imath^{(n)})$ and per iteration. We observe that this can be readily extended to deep FFNNs.

Dropout regularizes neural networks because it incorporates prediction based on any subset of all the nodes, therefore penalizing the likelihood. A theoretical explanation of Dropout is provided in Wager, Wang, and Liang (2013), noting that it can be viewed as feature noising for GLMs, which yields the relation

$$\ell_{DO}(W) \approx -\sum_{n=1}^{N} \log p(o^{(n)}|\imath^{(n)}, W) + R^{q}(W). \qquad (4)$$

Here $J = 1$ for simplicity, and $R^{q}(W) = \frac{1}{2}\frac{p}{1-p}\sum_{n=1}^{N}\sum_{i=1}^{I} A''(W \imath^{(n)}) (\imath_i^{(n)})^2 W_i^2$, where $A(\cdot)$ is the log-partition function of a GLM. Therefore, Dropout can be viewed approximately as adaptive $L_2$ regularization (Baldi and Sadowski 2013; Wager, Wang, and Liang 2013). A recursive approximation of Dropout using normalized weighted geometric means is provided in Baldi and Sadowski (2013) to study its averaging properties.

An intuitive extension of Dropout is Drop Connect (DC) (Wan et al. 2013), which has the form

$$E(o|m) = a((m \odot W) \imath), \quad m = (m_{ij})_{J \times I} \overset{iid}{\sim} \mathrm{Ber}(p), \qquad (5)$$

and thus masks the weights rather than the nodes. The objective $\ell_{DC}(W)$ has the same form as in (3).
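To make the distinction between (2) and (5) concrete, the following minimal sketch (our own illustrative code, not part of the original paper; the function names and the NumPy setup are assumptions) applies a node mask versus a weight mask in a single sigmoid layer. During training, a fresh mask would be sampled for each example and iteration, and the objective in (3) optimized by stochastic gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dropout_forward(W, inp, p=0.5):
    """Eq. (2): mask the J output units, m_j ~ Ber(p), applied inside the activation."""
    m = rng.binomial(1, p, size=W.shape[0])      # one mask entry per output node
    return sigmoid(m * (W @ inp))                # a(m ⊙ (W ı))

def dropconnect_forward(W, inp, p=0.5):
    """Eq. (5): mask the J x I weight matrix, m_ij ~ Ber(p)."""
    m = rng.binomial(1, p, size=W.shape)         # one mask entry per weight
    return sigmoid((m * W) @ inp)                # a((m ⊙ W) ı)

# toy dimensions: I = 4 inputs, J = 3 outputs
W = rng.normal(scale=0.1, size=(3, 4))
inp = rng.binomial(1, 0.5, size=4).astype(float)
print(dropout_forward(W, inp))
print(dropconnect_forward(W, inp))
```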
There are a number of related model-averaging regularization methods, each of which averages over subsets of the original model. For instance, Standout varies the Dropout probabilities across nodes, with the probabilities determined by a binary belief network (Ba and Frey 2013). Shakeout adds additional noise to Dropout so that it approximates elastic-net regularization (Kang, Li, and Tao 2016). Fast Dropout accelerates Dropout with a Gaussian approximation (Wang and Manning 2013). Variational Dropout applies variational Bayes to infer the Dropout function (Kingma, Salimans, and Welling 2015).

We note that while Dropout has been discussed for RBMs (Srivastava et al. 2014), to the best of our knowledge, there is no literature extending common regularization methods, such as adaptive $L_s$ regularization and Drop Connect, to RBMs and unsupervised deep neural networks. Therefore, below we discuss their implementations and examine their empirical performance. In addition to studying model convergence and likelihood bounds, we propose partial Dropout/Drop Connect, which iteratively drops a subset of nodes or edges based on a given calibrated model, therefore improving robustness in many situations.

3 RBM Regularization

For a restricted Boltzmann machine, we assume that $v = (v_1, \ldots, v_J)^T \in \{0,1\}^J$ denotes the visible vector and $h = (h_1, \ldots, h_I)^T \in \{0,1\}^I$ denotes the hidden vector. Each $v_j$, $j = 1, \ldots, J$, is a visible node and each $h_i$, $i = 1, \ldots, I$, is a hidden node. The joint probability is

$$P(v, h) = e^{-E(v,h)} \Big/ \sum_{\nu, \eta} e^{-E(\nu, \eta)}, \qquad E(v, h) = -b^T v - c^T h - h^T W v. \qquad (6)$$

We let the parameters be $\vartheta = (b, c, W) \in \Theta$, a vector containing all components of $b$, $c$, and $W$. To calibrate the model is to find $\hat{\vartheta} = \arg\max_{\vartheta \in \Theta} \sum_{n=1}^{N} \log P(v^{(n)}|\vartheta)$. An RBM is a neural network because we have the following conditional probabilities

$$P(h_i = 1|v) = \sigma(c_i + W_i v), \qquad P(v_j = 1|h) = \sigma(b_j + W_{\cdot j}^T h), \qquad (7)$$

where $W_i$ and $W_{\cdot j}$ represent, respectively, the $i$-th row and $j$-th column of $W$. The gradient descent algorithm is applied for calibration. The gradient of the log-likelihood can be expressed in the following form

$$\frac{\partial \log P(v^{(n)})}{\partial \vartheta} = -\frac{\partial F(v^{(n)})}{\partial \vartheta} + \sum_{v \in \{0,1\}^J} P(v) \frac{\partial F(v)}{\partial \vartheta}, \qquad (8)$$

where $F(v) = -b^T v - \sum_{i=1}^{I} \log(1 + e^{c_i + W_i v})$ is the free energy. The right-hand side of (8) is approximated by contrastive divergence with $k$ steps of Gibbs sampling (CD-$k$) (Salakhutdinov, Mnih, and Hinton 2007).

3.1 Weight Decay Regularization

Weight decay, or $L_s$ regularization, adds the term $\lambda \|W\|_s^s$ to the negative log-likelihood of an RBM. The most commonly used are $L_2$ (ridge regression) and $L_1$ (LASSO). In all situations, we do not regularize biases for simplicity.

Here we consider a more general form. Suppose we have a trained set of weights $\hat{W}$ from CD with no regularization. Instead of adding the term $\lambda \|W\|_s^s$, we add the term $\mu \sum_{i,j}^{I,J} |W_{ij}|^s / |\hat{W}_{ij}|^s$ to the negative log-likelihood. This adjusts for the different scales of the components of $W$. We refer to this approach as adaptive $L_s$. We note that adaptive $L_1$ is the adaptive LASSO (Zou 2006), and adaptive $L_2$ plus $L_1$ is the elastic-net (Zou and Hastie 2005). We consider the performance of $L_2$ regularization plus adaptive $L_1$ regularization ($L_2$ + A$L_1$) below.

3.2 Model Averaging Regularization

As discussed in Srivastava et al. (2014), to characterize a Dropout (DO) RBM, we simply need to apply the following conditional distributions

$$P_{DO}(h_i = 1|v, m) = m_i\, \sigma(c_i + W_i v), \qquad P_{DO}(v_j = 1|h, m) = \sigma(b_j + W_{\cdot j}^T h). \qquad (9)$$

Therefore, given a fixed mask $m \in \{0,1\}^I$, we actually obtain an RBM with all visible nodes $v$ and hidden nodes $\{h_i : m_i = 1\}$. Hidden nodes $\{h_i : m_i = 0\}$ are fixed to zero, so they have no influence on the conditional RBM. Apart from replacing (7) with (9), the only other change needed is to replace $F(v)$ with $F_{DO}(v|m) = -b^T v - \sum_{i=1}^{I} m_i \log(1 + e^{c_i + W_i v})$.
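As an illustration of the quantities above, the sketch below (written by us; the variable names, toy sizes, and the single gradient-ascent step are assumptions) computes the free energy $F(v)$, its Dropout-masked variant $F_{DO}(v|m)$, and a CD-$k$ estimate of the gradient in (8) for one data example.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def free_energy(v, b, c, W, m=None):
    """F(v) = -b^T v - sum_i log(1 + exp(c_i + W_i v));
    with a Dropout mask m, each hidden term is weighted by m_i (F_DO)."""
    terms = np.log1p(np.exp(c + W @ v))
    if m is not None:
        terms = m * terms
    return -v @ b - terms.sum()

def cd_k(v0, b, c, W, k=1):
    """CD-k approximation of the log-likelihood gradient in (8) for one example v0."""
    v = v0.copy()
    for _ in range(k):                                   # k steps of block Gibbs sampling
        h = (rng.random(c.size) < sigmoid(c + W @ v)).astype(float)
        v = (rng.random(b.size) < sigmoid(b + W.T @ h)).astype(float)
    ph0, phk = sigmoid(c + W @ v0), sigmoid(c + W @ v)   # E[h|v] at the data and at the sample
    return np.outer(ph0, v0) - np.outer(phk, v), v0 - v, ph0 - phk

# toy example: J = 6 visible nodes, I = 4 hidden nodes
J, I = 6, 4
b, c, W = np.zeros(J), np.zeros(I), 0.01 * rng.normal(size=(I, J))
v0 = rng.binomial(1, 0.5, size=J).astype(float)
print(free_energy(v0, b, c, W), free_energy(v0, b, c, W, m=rng.binomial(1, 0.5, size=I)))
gW, gb, gc = cd_k(v0, b, c, W, k=1)
W += 0.1 * gW; b += 0.1 * gb; c += 0.1 * gc              # one ascent step on the log-likelihood
```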
In terms of training, we suggest sampling a different mask per data example $v^{(n)}$ and per iteration, as in Srivastava et al. (2014). A Drop Connect (DC) RBM is closely related: given a mask $m \in \{0,1\}^{I \times J}$ on the weights, $W$ in a plain RBM is replaced by $m \odot W$ everywhere. We suggest sampling a different mask $m$ per mini-batch, since it is usually much larger than a mask in a Dropout RBM.

3.3 Network Pruning Regularization

There are typically many nodes or weights which are of little importance in a neural network. In network pruning, such unimportant nodes or weights are discarded, and the neural network is retrained. This process can be conducted iteratively (Reed 1993). We now consider two variants of network pruning for RBMs. For a trained set of weights $\hat{W}$ with no regularization, we consider implementing a fixed mask $m = (m_{ij})_{I \times J}$ where

$$m_{ij} = 1_{|\hat{W}_{ij}| \geq Q}, \qquad Q = Q_{100(1-p)\%}(|\hat{W}|), \qquad (10)$$

i.e., $Q$ is the $100(1-p)\%$-th left percentile of all $|\hat{W}_{ij}|$, and $p \in (0,1)$ is some fixed proportion of retained weights. We then recalibrate the weights and biases with the mask $m$ fixed, leading to a simple network pruning (SNP) procedure which deletes $100(1-p)\%$ of all weights. We may also consider deleting $100(1-p)/r\%$ of all weights at a time and conducting the above process $r$ times, leading to an iterative network pruning (INP) procedure.

3.4 Hybrid Regularization

We may consider combining some of the above approaches. For instance, Srivastava et al. (2014) considered a combination of $L_s$ and Dropout. We introduce two new hybrid approaches, namely partial Drop Connect (PDC), presented in Algorithm 1, and partial Dropout (PDO), which generalize Drop Connect and Dropout and borrow from network pruning. The rationale comes from some of the model convergence results exhibited later. As before, suppose we have a trained set of weights $\hat{W}$ with no regularization. Instead of implementing a fixed mask $m$, we perform Drop Connect regularization with a different retaining probability $p_{ij}$ for each weight $W_{ij}$. We let the quantile $Q = Q_{100(1-q)\%}(|\hat{W}|)$ and set $p_{ij} = 1_{|\hat{W}_{ij}| \geq Q} + p_0\, 1_{|\hat{W}_{ij}| < Q}$.
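To make the mask constructions in this section concrete, here is a brief sketch (our own illustration; the function names, NumPy calls, and parameter values are assumptions) of the fixed pruning mask in (10) and of the retaining probabilities $p_{ij}$ used by partial Drop Connect.

```python
import numpy as np

rng = np.random.default_rng(0)

def pruning_mask(W_hat, p):
    """Simple network pruning, eq. (10): retain the largest 100p% of |W_hat|."""
    Q = np.quantile(np.abs(W_hat), 1.0 - p)        # 100(1-p)%-th percentile of |W_hat_ij|
    return (np.abs(W_hat) >= Q).astype(float)      # m_ij = 1{|W_hat_ij| >= Q}

def pdc_retain_probs(W_hat, q, p0):
    """Partial Drop Connect: the largest 100q% of weights are always retained;
    the remaining weights are retained with probability p0."""
    Q = np.quantile(np.abs(W_hat), 1.0 - q)
    return np.where(np.abs(W_hat) >= Q, 1.0, p0)   # p_ij

W_hat = rng.normal(size=(4, 6))                    # a calibrated weight matrix (toy values)
m_fixed = pruning_mask(W_hat, p=0.8)               # fixed mask for SNP retraining
p_ij = pdc_retain_probs(W_hat, q=0.8, p0=0.5)
m_pdc = rng.binomial(1, p_ij)                      # Drop Connect mask resampled per mini-batch
```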