# Regularization for Unsupervised Deep Neural Nets

Baiyang Wang, Diego Klabjan
Department of Industrial Engineering and Management Sciences, Northwestern University, 2145 Sheridan Road, C210, Evanston, Illinois 60208

Unsupervised neural networks, such as restricted Boltzmann machines (RBMs) and deep belief networks (DBNs), are powerful tools for feature selection and pattern recognition tasks. We demonstrate that overfitting occurs in such models just as in deep feedforward neural networks, and discuss possible regularization methods to reduce overfitting. We also propose a partial approach that improves the efficiency of Dropout/Drop Connect in this scenario, and discuss the theoretical justification of these methods in terms of model convergence and likelihood bounds. Finally, we compare the performance of these methods based on their likelihood and classification error rates on various pattern recognition data sets.

1 Introduction

Unsupervised neural networks assume unlabeled data to be generated from a neural network structure, and have been applied extensively to pattern analysis and recognition. The most basic one is the restricted Boltzmann machine (RBM) (Salakhutdinov, Mnih, and Hinton 2007), an energy-based model with a layer of hidden nodes and a layer of visible nodes. With such a basic structure, we can stack multiple layers of RBMs to create an unsupervised deep neural network, such as the deep belief network (DBN) or the deep Boltzmann machine (DBM) (Hinton, Osindero, and Teh 2006; Salakhutdinov and Hinton 2009a). These models can be calibrated with a combination of stochastic gradient descent and the contrastive divergence (CD) algorithm or the PCD algorithm (Salakhutdinov, Mnih, and Hinton 2007; Tieleman 2008). Once we learn the parameters of a model, we can retrieve the values of the hidden nodes from the visible nodes, thus applying unsupervised neural networks to feature selection. Alternatively, we may use the parameters obtained from an unsupervised deep neural network to initialize a deep feedforward neural network (FFNN), thus improving supervised learning.

One essential question for such models is how to adjust for the high dimensionality of their parameters and avoid overfitting. In FFNNs, the simplest regularization is arguably early stopping, which halts the gradient descent algorithm before the validation error rate goes up. The weight decay method, or $L_s$ regularization, is also commonly used (Witten, Frank, and Hall 2011). More recently, Dropout was proposed, which optimizes the parameters over an average of exponentially many models, each defined on a subset of all nodes (Srivastava et al. 2014). It has been shown to outperform weight decay regularization in many situations.

For regularizing unsupervised neural networks, sparse RBM-type models encourage a smaller proportion of 1-valued hidden nodes (Cho, Ilin, and Raiko 2012; Lee, Ekanadham, and Ng 2007). DBNs are regularized in Goh et al. (2013) with outcome labels. While these works tend to be goal-specific, we consider regularization for unsupervised neural networks in a more general setting.
Our work and contributions are as follows: (1) we extend common regularization methods to unsupervised deep neural networks and explain their underlying mechanisms; (2) we propose partial Dropout/Drop Connect, which can improve the performance of Dropout/Drop Connect; (3) we compare the performance of different regularization methods on real data sets, thus providing suggestions for regularizing unsupervised neural networks. We note that this is the first study illustrating the mechanisms of various regularization methods for unsupervised neural nets through model convergence and likelihood bounds, including the effective newly proposed partial Dropout/Drop Connect.

Section 2 reviews recent work on regularizing neural networks, and Section 3 exhibits RBM regularization as a basis for regularizing deeper networks. Section 4 discusses the model convergence of each regularization method. Section 5 extends regularization to unsupervised deep neural nets. Section 6 presents a numerical comparison of different regularization methods on the RBM, DBN, DBM, RSM (Salakhutdinov and Hinton 2009b), and Gaussian RBM (Salakhutdinov, Mnih, and Hinton 2007). Section 7 discusses potential future research and concludes the paper.

2 Related Works

To begin with, we consider a simple FFNN with a single layer of inputs $\imath = (\imath_1, \ldots, \imath_I)^T$ and a single layer of outputs $o = (o_1, \ldots, o_J)^T \in \{0,1\}^J$. The weight matrix $W$ is of size $J \times I$. We assume the relation

$$E(o) = a(W \imath), \qquad (1)$$

where $a(\cdot)$ is the activation function, such as the sigmoid function $\sigma(x) = 1/(1 + e^{-x})$ applied element-wise. Equation (1) takes the following modified form in Srivastava et al. (2014),

$$E(o|m) = a(m \odot (W \imath)), \quad m = (m_1, \ldots, m_J)^T \overset{iid}{\sim} \mathrm{Ber}(p), \qquad (2)$$

where $\odot$ denotes element-wise multiplication and $\mathrm{Ber}(\cdot)$ denotes the Bernoulli distribution, thereby achieving the Dropout (DO) regularization for neural networks. In Dropout, we minimize the objective function

$$\ell_{DO}(W) = -\sum_{n=1}^{N} E_m[\log p(o^{(n)}|\imath^{(n)}, W, m)], \qquad (3)$$

which can be achieved by a stochastic gradient descent algorithm, sampling a different mask $m$ per data example $(o^{(n)}, \imath^{(n)})$ and per iteration. We observe that this can be readily extended to deep FFNNs.

Dropout regularizes neural networks because it incorporates prediction based on any subset of all the nodes, therefore penalizing the likelihood. A theoretical explanation of Dropout is provided in Wager, Wang, and Liang (2013), noting that it can be viewed as feature noising for GLMs, which yields the relation

$$\ell_{DO}(W) \approx -\sum_{n=1}^{N} \log p(o^{(n)}|\imath^{(n)}, W) + R^{q}(W). \qquad (4)$$

Here $J = 1$ for simplicity, and $R^{q}(W) = \frac{1}{2}\frac{p}{1-p}\sum_{n=1}^{N}\sum_{i=1}^{I} A''(W \imath^{(n)}) (\imath_i^{(n)})^2 W_i^2$, where $A(\cdot)$ is the log-partition function of a GLM. Therefore, Dropout can be viewed approximately as adaptive $L_2$ regularization (Baldi and Sadowski 2013; Wager, Wang, and Liang 2013). A recursive approximation of Dropout using normalized weighted geometric means is provided in Baldi and Sadowski (2013) to study its averaging properties.

An intuitive extension of Dropout is Drop Connect (DC) (Wan et al. 2013), which has the form

$$E(o|m) = a((m \odot W) \imath), \quad m = (m_{ij})_{J \times I} \overset{iid}{\sim} \mathrm{Ber}(p), \qquad (5)$$

and thus masks the weights rather than the nodes. The objective $\ell_{DC}(W)$ has the same form as in (3).
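To make the distinction between (2) and (5) concrete, the following minimal sketch (our own illustrative code, not part of the original paper; the function names and the NumPy setup are assumptions) applies a node mask versus a weight mask in a single sigmoid layer. During training, a fresh mask would be sampled for each example and iteration, and the objective in (3) optimized by stochastic gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dropout_forward(W, inp, p=0.5):
    """Eq. (2): mask the J output units, m_j ~ Ber(p), applied inside the activation."""
    m = rng.binomial(1, p, size=W.shape[0])      # one mask entry per output node
    return sigmoid(m * (W @ inp))                # a(m ⊙ (W ı))

def dropconnect_forward(W, inp, p=0.5):
    """Eq. (5): mask the J x I weight matrix, m_ij ~ Ber(p)."""
    m = rng.binomial(1, p, size=W.shape)         # one mask entry per weight
    return sigmoid((m * W) @ inp)                # a((m ⊙ W) ı)

# toy dimensions: I = 4 inputs, J = 3 outputs
W = rng.normal(scale=0.1, size=(3, 4))
inp = rng.binomial(1, 0.5, size=4).astype(float)
print(dropout_forward(W, inp))
print(dropconnect_forward(W, inp))
```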
There are a number of related model-averaging regularization methods, each of which averages over subsets of the original model. For instance, Standout varies the Dropout probabilities across nodes, with the probabilities determined by a binary belief network (Ba and Frey 2013). Shakeout adds additional noise to Dropout so that it approximates elastic-net regularization (Kang, Li, and Tao 2016). Fast Dropout accelerates Dropout with a Gaussian approximation (Wang and Manning 2013). Variational Dropout applies variational Bayes to infer the Dropout function (Kingma, Salimans, and Welling 2015).

We note that while Dropout has been discussed for RBMs (Srivastava et al. 2014), to the best of our knowledge, there is no literature extending common regularization methods, such as adaptive $L_s$ regularization and Drop Connect, to RBMs and unsupervised deep neural networks. Therefore, below we discuss their implementations and examine their empirical performance. In addition to studying model convergence and likelihood bounds, we propose partial Dropout/Drop Connect, which iteratively drops a subset of nodes or edges based on a given calibrated model, therefore improving robustness in many situations.

3 RBM Regularization

For a restricted Boltzmann machine, we assume that $v = (v_1, \ldots, v_J)^T \in \{0,1\}^J$ denotes the visible vector and $h = (h_1, \ldots, h_I)^T \in \{0,1\}^I$ denotes the hidden vector. Each $v_j$, $j = 1, \ldots, J$, is a visible node and each $h_i$, $i = 1, \ldots, I$, is a hidden node. The joint probability is

$$P(v, h) = e^{-E(v,h)} \Big/ \sum_{\nu, \eta} e^{-E(\nu, \eta)}, \qquad E(v, h) = -b^T v - c^T h - h^T W v. \qquad (6)$$

We let the parameters be $\vartheta = (b, c, W) \in \Theta$, a vector containing all components of $b$, $c$, and $W$. To calibrate the model is to find $\hat{\vartheta} = \arg\max_{\vartheta \in \Theta} \sum_{n=1}^{N} \log P(v^{(n)}|\vartheta)$. An RBM is a neural network because we have the following conditional probabilities

$$P(h_i = 1|v) = \sigma(c_i + W_i v), \qquad P(v_j = 1|h) = \sigma(b_j + W_{\cdot j}^T h), \qquad (7)$$

where $W_i$ and $W_{\cdot j}$ represent, respectively, the $i$-th row and $j$-th column of $W$. The gradient descent algorithm is applied for calibration. The gradient of the log-likelihood can be expressed in the following form

$$\frac{\partial \log P(v^{(n)})}{\partial \vartheta} = -\frac{\partial F(v^{(n)})}{\partial \vartheta} + \sum_{v \in \{0,1\}^J} P(v) \frac{\partial F(v)}{\partial \vartheta}, \qquad (8)$$

where $F(v) = -b^T v - \sum_{i=1}^{I} \log(1 + e^{c_i + W_i v})$ is the free energy. The right-hand side of (8) is approximated by contrastive divergence with $k$ steps of Gibbs sampling (CD-$k$) (Salakhutdinov, Mnih, and Hinton 2007).

3.1 Weight Decay Regularization

Weight decay, or $L_s$ regularization, adds the term $\lambda \|W\|_s^s$ to the negative log-likelihood of an RBM. The most commonly used are $L_2$ (ridge regression) and $L_1$ (LASSO). In all situations, we do not regularize biases for simplicity.

Here we consider a more general form. Suppose we have a trained set of weights $\hat{W}$ from CD with no regularization. Instead of adding the term $\lambda \|W\|_s^s$, we add the term $\mu \sum_{i,j}^{I,J} |W_{ij}|^s / |\hat{W}_{ij}|^s$ to the negative log-likelihood. This adjusts for the different scales of the components of $W$. We refer to this approach as adaptive $L_s$. We note that adaptive $L_1$ is the adaptive LASSO (Zou 2006), and adaptive $L_2$ plus $L_1$ is the elastic-net (Zou and Hastie 2005). We consider the performance of $L_2$ regularization plus adaptive $L_1$ regularization ($L_2$ + A$L_1$) below.

3.2 Model Averaging Regularization

As discussed in Srivastava et al. (2014), to characterize a Dropout (DO) RBM, we simply need to apply the following conditional distributions

$$P_{DO}(h_i = 1|v, m) = m_i\, \sigma(c_i + W_i v), \qquad P_{DO}(v_j = 1|h, m) = \sigma(b_j + W_{\cdot j}^T h). \qquad (9)$$

Therefore, given a fixed mask $m \in \{0,1\}^I$, we actually obtain an RBM with all visible nodes $v$ and hidden nodes $\{h_i : m_i = 1\}$. Hidden nodes $\{h_i : m_i = 0\}$ are fixed to zero, so they have no influence on the conditional RBM. Apart from replacing (7) with (9), the only other change needed is to replace $F(v)$ with $F_{DO}(v|m) = -b^T v - \sum_{i=1}^{I} m_i \log(1 + e^{c_i + W_i v})$.
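As an illustration of the quantities above, the sketch below (written by us; the variable names, toy sizes, and the single gradient-ascent step are assumptions) computes the free energy $F(v)$, its Dropout-masked variant $F_{DO}(v|m)$, and a CD-$k$ estimate of the gradient in (8) for one data example.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def free_energy(v, b, c, W, m=None):
    """F(v) = -b^T v - sum_i log(1 + exp(c_i + W_i v));
    with a Dropout mask m, each hidden term is weighted by m_i (F_DO)."""
    terms = np.log1p(np.exp(c + W @ v))
    if m is not None:
        terms = m * terms
    return -v @ b - terms.sum()

def cd_k(v0, b, c, W, k=1):
    """CD-k approximation of the log-likelihood gradient in (8) for one example v0."""
    v = v0.copy()
    for _ in range(k):                                   # k steps of block Gibbs sampling
        h = (rng.random(c.size) < sigmoid(c + W @ v)).astype(float)
        v = (rng.random(b.size) < sigmoid(b + W.T @ h)).astype(float)
    ph0, phk = sigmoid(c + W @ v0), sigmoid(c + W @ v)   # E[h|v] at the data and at the sample
    return np.outer(ph0, v0) - np.outer(phk, v), v0 - v, ph0 - phk

# toy example: J = 6 visible nodes, I = 4 hidden nodes
J, I = 6, 4
b, c, W = np.zeros(J), np.zeros(I), 0.01 * rng.normal(size=(I, J))
v0 = rng.binomial(1, 0.5, size=J).astype(float)
print(free_energy(v0, b, c, W), free_energy(v0, b, c, W, m=rng.binomial(1, 0.5, size=I)))
gW, gb, gc = cd_k(v0, b, c, W, k=1)
W += 0.1 * gW; b += 0.1 * gb; c += 0.1 * gc              # one ascent step on the log-likelihood
```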
In terms of training, we suggest sampling a different mask per data example $v^{(n)}$ and per iteration, as in Srivastava et al. (2014). A Drop Connect (DC) RBM is closely related: given a mask $m \in \{0,1\}^{I \times J}$ on the weights, $W$ in a plain RBM is replaced by $m \odot W$ everywhere. We suggest sampling a different mask $m$ per mini-batch, since it is usually much larger than a mask in a Dropout RBM.

3.3 Network Pruning Regularization

There are typically many nodes or weights which are of little importance in a neural network. In network pruning, such unimportant nodes or weights are discarded, and the neural network is retrained. This process can be conducted iteratively (Reed 1993). We now consider two variants of network pruning for RBMs. For a trained set of weights $\hat{W}$ with no regularization, we consider implementing a fixed mask $m = (m_{ij})_{I \times J}$ where

$$m_{ij} = 1_{|\hat{W}_{ij}| \geq Q}, \qquad Q = Q_{100(1-p)\%}(|\hat{W}|), \qquad (10)$$

i.e., $Q$ is the $100(1-p)\%$-th left percentile of all $|\hat{W}_{ij}|$, and $p \in (0,1)$ is some fixed proportion of retained weights. We then recalibrate the weights and biases with the mask $m$ fixed, leading to a simple network pruning (SNP) procedure which deletes $100(1-p)\%$ of all weights. We may also consider deleting $100(1-p)/r\%$ of all weights at a time and conducting the above process $r$ times, leading to an iterative network pruning (INP) procedure.

3.4 Hybrid Regularization

We may consider combining some of the above approaches. For instance, Srivastava et al. (2014) considered a combination of $L_s$ and Dropout. We introduce two new hybrid approaches, namely partial Drop Connect (PDC), presented in Algorithm 1, and partial Dropout (PDO), which generalize Drop Connect and Dropout and borrow from network pruning. The rationale comes from some of the model convergence results exhibited later. As before, suppose we have a trained set of weights $\hat{W}$ with no regularization. Instead of implementing a fixed mask $m$, we perform Drop Connect regularization with a different retaining probability $p_{ij}$ for each weight $W_{ij}$. We let the quantile $Q = Q_{100(1-q)\%}(|\hat{W}|)$ and set $p_{ij} = 1_{|\hat{W}_{ij}| \geq Q} + p_0\, 1_{|\hat{W}_{ij}| < Q}$.
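To make the mask constructions in this section concrete, here is a brief sketch (our own illustration; the function names, NumPy calls, and parameter values are assumptions) of the fixed pruning mask in (10) and of the retaining probabilities $p_{ij}$ used by partial Drop Connect.

```python
import numpy as np

rng = np.random.default_rng(0)

def pruning_mask(W_hat, p):
    """Simple network pruning, eq. (10): retain the largest 100p% of |W_hat|."""
    Q = np.quantile(np.abs(W_hat), 1.0 - p)        # 100(1-p)%-th percentile of |W_hat_ij|
    return (np.abs(W_hat) >= Q).astype(float)      # m_ij = 1{|W_hat_ij| >= Q}

def pdc_retain_probs(W_hat, q, p0):
    """Partial Drop Connect: the largest 100q% of weights are always retained;
    the remaining weights are retained with probability p0."""
    Q = np.quantile(np.abs(W_hat), 1.0 - q)
    return np.where(np.abs(W_hat) >= Q, 1.0, p0)   # p_ij

W_hat = rng.normal(size=(4, 6))                    # a calibrated weight matrix (toy values)
m_fixed = pruning_mask(W_hat, p=0.8)               # fixed mask for SNP retraining
p_ij = pdc_retain_probs(W_hat, q=0.8, p0=0.5)
m_pdc = rng.binomial(1, p_ij)                      # Drop Connect mask resampled per mini-batch
```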