# dropout_with_expectationlinear_regularization__59cd14fa.pdf Published as a conference paper at ICLR 2017 DROPOUT WITH EXPECTATION-LINEAR REGULARIZATION Xuezhe Ma, Yingkai Gao Language Technologies Institute Carnegie Mellon University {xuezhem, yingkaig}@cs.cmu.edu Zhiting Hu, Yaoliang Yu Machine Learning Department Carnegie Mellon University {zhitinghu, yaoliang}@cs.cmu.edu Yuntian Deng School of Engineering and Applied Sciences Harvard University dengyuntian@gmail.com Eduard Hovy Language Technologies Institute Carnegie Mellon University hovy@cmu.edu Dropout, a simple and effective way to train deep neural networks, has led to a number of impressive empirical successes and spawned many recent theoretical investigations. However, the gap between dropout s training and inference phases, introduced due to tractability considerations, has largely remained under-appreciated. In this work, we first formulate dropout as a tractable approximation of some latent variable model, leading to a clean view of parameter sharing and enabling further theoretical analysis. Then, we introduce (approximate) expectation-linear dropout neural networks, whose inference gap we are able to formally characterize. Algorithmically, we show that our proposed measure of the inference gap can be used to regularize the standard dropout training objective, resulting in an explicit control of the gap. Our method is as simple and efficient as standard dropout. We further prove the upper bounds on the loss in accuracy due to expectation-linearization, describe classes of input distributions that expectation-linearize easily. Experiments on three image classification benchmark datasets demonstrate that reducing the inference gap can indeed improve the performance consistently. 1 INTRODUCTION Deep neural networks (DNNs, e.g., Le Cun et al., 2015; Schmidhuber, 2015), if trained properly, have been demonstrated to significantly improve the benchmark performances in a wide range of application domains. As neural networks go deeper and deeper, naturally, its model complexity also increases quickly, hence the pressing need to reduce overfitting in training DNNs. A number of techniques have emerged over the years to address this challenge, among which dropout (Hinton et al., 2012; Srivastava, 2013) has stood out for its simplicity and effectiveness. In a nutshell, dropout randomly drops neural units during training as a means to prevent feature co-adaptation a sign of overfitting (Hinton et al., 2012). Simple as it appears to be, dropout has led to several record-breaking performances (Hinton et al., 2012; Ma & Hovy, 2016), and thus spawned a lot of recent interests in analyzing and justifying dropout from the theoretical perspective, and also in further improving dropout from the algorithmic and practical perspective. In their pioneering work, Hinton et al. (2012) and Srivastava et al. (2014) interpreted dropout as an extreme form of model combination (aka. model ensemble) with extensive parameter/weight sharing, and they proposed to learn the combination through minimizing an appropriate expected loss. Interestingly, they also pointed out that for a single logistic neural unit, the output of dropout is in fact the geometric mean of the outputs of the model ensemble with shared parameters. Subsequently, many theoretical justifications of dropout have been explored, and we can only mention a few here due to space limits. Building on the weight sharing perspective, Baldi & Sadowski (2013; 2014) analyzed the ensemble averaging property of dropout in deep non-linear logistic networks, and supported the view that dropout is equivalent to applying stochastic gradient descent on some regularized Published as a conference paper at ICLR 2017 loss function. Wager et al. (2013) treated dropout as an adaptive regularizer for generalized linear models (GLMs). Helmbold & Long (2016) discussed the differences between dropout and traditional weight decay regularization. In terms of statistical learning theory, Gao & Zhou (2014) studied the Rademacher complexity of different types of dropout, showing that dropout is able to reduce the Rademacher complexity polynomially for shallow neural networks (with one or no hidden layers) and exponentially for deep neural networks. This latter work (Gao & Zhou, 2014) formally demonstrated that dropout, due to its regularizing effect, contributes to reducing the inherent model complexity, in particular the variance component in the generalization error. Seen as a model combination technique, it is intuitive that dropout contributes to reducing the variance of the model performance. Surprisingly, dropout has also been shown to play some role in reducing the model bias. For instance, Jain et al. (2015) studied the ability of dropout training to escape local minima, hence leading to reduced model bias. Other studies (Chen et al., 2014; Helmbold & Long, 2014; Wager et al., 2014) focus on the effect of the dropout noise on models with shallow architectures. We noted in passing that there are also some work (Kingma et al., 2015; Gal & Ghahramani, 2015; 2016) trying to understand dropout from the Bayesian perspective. In this work, we first formulate dropout as a tractable approximation of a latent variable model, and give a clean view of weight sharing ( 3). Then, we focus on an inference gap in dropout that has somehow gotten under-appreciated: In the inference phase, for computational tractability considerations, the model ensemble generated by dropout is approximated by a single model with scaled weights, resulting in a gap between training and inference, and rendering the many previous theoretical findings inapplicable. In general, this inference gap can be very large and no attempt (to our best knowledge) has been made to control it. We make three contributions in bridging this gap: Theoretically, we introduce expectation-linear dropout neural networks, through which we are able to explicitly quantify the inference gap ( 4). In particular, our theoretical results explain why the max-norm constraint on the network weights, a standard practice in training DNNs, can lead to a small inference gap hence potentially improve performance. Algorithmically, we propose to add a sampled version of the inference gap to regularize the standard dropout training objective (expectationlinearization), hence allowing explicit control of the inference gap, and analyze the interaction between expectation-linearization and the model accuracy ( 5). Experimentally, through three benchmark datasets we show that our regularized dropout is not only as simple and efficient as standard dropout but also consistently leads to improved performance ( 6). 2 DROPOUT NEURAL NETWORKS In this section we set up the notations, review the dropout neural network model, and discuss the inference gap in standard dropout training that we will attempt to study in the rest of the paper. 2.1 DNNS AND NOTATIONS Throughout we use uppercase letters for random variables (and occasionally for matrices as well), and lowercase letters for realizations of the corresponding random variables. Let X X be the input of the neural network, Y Y be the desired output, and D = {(x1, y1), . . . , (x N, y N)} be our training sample, where xi, i = 1, . . . , N, (resp. yi) are usually i.i.d. samples of X (resp. Y ). Let M denote a deep neural network with L hidden layers, indexed by l {1, . . . , L}. Let h(l) denote the output vector from layer l. As usual, h(0) = x is the input, and h(L) is the output of the neural network. Denote θ = {θl : l = 1, . . . , L} as the set of parameters in the network M, where θl assembles the parameters in layer l. With dropout, we need to introduce a set of dropout random variables S = {Γ(l) : l = 1, . . . , L}, where Γ(l) is the dropout random variable for layer l. Then the deep neural network M can be described as: h(l) = fl(h(l 1) γ(l); θl), l = 1, . . . , L, (1) where is the element-wise product and fl is the transformation function of layer l. For example, if layer l is a fully connected layer with weight matrix W, bias vector b, and sigmoid activation function σ(x) = 1 1+exp( x), then fl(x) = σ(Wx + b)). We will also use h(l)(x, s; θ) to denote the output of layer l with input x and dropout value s, under parameter θ. Published as a conference paper at ICLR 2017 In the simplest form of dropout, which is also called standard dropout, Γ(l) is a vector of independent Bernoulli random variables, each of which has probability pl of being 1 and 1 pl of being 0. This corresponds to dropping each of the weights independently with probability pl. 2.2 DROPOUT TRAINING The standard dropout neural networks can be trained using stochastic gradient decent (SGD), with a sub-network sampled by dropping neural units for each training instance in a mini-batch. Forward and backward pass for that training instance are done only on the sampled sub-network. Intuitively, dropout aims at, simultaneously and jointly, training an ensemble of exponentially many neural networks (one for each configuration of dropped units) while sharing the same weights/parameters. The goal of the stochastic training procedure of dropout can be understood as minimizing an expected loss function, after marginalizing out the dropout variables (Srivastava, 2013; Wang & Manning, 2013). In the context of maximal likelihood estimation, dropout training can be formulated as: θ = argmin θ ESD[ l(D, SD; θ)] = argmin θ ESD h i=1 log p(yi|xi, Si; θ) i , (2) where recall that D is the training sample, SD = {S1, . . . , SN} is the dropout variable (one for each training instance), and l(D, SD; θ) is the (conditional) log-likelihood function defined by the conditional distribution p(y|x, s; θ) of output y given input x, under parameter θ and dropout variable s. Throughout we use the notation EZ to denote the conditional expectation where all random variables except Z are conditioned on. Dropout has also been shown to work well with regularization, such as L2 weight decay (Tikhonov, 1943), Lasso (Tibshirani, 1996), KL-sparsity(Bradley & Bagnell, 2008; Hinton, 2010), and max-norm regularization (Srebro et al., 2004), among which the max-norm regularization that constrains the norm of the incoming weight matrix to be bounded by some constant was found to be especially useful for dropout (Srivastava, 2013; Srivastava et al., 2014). 2.3 DROPOUT INFERENCE AND GAP As mentioned before, dropout is effectively training an ensemble of neural networks with weight sharing. Consequently, at test time, the output of each network in the ensemble should be averaged to deliver the final prediction. This averaging over exponentially many sub-networks is, however, intractable, and standard dropout typically implements an approximation by introducing a deterministic scaling factor for each layer to replace the random dropout variable: ES[H(L)(x, S; θ)] ? h(L)(x, E[S]; θ), (3) where the right-hand side is the output of a single deterministic neural network whose weights are scaled to match the expected number of active hidden units on the left-hand side. Importantly, the right-hand side can be easily computed since it only involves a single deterministic network. Bulò et al. (2016) combined dropout with knowledge distillation methods (Hinton et al., 2015) to better approximate the averaging processing of the left-hand side. However, the quality of the approximation in (3) is largely unknown, and to our best knowledge, no attempt has been made to explicitly control this inference gap. The main goal of this work is to explicitly quantify, algorithmically control, and experimentally demonstrate the inference gap in (3), in the hope of improving the generalization performance of DNNs eventually. To this end, in the next section we first present a latent variable model interpretation of dropout, which will greatly facilitate our later theoretical analysis. 3 DROPOUT AS LATENT VARIABLE MODELS With the end goal of studying the inference gap in (3) in mind, in this section, we first formulate dropout neural networks as a latent variable model (LVM) in 3.1. Then, we point out the relation between the training procedure of LVM and that of standard dropout in 3.2. The advantage of formulating dropout as a LVM is that we need only deal with a single model (with latent structure), instead of an ensemble of exponentially many different models (with weight sharing). This much Published as a conference paper at ICLR 2017 simplified view of dropout enables us to understand and analyze the model parameter θ in a much more straightforward and intuitive way. 3.1 AN LVM FORMULATION OF DROPOUT A latent variable model consists of two types of variables: the observed variables that represent the empirical (observed) data and the latent variables that characterize the hidden (unobserved) structure. To formulate dropout as a latent variable model, the input x and output y are regarded as observed variables, while the dropout variable s, representing the sub-network structure, is hidden. Then, upon fixing the input space X, the output space Y, and the latent space S for dropout variables, the conditional probability of y given x under parameter θ can be written as p(y|x; θ) = Z S p(y|x, s; θ)p(s)dµ(s), (4) where p(y|x, s; θ) is the conditional distribution modeled by the neutral network with configuration s (same as in Eq. (2)), p(s) is the distribution of dropout variable S (e.g. Bernoulli), here assumed to be independent of the input x, and µ(s) is the base measure on the space S. 3.2 LVM DROPOUT TRAINING VS. STANDARD DROPOUT TRAINING Building on the above latent variable model formulation (4) of dropout, we are now ready to point out a simple relation between the training procedure of LVM and that of standard dropout. Given an i.i.d. training sample D, the maximum likelihood estimate for the LVM formulation of dropout in (4) is equivalent to minimizing the following negative log-likelihood function: θ = argmin θ l(D; θ) = argmin θ i=1 log p(yi|xi; θ), (5) where p(y|x; θ) is given in Eq. (4). Recall the dropout training objective ESD[ l(D, SD; θ)] in Eq. (2). We have the following theorem as a simple consequence of Jensen s inequality (details in Appendix A): Theorem 1. The expected loss function of standard dropout (Eq. (2)) is an upper bound of the negative log-likelihood of LVM dropout (Eq. (5)): l(D; θ) ESD[ l(D, SD; θ)]. (6) Theorem 1, in a rigorous sense, justifies dropout training as a convenient and tractable approximation of the LVM formulation in (4). Indeed, since directly minimizing the marginalized negative loglikelihood in (5) may not be easy, a standard practice is to replace the marginalized (conditional) likelihood p(y|x; θ) in (4) with its empirical Monte carlo average through drawing samples from the dropout variable S. The dropout training objective in (2) corresponds exactly to this Monte carlo approximation when a single sample Si is drawn for each training instance (xi, yi). Importantly, we note that the above LVM formulation involves only a single network parameter θ, which largely simplifies the picture and facilitates our subsequent analysis. 4 EXPECTATION-LINEAR DROPOUT NEURAL NETWORKS Building on the latent variable model formulation in 3, we introduce in this section the notion of expectation-linearity that essentially measures the inference gap in (3). We then characterize a general class of neural networks that exhibit expectation-linearity, either exactly or approximately over a distribution p(x) on the input space. We start with defining expectation-linearity in the simplest single-layer neural network, then we extend the notion into general deep networks in a natural way. Definition 1 (Expectation-linear Layer). A network layer h = f(x γ; θ) is expectation-linear with respect to a set X X, if for all x X we have E[f(x Γ; θ)] f(x E[Γ]; θ) 2 = 0. (7) In this case we say that X is expectation-linearizable, and θ is expectation-linearizing w.r.t X . Published as a conference paper at ICLR 2017 Obviously, the condition in (7) will guarantee no gap in the dropout inference approximation (3) an admittedly strong condition that we will relax below. Clearly, if f is an affine function, then we can choose X = X and expectation-linearity is trivial. Note that expectation-linearity depends on the network parameter θ and the dropout distribution Γ. Expectation-linearity, as defined in (7), is overly strong: under standard regularity conditions, essentially the transformation function f has to be affine over the set X , ruling out for instance the popular sigmoid or tanh activation functions. Moreover, in practice, downstream use of DNNs are usually robust to small errors resulting from approximate expectation-linearity (hence the empirical success of dropout), so it makes sense to define an inexact extension. We note also that the definition in (7) is uniform over the set X , while in a statistical setting it is perhaps more meaningful to have expectation-linearity on average, since inputs from lower density regions are not going to play a significant role anyway. Taking into account the aforementioned motivations, we arrive at the following inexact extension: Definition 2 (Approximately Expectation-linear Layer). A network layer h = f(x γ; θ) is δ-approximately expectation-linear with respect to a distribution p(x) over X if EX h EΓ f(X Γ; θ)|X f(X E[Γ]; θ) 2 In this case we say that p(x) is δ-approximately expectation-linearizable, and θ is δ-approximately expectation-linearizing. To appreciate the power of cutting some slack from exact expectation-linearity, we remark that even non-affine activation functions often have approximately linear regions. For example, the logistic function, a commonly used non-linear activation function in DNNs, is approximately linear around the origin. Naturally, we can ask whether it is sufficient for a target distribution p(x) to be well-approximated by an approximately expectation-linearizable one. We begin by providing an appropriate measurement of the quality of this approximation. Definition 3 (Closeness, (Andreas et al., 2015)). A distribution p(x) is C-close to a set X X if E h inf x X sup γ S X γ x γ 2 i C, (9) where recall that S is the (bounded) space that the dropout variable lives in. Intuitively, p(x) is C-close to a set X if a random sample from p is no more than a distance C from X in expectation and under the worst dropout perturbation . For example, a standard normal distribution is close to an interval centering at origin ([ α, α]) with some constant C. Our definition of closeness is similar to that in Andreas et al. (2015), who used this notion to analyze self-normalized log-linear models. We are now ready to state our first major result that quantifies approximate expectation-linearity of a single-layered network (proof in Appendix B.1): Theorem 2. Given a network layer h = f(x γ; θ), where θ is expectation-linearizing w.r.t. X X. Suppose p(x) is C-close to X and for all x X, xf(x) op B, where op is the usual operator norm. Then, p(x) is 2BC-approximately expectation-linearizable. Roughly, Theorem 2 states that the input distribution p(x) that place most of its mass on regions close to expectation-linearizable sets are approximately expectation-linearizable on a similar scale. The bounded operator norm assumption on the derivative f is satisfied in most commonly used layers. For example, for a fully connected layer with weight matrix W, bias vector b, and activation function σ, f( ) op = |σ ( )| W op is bounded by W op and the supremum of |σ ( )| (1/4 when σ is sigmoid and 1 when σ is tanh). Next, we extend the notion of approximate expectation-linearity to deep dropout neural networks. Definition 4 (Approximately Expectation-linear Network). A deep neural network with L layers (cf. Eq. (1)) is δ-approximately expectation-linear with respect to p(x) over X if EX h ES H(L)(X, S; θ)|X h(L)(X, E[S]; θ) 2 i < δ. (10) where h(L)(X, E[S]; θ) is the output of the deterministic neural network in standard dropout. Published as a conference paper at ICLR 2017 Lastly, we relate the level of approximate expectation-linearity of a deep neural network to the level of approximate expectation-linearity of each of its layers: Theorem 3. Given an L-layer neural network as in Eq. (1), and suppose that each layer l {1, . . . , L} is δ-approximately expectation-linear w.r.t. p(h(l)), E[Γ(l)] γ, supx fl(x) op B, and E Var[H(l)|X] σ2. Then the network is -approximately expectation-linear with = (Bγ)L 1δ + (δ + Bγσ) 1 (Bγ)L 1 From Theorem 3 (proof in Appendix B.2) we observe that the level of approximate expectationlinearity of the network mainly depends on four factors: the level of approximate expecatationlinearity of each layer (δ), the expected variance of each layer (σ), the operator norm of the derivative of each layer s transformation function (B), and the mean of each layer s dropout variable (γ). In practice, γ is often a constant less than or equal to 1. For example, if Γ Bernoulli(p), then γ = p. According to the theorem, the operator norm of the derivative of each layer s transformation function is an important factor in the level of approximate expectation-linearity: the smaller the operator norm is, the better the approximation. Interestingly, the operator norm of a layer often depends on the norm of the layer s weight (e.g. for fully connected layers). Therefore, adding max-norm constraints to regularize dropout neural networks can lead to better approximate expectation-linearity hence smaller inference gap and the often improved model performance. It should also be noted that when Bγ < 1, the approximation error tends to be a constant when the network becomes deeper. When Bγ = 1, grows linearly with L, and when Bγ > 1, the growth of becomes exponential. Thus, it is essential to keep Bγ < 1 to achieve good approximation, particularly for deep neural networks. 5 EXPECTATION-LINEAR REGULARIZED DROPOUT In the previous section we have managed to bound the approximate expectation-linearity, hence the inference gap in (3), of dropout neural networks. In this section, we first prove a uniform deviation bound of the sampled approximate expectation-linearity measure from its mean, which motivates adding the sampled (hence computable) expectation-linearity measure as a regularization scheme to standard dropout, with the goal of explicitly controlling the inference gap of the learned parameter, hence potentially improving the performance. Then we give the upper bounds on the loss in accuracy due to expectation-linearization, and describe classes of distributions that expectation-linearize easily. 5.1 A UNIFORM DEVIATION BOUND FOR THE SAMPLED EXPECTATION-LINEAR MEASURE We now show that an expectation-linear network can be found by expectation-linearizing the network on the training sample. To this end, we prove a uniform deviation bound between the empirical expectation-linearization measure using i.i.d. samples (Eq. (12)) and its mean (Eq. (13)). Theorem 4. Let H = h(L)(x, s; θ) : θ Θ denote a space of L-layer dropout neural networks indexed with θ, where h(L) : X S R and Θ is the space that θ lives in. Suppose that the neural networks in H satisfy the constraints: 1) x X, x 2 α; 2) l {1, . . . , L}, E(Γ(l)) γ and fl op B; 3) h(L) β. Denote empirical expectation-linearization measure and its mean as: ESi H(L)(Xi, Si; θ) h(L)(Xi, E[Si]; θ) 2, (12) = EX h ES H(L)(X, S; θ) h(L)(X, E[S]; θ) 2 Then, with probability at least 1 ν, we have sup θ Θ | ˆ | < 2αBL(γL/2 + 1) n + β From Theorem 4 (proof in Appendix C.1) we observe that the deviation bound decreases exponentially with the number of layers L when the operator norm of the derivative of each layer s transformation Published as a conference paper at ICLR 2017 function (B) is less than 1 (and the contrary if B 1). Importantly, the square root dependence on the number of samples (n) is standard and cannot be improved without significantly stronger assumptions. It should be noted that Theorem 4 per se does not imply anything between expectation-linearization and the model accuracy (i.e. how well the expectation-linearized neural network actually achieves on modeling the data). Formally studying this relation is provided in 5.3. In addition, we provide some experimental evidences in 6 on how improved approximate expectation-linearity (equivalently smaller inference gap) does lead to better empirical performances. 5.2 EXPECTATION-LINEARIZATION AS REGULARIZATION The uniform deviation bound in Theorem 4 motivates the possibility of obtaining an approximately expectation-linear dropout neural networks through adding the empirical measure (12) as a regularization scheme to the standard dropout training objective, as follows: loss(D; θ) = l(D; θ) + λV (D; θ), (15) where l(D; θ) is the negative log-likelihood defined in Eq. (5), λ > 0 is a regularization constant, and V (D; θ) measures the level of approximate expectation-linearity: V (D; θ) = 1 ESi H(L)(xi, Si; θ) h(L)(xi, E[Si]; θ) 2 2. (16) To solve (15), we can minimize loss(D; θ) via stochastic gradient descent as in standard dropout, and approximate V (D; θ) using Monte carlo: h(L)(xi, si; θ) h(L)(xi, E[Si]; θ) 2 2, (17) where si is the same dropout sample as in l(D; θ) for each training instance in a mini-batch. Thus, the only additional computational cost comes from the deterministic term h(L)(xi, E[Si]; θ). Overall, our regularized dropout (15), in its Monte carlo approximate form, is as simple and efficient as the standard dropout. 5.3 ON THE ACCURACY OF EXPECTATION-LINEARIZED MODELS So far our discussion has concentrated on the problem of finding expectation-linear neural network models, without any concerns on how well they actually perform at modeling the data. In this section, we characterize the trade-off between maximizing data likelihood and satisfying an expectationlinearization constraint. To achieve the characterization, we measure the likelihood gap between the classical maximum likelihood estimator (MLE) and the MLE subject to a expectation-linearization constraint. Formally, given training data D = {(x1, y1), . . . , (xn, yn)}, we define ˆθ = argmin θ Θ l(D; θ) (18) ˆθδ = argmin θ Θ,V (D;θ) δ l(D; θ) (19) where l(D; θ) is the negative log-likelihood defined in Eq. (5), and V (D; θ) is the level of approximate expectation-linearity in Eq. (16). We would like to control the loss of model accuracy by obtaining a bound on the likelihood gap defined as: l(ˆθ, ˆθδ) = 1 n(l(D; ˆθ) l(D; ˆθδ)) (20) In the following, we focus on neural networks with softmax output layer for classification tasks. p(y|x, s; θ) = h(L) y (x, s; θ) = f L(h(L 1)(x, s); η) = eηT y h(L 1)(x,s) P y Y eηT y h(L 1)(x,s) (21) where θ = {θ1, . . . , θL 1, η}, Y = {1, . . . , k} and η = {ηy : y Y}. We claim: Published as a conference paper at ICLR 2017 Theorem 5. Given an L-layer neural network h(L)(x, s; θ) with softmax output layer in (21), where parameter θ Θ, dropout variable s S, input x X and target y Y. Suppose that for every x and s, p(y|x, s; ˆθ) makes a unique best prediction that is, for each x X, s S, there exists a unique y Y such that y = y , ˆηT y h(L 1)(x, s) < ˆηT y h(L 1)(x, s). Suppose additionally that x, s, h(L 1)(x, s; ˆθ) β, and y, p(y|x; ˆθ) > 0. Then l(ˆθ, ˆθδ) c1β2 ˆη 2 δ 2 e c2δ/4β (22) where c1 and c2 are distribution-dependent constants. From Theorem 5 (proof in Appendix C.2) we observe that, at one extreme, distributions closed to deterministic can be expectation-linearized with little loss of likelihood. What about the other extreme distributions as close to uniform distribution as possible ? With suitable assumptions about the form of p(y|x, s; ˆθ) and p(y|x; ˆθ), we can achieve an accuracy loss bound for distributions that are close to uniform: Theorem 6. Suppose that x, s, h(L 1)(x, s; ˆθ) β. Additionally, for each (xi, yi) D, s S, log 1 k log p(yi|xi, s; ˆθ) 1 y Y log p(y|xi, s; ˆθ). Then asymptotically as n : l(ˆθ, ˆθδ) 1 δ 4β ˆη 2 E [KL (p( |X; θ) Unif(Y))] (23) Theorem 6 (proof in Appendix C.3) indicates that uniform distributions are also an easy class for expectation-linearization. The next question is whether there exist any classes of conditional distributions p(y|x) for which all distributions are provably hard to expectation-linearize. It remains an open problem and might be an interesting direction for future work. 6 EXPERIMENTS In this section, we evaluate the empirical performance of the proposed regularized dropout in (15) on a variety of network architectures for the classification task on three benchmark datasets MNIST, CIFAR-10 and CIFAR-100. We applied the same data preprocessing procedure as in Srivastava et al. (2014). To make a thorough comparison and provide experimental evidence on how the expectationlinearization interacts with the predictive power of the learned model, we perform experiments of Monte Carlo (MC) dropout, which approximately computes the final prediction (left-hand side of (3)) via Monte Carlo sampling, w/o the proposed regularizer. In the case of MC dropout, we average m = 100 predictions using randomly sampled configurations. In addition, the network architectures and hyper-parameters for each experiment setup are the same as those in Srivastava et al. (2014), unless we explicitly claim to use different ones. Following previous works, for each data set We held out 10,000 random training images for validation to tune the hyper-parameters, including λ in Eq. (15). When the hyper-parameters are fixed, we train the final models with all the training data, including the validation data. A more detailed description of the conducted experiments can be provided in Appendix D. For each experiment, we report the mean test errors with corresponding standard deviations over 5 repetitions. The MNIST dataset (Le Cun et al., 1998) consists of 70,000 handwritten digit images of size 28 28, where 60,000 images are used for training and the rest for testing. This task is to classify the images into 10 digit classes. For the purpose of comparison, we train 6 neural networks with different architectures. The experimental results are shown in Table 1. 6.2 CIFAR-10 AND CIFAR-100 The CIFAR-10 and CIFAR-100 datasets (Krizhevsky, 2009) consist of 60,000 color images of size 32 32, drawn from 10 and 100 categories, respectively. 50,000 images are used for training and the Published as a conference paper at ICLR 2017 Table 1: Comparison of classification error percentage on test data with and without using expectationlinearization on MNIST, CIFAR-10 and CIFAR-100, under different network architectures (with standard deviations for 5 repetitions). w.o. EL w. EL Data Architecture Standard MC Standard MC 3 dense,1024,logistic 1.23 0.03 1.06 0.02 1.07 0.02 1.06 0.03 3 dense,1024,relu 1.19 0.02 1.04 0.02 1.03 0.02 1.05 0.03 3 dense,1024,relu+max-norm 1.05 0.03 1.02 0.02 0.98 0.03 1.02 0.02 3 dense,2048,relu+max-norm 1.07 0.02 1.00 0.02 0.94 0.02 0.97 0.03 2 dense,4096,relu+max-norm 1.03 0.02 0.92 0.03 0.90 0.02 0.93 0.02 2 dense,8192,relu+max-norm 0.99 0.02 0.96 0.02 0.87 0.02 0.92 0.03 CIFAR-10 3 conv+2 dense,relu+max-norm 12.82 0.10 12.16 0.12 12.20 0.14 12.21 0.15 CIFAR-100 3 conv+2 dense,relu+max-norm 37.22 0.22 36.01 0.21 36.25 0.12 36.10 0.18 Figure 1: Error rate and empirical expectation-linearization risk relative to λ. rest for testing. The neural network architecture we used for these two datasets has 3 convolutional layers, followed by two fully-connected (dense) hidden layers (again, same as that in Srivastava et al. (2014)). The experimental results are recorded in Table 1, too. From Table 1 we can see that on MNIST data, dropout network training with expectation-linearization outperforms standard dropout on all 6 neural architectures. On CIFAR data, expectation-linearization reduces error rate from 12.82% to 12.20% for CIFAR-10, achieving 0.62% improvement. For CIFAR-100, the improvement in terms of error rate is 0.97% with reduction from 37.22% to 36.25%. From the results we see that with or without expectation-linearization, the MC dropout networks achieve similar results. It illustrates that by achieving expectation-linear neural networks, the predictive power of the learned models has not degraded significantly. Moreover, it is interesting to see that with the regularization, on MNIST dataset, standard dropout networks achieve even better accuracy than MC dropout. It may be because that with expectation-linearization, standard dropout inference achieves better approximation of the final prediction than MC dropout with (only) 100 samples. On CIFAR datasets, MC dropout networks achieve better accuracy than the ones with the regularization. But, obviously, MC dropout requires much more inference time than standard dropout (MC dropout with m samples requires about m times the inference time of standard dropout). 6.3 EFFECT OF REGULARIZATION CONSTANT λ In this section, we explore the effect of varying the hyper-parameter for the expectation-linearization rate λ. We train the network architectures in Table 1 with the λ value ranging from 0.1 to 10.0. Figure 1 shows the test errors obtained as a function of λ on three datasets. In addition, Figure 1, middle and right panels, also measures the empirical expectation-linearization risk ˆ of Eq. (12) with varying λ on CIFAR-10 and CIFAR-100, where ˆ is computed using Monte carlo with 100 independent samples. From Figure 1 we can see that when λ increases, better expectation-linearity is achieved (i.e. ˆ decreases). The model accuracy, however, has not kept growing with increasing λ, showing that in practice considerations on the trade-off between model expectation-linearity and accuracy are needed. Published as a conference paper at ICLR 2017 Table 2: Comparison of test data errors using standard dropout, Monte Carlo dropout, standard dropout with our proposed expectation-linearization, and recently proposed dropout distillation on CIFAR-10 and CIFAR-100 under All Conv, (with standard deviations for 5 repetitions). Data Network Standard MC w. EL Distillation CIFAR-10 All Conv 11.18 0.11 10.58 0.21 10.86 0.08 10.81 0.14 CIFAR-100 All Conv 35.50 0.23 34.43 0.25 35.10 0.13 35.07 0.20 6.4 COMPARISON WITH DROPOUT DISTILLATION To make a thorough empirical comparison with the recently proposed Dropout Distillation method (Bulò et al., 2016), we also evaluate our regularization method on CIFAR-10 and CIFAR-100 datasets with the All Convolutional Network (Springenberg et al., 2014) (All Conv). To facilitate comparison, we adopt the originally reported hyper-parameters and the same setup for training. Table 2 gives the results comparison the classification error percentages on test data under All Conv using standard dropout, Monte Carlo dropout, standard dropout with our proposed expectationlinearization, and recently proposed dropout distillation on CIFAR-10 and CIFAR-100 1. According to Table 2, our proposed expectation-linear regularization method achieves comparable performance to dropout distillation. 7 CONCLUSIONS In this work, we attempted to establish a theoretical basis for the understanding of dropout, motivated by controlling the gap between dropout s training and inference phases. Through formulating dropout as a latent variable model and introducing the notion of (approximate) expectation-linearity, we have formally studied the inference gap of dropout, and introduced an empirical measure as a regularization scheme to explicitly control the gap. Experiments on three benchmark datasets demonstrate that reducing the inference gap can indeed improve the end performance. In the future, we intend to formally relate the inference gap to the generalization error of the underlying network, hence providing further justification of regularized dropout. ACKNOWLEDGEMENTS This research was supported in part by DARPA grant FA8750-12-2-0342 funded under the DEFT program. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA. Jacob Andreas, Maxim Rabinovich, Michael I Jordan, and Dan Klein. On the accuracy of selfnormalized log-linear models. In Advances in Neural Information Processing Systems, pp. 1774 1782, 2015. Pierre Baldi and Peter Sadowski. The dropout learning algorithm. Artificial intelligence, 210:78 122, 2014. Pierre Baldi and Peter J Sadowski. Understanding dropout. In Advances in Neural Information Processing Systems, pp. 2814 2822, 2013. David M Bradley and J Andrew Bagnell. Differential sparse coding. 2008. Samuel Rota Bulò, Lorenzo Porzi, and Peter Kontschieder. Dropout distillation. In Proceedings of The 33rd International Conference on Machine Learning, pp. 99 107, 2016. 1We obtained similar results as that reported in Table 1 of Bulò et al. (2016) on CIFAR-10 corpus, while we cannot reproduce comparable results on CIFAR-100 (around 3% worse) Published as a conference paper at ICLR 2017 Ning Chen, Jun Zhu, Jianfei Chen, and Bo Zhang. Dropout training for support vector machines. In Proceedings Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014. Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Insights and applications. In Deep Learning Workshop, ICML, 2015. Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems, 2016. Wei Gao and Zhi-Hua Zhou. Dropout rademacher complexity of deep neural networks. ar Xiv preprint ar Xiv:1402.3811, 2014. David P Helmbold and Philip M Long. On the inductive bias of dropout. ar Xiv preprint ar Xiv:1412.4736, 2014. David P Helmbold and Philip M Long. Fundamental differences between dropout and weight decay in deep networks. ar Xiv preprint ar Xiv:1602.04484, 2016. Geoffrey Hinton. A practical guide to training restricted boltzmann machines. Momentum, 9(1):926, 2010. Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. ar Xiv preprint ar Xiv:1503.02531, 2015. Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. ar Xiv preprint ar Xiv:1207.0580, 2012. Prateek Jain, Vivek Kulkarni, Abhradeep Thakurta, and Oliver Williams. To drop or not to drop: Robustness, consistency and differential privacy properties of dropout. ar Xiv preprint ar Xiv:1503.02031, 2015. Diederik P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pp. 2575 2583, 2015. Alex Krizhevsky. Learning multiple layers of features from tiny images, 2009. Yann Le Cun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998. Yann Le Cun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521:436 444, 2015. Xuezhe Ma and Eduard Hovy. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of ACL-2016, pp. 1064 1074, Berlin, Germany, August 2016. Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85 117, 2015. Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. ar Xiv preprint ar Xiv:1412.6806, 2014. Nathan Srebro, Jason Rennie, and Tommi S Jaakkola. Maximum-margin matrix factorization. In Advances in neural information processing systems, pp. 1329 1336, 2004. Nitish Srivastava. Improving neural networks with dropout. Ph D thesis, University of Toronto, 2013. Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929 1958, 2014. Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pp. 267 288, 1996. Andrey Nikolayevich Tikhonov. On the stability of inverse problems. In Dokl. Akad. Nauk SSSR, volume 39, pp. 195 198, 1943. Published as a conference paper at ICLR 2017 Stefan Wager, Sida Wang, and Percy S Liang. Dropout training as adaptive regularization. In Advances in neural information processing systems, pp. 351 359, 2013. Stefan Wager, William Fithian, Sida Wang, and Percy S Liang. Altitude training: Strong bounds for single-layer dropout. In Advances in Neural Information Processing Systems, pp. 100 108, 2014. Sida Wang and Christopher Manning. Fast dropout training. In Proceedings of the 30th International Conference on Machine Learning, pp. 118 126, 2013. Published as a conference paper at ICLR 2017 APPENDIX: DROPOUT WITH EXPECTATION-LINEAR REGULARIZATION A LVM DROPOUT TRAINING VS. STANDARD DROPOUT TRAINING Proof of Theorem 1 ESD[l(D, SD; θ)] = R i=1 p(si) N P i=1 log p(yi|xi, si; θ) dµ(s1) . . . dµ(s N) S p(si) log p(yi|xi, si; θ)dµ(si) Because log( ) is a concave function, from Jensen s Inequality, Z S p(s) log p(y|x, s; θ)dµ(s) log Z S p(s)p(y|x, s; θ)dµ(s) ESD[ l(D, SD; θ)] S p(si)p(yi|xi, si; θ)dµ(si) = l(D; θ). B EXPECTATION-LINEAR DROPOUT NEURAL NETWORKS B.1 PROOF OF THEOREM 2 Proof. Let γ = E[Γ], and A = {x : E[f(x Γ; θ)] f(x γ ; θ) 2 = 0} Let X = argmin x A sup γ S X γ x γ 2, and X = X X . Then, X γ = X γ + X γ In the following, we omit the parameter θ for convenience. Moreover, we denote EΓ f(X Γ; θ) = E f(X Γ; θ)|X From Taylor Series, there exit some X , X X satisfy that f(X Γ) = f(X Γ) + f (X Γ)(X Γ) f(X γ ) = f(X γ ) + f (X γ )(X γ ) where we denote f (x) = ( xf(x))T . Then, EΓ[f(X Γ) f(X γ )] = EΓ[f(X Γ + X Γ) f(X γ + X γ )] = EΓ[f(X Γ) f(X γ ) + f (X Γ)(X Γ) f (X γ )(X γ )] = EΓ[f(X Γ) f(X γ )] + EΓ[f (X Γ)(X Γ) f (X γ )(X γ )] Since X A, we have EΓ[f(X Γ) f(X γ )] = 0. EΓ[f(X Γ) f(X γ )] = EΓ[f (X Γ)(X Γ) f (X γ )(X γ )] = EΓ[(f (X Γ) f (X γ ))(X Γ)] + EΓ[f (X γ )(X Γ X γ )] = EΓ[(f (X Γ) f (X γ ))(X Γ)] Published as a conference paper at ICLR 2017 Then, EΓ[f(X Γ)] f(X γ ) 2 = EΓ[(f (X Γ) f (X γ ))(X Γ)] 2 Since X γ 2 sup γ S X γ 2 = inf x A sup γ S X γ x γ 2, and from Jensen s inequality and property of operator norm, EΓ[(f (X Γ) f (X γ ))(X Γ)] 2 EΓ h f (X Γ) f (X γ ) op X Γ 2 i 2BEΓ h X Γ 2 i 2B inf x A sup γ S X γ x γ 2 Finally we have, EΓ[(f (X Γ) f (X γ ))(X Γ)] 2 2BE inf x A sup γ S X γ x γ 2 B.2 PROOF OF THEOREM 3 Proof. Induction on the number of the layers L. As before, we omit the parameter θ. Initial step: when L = 1, the statement is obviously true. Induction on L: Suppose that the statement is true for neural networks with L layers. Now we prove the case L + 1. From the inductive assumption, we have, EX h ESL H(L)(X, SL) h(L)(X, E[SL]) 2 where SL = {Γ(1), . . . , Γ(L)} is the dropout random variables for the first L layers, and L = (Bγ)L 1δ + (δ + Bγσ) 1 (Bγ)L 1 In addition, the L + 1 layer is δ-approximately expectation-linear, we have: EH(L) h EΓ(L+1) f L+1(H(L) Γ(L+1)) f L+1(H(L) γ(L+1)) 2 Let E[Γ(l)] = γ(l), l {1, . . . , L + 1}, and let H(l) and h(l) be short for H(l)(X, Sl) and h(l)(X, E(Sl)), respectively, when there is no ambiguity. Moreover, we denote ES H(L)(X, S; θ) = ES H(L)(X, S; θ) X for convenience. Then, EX h ESL+1 H(L+1) h(L+1) 2 ESL h EΓ(L+1) f L+1(H(L) Γ(L+1)) f L+1(h(L) γ(L+1)) i +ESL h f L+1(H(L) γ(L+1)) i f L+1(h(L) γ(L+1)) 2 ESL h EΓ(L+1) f L+1(H(L) Γ(L+1)) f L+1(h(L) γ(L+1)) i 2 ESL h f L+1(H(L) γ(L+1)) i f L+1(h(L) γ(L+1)) 2 From Eq. 2 and Jensen s inequality, we have ESL h EΓ(L+1) f L+1(H(L) Γ(L+1)) f L+1(h(L) γ(L+1)) i 2 EH(L) EΓ(L+1) f L+1(H(L) Γ(L+1)) f L+1(h(L) γ(L+1)) 2 Published as a conference paper at ICLR 2017 ESL h f L+1(H(L) γ(L+1)) i f L+1(h(L) γ(L+1)) 2 ESL h f L+1(H(L) γ(L+1)) i f L+1(ESL H(L) γ(L+1)) +f L+1(ESL H(L) γ(L+1)) f L+1(h(L) γ(L+1)) 2 ESL h f L+1(H(L) γ(L+1)) i f L+1(ESL H(L) γ(L+1)) 2 f L+1(ESL H(L) γ(L+1)) f L+1(h(L) γ(L+1)) 2 Using Jensen s inequality, property of operator norm and E Var[H(l)|X] σ2, we have ESL h f L+1(H(L) γ(L+1)) i f L+1(ESL H(L) γ(L+1)) 2 EH(L) f L+1(H(L) γ(L+1)) f L+1(ESL H(L) γ(L+1)) 2 BγEH(L) h H(L) ESL H(L) 2 Bγ EH(L) h H(L) ESL H(L) 2 2 f L+1(ESL H(L) γ(L+1)) f L+1(h(L) γ(L+1)) 2 = BγEX h ESL H(L) h(L) 2 Finally, to sum up with Eq. 3, Eq. 4, , Eq. 5, , Eq. 6, we have EX h ESL+1 H(L+1) h(L+1) 2 δ + Bγσ + Bγ L = (Bγ)Lδ + (δ + Bγσ) 1 (Bγ)L C EXPECTATION-LINEARIZATION C.1 PROOF OF THEOREM 4: UNIFORM DEVIATION BOUND Before proving Theorem 4, we first define the notations. Let Xn = {X1, . . . , Xn} be a set of n samples of input X. For a function space F : X R, we use Radn(F, Xn) to denote the empirical Rademacher complexity of F, Radn(F, Xn) = Eσ i=1 σif(Xi) and the Rademacher complexity is defined as Radn(F) = EXn h Radn(F, Xn) i In addition, we import the definition of dropout Rademacher complexity from Gao & Zhou (2014): Rn(H, Xn, Sn) = Eσ i=1 σih(Xi, Si) Rn(H) = EXn,Sn h Radn(H, Xn, Sn) i Published as a conference paper at ICLR 2017 where H : X S R is a function space defined on input space X and dropout variable space S. Rn(H, Xn, Sn) and Rn(H) are the empirical dropout Rademacher complexity and dropout Rademacher complexity, respectively. We further denote Rn(H, Xn) = ESn h Radn(H, Xn, Sn) i . Now, we define the following function spaces: F = f(x; θ) : f(x; θ) = ES h H(L)(x, S; θ) i , θ Θ G = g(x; θ) : g(x; θ) = h(L)(x, E[S]; θ), θ Θ H = h(x, s; θ) : h(x, s; θ) = h(L)(x, s; θ), θ Θ Then, the function space of v(x) = f(x) g(x) is V = {f(x) g(x) : f F, g G}. Lemma 7. Radn(F, Xn) Rn(H, Xn) Rn(H, Xn) = ESn h Radn(H, Xn, Sn) i = ESn Eσ h sup h H i=1 σih(Xi, Si) i ESn h sup h H i=1 σih(Xi, Si) i sup h H ESn h 1 n i=1 σih(Xi, Si) i i=1 σi ESi h(Xi, Si) i=1 σi ESi H(L)(Xi, Si; θ) = Radn(F, Xn) From Lemma 7, we have Radn(F) Rn(H). Lemma 8. Rn(H) αBLγL/2 n Radn(G) αBL Proof. See Theorem 4 in Gao & Zhou (2014). Now, we can prove Theorem 4. Proof of Theorem 4 Proof. From Rademacher-based uniform bounds theorem, with probability 1 δ, sup v V | ˆ | < 2Radn(V) + β Since V = F G, we have Radn(V) = Radn(F G) Radn(F) + Radn(G) αBL(γL/2 + 1) n Then, finally, we have that with probability 1 δ, sup θ Θ | ˆ | < 2αBL(γL/2 + 1) n + β Published as a conference paper at ICLR 2017 C.2 PROOF OF THEOREM 5: NON-UNIFORM BOUND OF MODEL ACCURACY For convenience, we denote λ = {θ1, . . . , θL 1}. Then θ = {λ, η}, and MLE ˆθ = {ˆλ, ˆη} Lemma 9. f L( ; η)T op 2 η 2 (7) Proof. denote A = f L( ; η)T = py(ηy η)T k where py = p(y|x, s; θ), η = E [ηY ] = k P For each v such that v 2 = 1, py (ηy η)T v 2 P y Y py (ηy η) 2 2 v 2 2 = P y Y py (ηy η) 2 2 y Y py ηy η 2 2 P y Y py ηy 2 2 y Y py ηy 2 2 4 η 2 2 So we have A op 2 η 2. Lemma 10. If parameter θ = {ˆλ, η} satisfies that η 2 δ 4β , then V (D; θ) δ, where V (D; θ) is defined in Eq. (16). Proof. Let SL = {Γ(1), . . . , Γ(L)}, and let H(l) and h(l) be short for H(l)(X, Sl; θ) and h(l)(X, E(Sl); θ), respectively. From lemma 9, we have f L(x; η) f L(y; η) 2 2 η 2 x y 2. Then, ESL HL h L 2 = ESL 1 f L(H(L 1); η) f L(h(L 1); η) 2 ESL 1 f L(H(L 1); η) f L(h(L 1); η) 2 2 η 2 H(L 1) h(L 1) 2 4β η 2 δ Lemma 10 says that we can get θ satisfying the expectation-linearization constrain by explicitly scaling down ˆη while keeping ˆλ. In order to prove Theorem 5, we make the following assumptions: The dimension of h(L 1) is d, i.e. h(L 1) Rd. Since y Y, p(y|x; ˆθ) > 0, we assume p(y|x; ˆθ) 1/b, where b |Y| = k. As in the body text, let p(y|x, s; ˆθ) be nonuniform, and in particular let ˆηT y h(L 1)(x, s; ˆλ) ˆηT y h(L 1)(x, s; ˆλ) > c ˆη 2, y = y . For convenience, we denote ηT h(L 1)(x, s; λ) = ηT uy(x, s; λ), where u T y (x, s; λ) = (v T 0 , . . . , v T k ) and vi = h(L 1)(x, s; λ) if i = y 0 otherwise To prove Theorem 5, we first prove the following lemmas. Lemma 11. If p(y|x; ˆθ) 1/b, then α [0, 1], for parameter θ = {ˆλ, αˆη}, we have p(y|x; θ) 1 Published as a conference paper at ICLR 2017 Proof. We define f(α) = (y|x, s; θ) = eαηT y h(L 1)(x,s;ˆλ) P y Y eαηT y h(L 1)(x,s;ˆλ) = eηT y h(L 1)(x,s;ˆλ) α eηT y h(L 1)(x,s;ˆλ) α Since Y = {1, . . . , k}, for fixed x X, s S, log f(α) is a concave function w.r.t α. Since b k, we have log f(α) (1 α) log f(0) + α log f(1) log b So we have x, s, p(y|x, s; θ) 1/b. Then p(y|x; θ) = ES h p(y|x, S; ˆθ) i 1 Lemma 12. if y is not the majority class, i.e. y = y , then for parameter θ = {ˆλ, αˆη} p(y|x, s, θ) e cα ˆη 2 p(y|x, s, θ) = eαˆηT uy P y Y eαˆηT uy eαˆηT uy eαˆηT uy e cα ˆη 2 Lemma 13. For a fixed x and s, the absolute value of the entry of the vector under the parameter θ = {ˆλ, αˆη}: |p(y|x, s; θ)(uy EY [u Y ])|i β(k 1)e cα ˆη 2 Proof. Suppose y is the majority class of p(y|x, s; θ). Then, uy Ey[u Y ] = (vy )k y =1 vy = (1 p(y|x, s; θ)h(L 1) if y = y p(y|x, s; θ)h(L 1) otherwise From Lemma 12, we have |p(y|x, s; θ)(uy EY [u Y ])|i |(uy EY [u Y ])|i β(k 1)e cα ˆη 2 Now, we suppose y is not the majority class of p(y|x, s; θ). Then, |p(y|x, s; θ)(uy EY [u Y ])|i p(y|x, s; θ)β βe cα ˆη 2 Overall, the lemma follows. Lemma 14. We denote the matrix A = ES h p(y|x,s; θ) p(y|x; θ) (uy EY [u Y ])(uy EY [u Y ])T i ES h p(y|x,s; θ) p(y|x; θ) (uy EY [u Y ]) i ES h p(y|x,s; θ) p(y|x; θ) (uy EY [u Y ]) i T Then the absolute value of the entry of A under the parameter θ = {ˆλ, αˆη}: |Aij| 2b(k 1)β2e cα ˆη 2 Published as a conference paper at ICLR 2017 Proof. From Lemma 11, we have p(y|x; θ) 1/b. Additionally, the absolute value of the entry of uy EY [u Y ] is bounded by β. We have for each i ES " p(y|x, s; θ) p(y|x; θ) (uy EY [u Y ]) " p(y|x, s; θ) p(y|x; θ) β Then from Lemma 13 |Aij| 2b(k 1)β2e cα ˆη 2 Lemma 15. We denote the matrix " p(y|x, s; θ) EY u Y u T Y EY [u Y ]EY [u Y ]T # Then the absolute value of the entry of B under the parameter θ = {ˆλ, αˆη}: |Bij| 2(k 1)β2e cα ˆη 2 Proof. We only need to prove that for fixed x and s, for each i, j: EY u Y u T Y EY [u Y ]EY [u Y ]T ij 2(k 1)β2e cα ˆη 2 EY u Y u T Y EY [u Y ]EY [u Y ]T ij = |Cov Y [(u Y )i, (u Y )j]| β2 k X y=1 p(y|x, s; θ) p(y|x, s; θ)2 Suppose y is the majority class. Then from Lemma 12, p(y|x, s; θ) p(y|x, s; θ)2 1 p(y|x, s; θ) (k 1)e cα ˆη 2 If y is not the majority class. Then, p(y|x, s; θ) p(y|x, s; θ)2 p(y|x, s; θ) e cα ˆη 2 So we have k X y=1 p(y|x, s; θ) p(y|x, s; θ)2 2(k 1)e cα ˆη 2 The lemma follows. Lemma 16. Under the parameter θ = {ˆλ, αˆη}, the largest eigenvalue of the matrix i=1 (A(xi, yi) B(xi, yi)) (8) is at most 2dk(k 1)(b + 1)β2e cα ˆη 2 Proof. From Lemma 14 and Lemma 15, each entry of the matrix in (8) is at most 2(k 1)(b + 1)β2e cα ˆη 2. Thus, by Gershgorin s theorem, the maximum eigenvalue of the matrix in (8) is at most 2dk(k 1)(b + 1)β2e cα ˆη 2. Now, we can prove Theorem 5 by constructing a scaled version of ˆθ that satisfies the expectationlinearization constraint. Published as a conference paper at ICLR 2017 Proof of Theorem 5 Proof. Consider the likelihood evaluated at θ = {ˆλ, αˆη}, where α = δ 4β ˆη 2 . If α > 1, then η 2 > δ 4β . We know the MLE ˆθ already satisfies the expectation-linearization constraint. So we can assume that 0 α 1, and we know that θ satisfies V (D; θ) δ. Then, l(ˆθ, ˆθδ) l(ˆθ, θ) = 1 n(l(D; ˆθ) l(D; θ)) = g(ˆλ, ˆη) g(ˆλ, αˆη) where g(λ, η) = 1 nl(D; (λ, η)). Taking the second-order Taylor expansion about η, we have g(ˆλ, αˆη) = g(ˆλ, ˆη) + T η g(ˆλ, ˆη)(αˆη ˆη) + (αˆη ˆη)T 2 ηg(ˆλ, ˆη)(αˆη ˆη) Since ˆθ is the MLE, the first-order term T η g(ˆλ, ˆη)(αˆη ˆη) = 0. The Hessian in the second-order term is just Eq.(8). Thus, from Lemma 16 we have g(ˆλ, αˆη) g(ˆλ, ˆη) (1 α)2 ˆη 2 22dk(k 1)(b + 1)β2e cα ˆη 2 = g(ˆλ, ˆη) 2dk(k 1)(b + 1)β2 ˆη 2 δ 4β 2 e cδ/4β = g(ˆλ, ˆη) c1β2 ˆη 2 δ 4β 2 e c2δ/4β with setting c1 = 2dk(k 1)(b + 1) and c2 = c. Then the theorem follows. C.3 PROOF OF THEOREM 6: UNIFORM BOUND OF MODEL ACCURACY In the following, we denote θ = {ˆλ, αˆη}. Lemma 17. For each y Y, if p(y|x, s; ˆθ) 1/k, then α [0, 1] p(y|x, s; θ) 1 Proof. This lemma can be regarded as a corollary of Lemma 11. Lemma 18. For a fixed x and s, we denote eˆηT y h(L 1)(x,s;ˆλ) = wy. Then we have p(y|x, s, θ) = eαˆηT y h(L 1)(x,s;ˆλ) P y Y eαˆηT y h(L 1)(x,s;ˆλ) = (wy)α P Additionally, we denote gs(α) = P y Y p(y |x, s; θ) log wy log wy. We assume gs(0) 0. Then we have α 0 gs(α) 0 y Y log wy p(y |x, s; θ) α = Var Y [log w Y |X x, S = s] 0 So gs(α) is non-decreasing. Since gs(0) 0, we have gs(α) 0 when α 0. From above lemma, we have for each training instance (xi, yi) D, and α [0, 1], EY h log p(Y |xi, s; θ) i log p(yi|xi, s; θ) (9) For convenience, we define m(s, y) = log p(y|x, s; θ) EY h log p(Y |x, s; θ) i Published as a conference paper at ICLR 2017 Lemma 19. If y satisfies Lemma 17 and gs(α) 0, then Var Y [m(s, Y )] m(s, y)2 Proof. First we have m(s, y) = log p(y|x, s; θ) log 1/k KL p( |x, s; θ)|Unif(Y) 0 (Var Y [m(s, Y )])1/2 = log p(Y |x, s; θ) EY h log p(Y |x, s; θ) i 2 EY h log p(Y |x, s; θ) EY h log p(Y |x, s; θ) i i = EY h KL p( |x, s; θ)|Unif(Y) + log 1/k log p(Y |x, s; θ) i = EY h KL p( |x, s; θ)|Unif(Y) + log 1/k log p(Y |x, s; θ) i KL p( |x, s; θ)|Unif(Y) + EY h log p(Y |x, s; θ) log 1/k i = 2KL p( |x, s; θ)|Unif(Y) As KL p( |x, s; θ)|Unif(Y) 0 and log p(y|x, s; θ) log 1/k. So we have 2KL p( |x, s; θ)|Unif(Y) KL p( |x, s; θ)|Unif(Y) +log 1/k log p(y|x, s; θ) = m(s, y) Then the lemma follows. From Lemma 19 and Eq. (9), we have for each training instance (xi, yi) D, and α [0, 1], Var Y [m(s, Y )] m(s, yi)2 (10) Lemma 20. For each training instance (xi, yi) D, and α [0, 1], we have log p(yi|xi; {ˆλ, αˆη}) (1 α) log p(yi|xi; {ˆλ, 0}) + α log p(yi|xi; {ˆλ, ˆη}) Proof. We define f(α) = log p(yi|xi; {ˆλ, αˆη}) (1 α) log p(yi|xi; {ˆλ, 0}) α log p(yi|xi; {ˆλ, ˆη}) Because f(0) = f(1) = 0, we only need to prove that f(α) is concave on [0, 1]. We have 2f(α) = ES|Y =yi [Var Y [m(S, Y )]] + Var S|Y =yi [m(S, yi)] where S|Y = yi is under the probability distribution p(s|Y = yi, xi; θ) = p(yi|xi,S; θ)p(s) p(yi|xi; θ) From Eq. (10), we have ES|Y =yi [Var Y [m(S, Y )]] ES|Y =yi m(S, yi)2 Var S|Y =yi [m(S, yi)] So we have 2f(α) 0. The lemma follows. Now, we can prove Theorem 6 by using the same construction of an expectation-linearizing parameter as in Theorem 5. Proof of Theorem 6 Proof. Consider the same parameter θ = {ˆλ, αˆη}, where α = δ 4β ˆη 2 1. we know that θ satisfies V (D; θ) δ. Then, l(ˆθ, ˆθδ) l(ˆθ, θ) = 1 n(l(D; ˆθ) l(D; θ)) Published as a conference paper at ICLR 2017 From Lemma 20 we have: l(D; θ) = l(D; {ˆλ, αˆη}) (1 α)l(D; {ˆλ, 0}) + αl(D; {ˆλ, ˆη}) So l(ˆθ, ˆθδ) (1 α) 1 n l(D; ˆθ) l(D; {ˆλ, 0}) i=1 log p(yi|xi; ˆθ) log Unif(Y) (1 α)E [KL (p( |X; θ) Unif(Y))] 1 δ 4β ˆη 2 E [KL (p( |X; θ) Unif(Y))] D DETAILED DESCRIPTION OF EXPERIMENTS D.1 NEURAL NETWORK ARCHITECTURES MNIST For MNIST, we train 6 different fully-connected (dense) neural networks with 2 or 3 layers (see Table 1). For all architectures, we used dropout rate p = 0.5 for all hidden layers and p = 0.2 for the input layer. CIFAR-10 and CIFAR-100 For the two CIFAR datasets, we used the same architecture in Srivastava et al. (2014) three convolutional layers followed by two fully-connected hidden layers. The convolutional layers have 96, 128, 265 filters respectively, with a 5 5 receptive field applied with a stride of 1. Each convolutional layer is followed by a max pooling layer pools 3 3 regions at strides of 2. The fully-connected layers have 2048 units each. All units use the rectified linear activation function. Dropout was applied to all the layers with dropout rate p = (0.1, 0.25, 0.25, 0.5, 0.5, 0.5) for the layers going from input to convolutional layers to fully-connected layers. D.2 NEURAL NETWORK TRAINING Neural network training in all the experiments is performed with mini-batch stochastic gradient descent (SGD) with momentum. We choose an initial learning rate of η0, and the learning rate is updated on each epoch of training as ηt = η0/(1 + ρt), where ρ is the decay rate and t is the number of epoch completed. We run each experiment with 2,000 epochs and choose the parameters achieving the best performance on validation sets. Table 3 summarizes the chosen hyper-parameters for all experiments. Most of the hyper-parameters are chosen from Srivastava et al. (2014). But for some experiments, we cannot reproduce the performance reported in Srivastava et al. (2014) (We guess one of the possible reasons is that we used different library for implementation.). For these experiments, we tune the hyper-parameters on the validation sets by random search. Due to time constrains it is infeasible to do a random search across the full hyper-parameter space. Thus, we try to use as many hyper-parameters reported in Srivastava et al. (2014) as possible. D.3 EFFECT OF EXPECTATION-LINEARIZATION RATE λ Table 4 illustrates the detailed results of the experiments on the effect of λ. For MNIST, it lists the error rates under different λ values for six different network architectures. For two datasets of CIFAR, it gives the error rates under different λ values, among with the empirical expectation-linearization risk ˆ . Published as a conference paper at ICLR 2017 Table 3: Hyper-parameters for all experiments. Experiment Hyper-parameter batch size 200 initial learning rate η0 0.1 decay rate ρ 0.025 momentum 0.9 momentum type standard max-norm constrain 3.5 10 100 batch size 100 100 initial learning rate η0 for conv layers 0.001 0.001 initial learning rate η0 for dense layers 0.1 0.02 decay rate ρ 0.005 0.005 momentum 0.95 0.95 momentum type standard nesterov max-norm constrain 4.0 2.0 L2-norm decay 0.001 0.001 Table 4: Detailed results for experiments on the effect of λ. λ Experiment 0.0 0.5 1.0 2.0 3.0 5.0 7.0 10.0 model 1 1.23 1.12 1.12 1.08 1.07 1.10 1.25 1.35 model 2 1.19 1.14 1.08 1.04 1.03 1.07 1.13 1.21 model 3 1.05 1.04 0.98 1.03 1.05 1.05 1.10 1.12 model 4 1.07 1.02 0.97 0.94 0.96 1.01 1.05 1.20 model 5 1.03 0.95 0.95 0.90 0.92 0.98 1.03 1.08 model 6 0.99 0.98 0.93 0.87 0.96 0.98 1.05 1.10 λ 0.0 0.1 0.5 1.0 2.0 5.0 10.0 CIFAR-10 error rate 12.82 12.52 12.38 12.20 12.60 12.84 13.10 ˆ 0.0139 0.0128 0.0104 0.0095 0.0089 0.0085 0.0077 CIFAR-100 error rate 37.22 36.75 36.25 37.01 37.18 37.58 38.01 ˆ 0.0881 0.0711 0.0590 0.0529 0.0500 0.0467 0.0411