# dropout_training_is_distributionally_robust_optimal__6d657887.pdf Journal of Machine Learning Research 24 (2023) 1-60 Submitted 4/21; Revised 4/23; Published 6/23 Dropout Training is Distributionally Robust Optimal Jos e Blanchet jose.blanchet@stanford.edu Department of Management Science and Engineering Stanford University Stanford, CA 94305, USA Yang Kang yangkang@stat.columbia.edu Department of Statistics Columbia University New York, NY 10027, USA Jos e Luis Montiel Olea jlo67@cornell.edu Department of Economics Cornell University Ithaca, NY 14850, USA Viet Anh Nguyen nguyen@se.cuhk.edu.hk Department of Systems Engineering and Engineering Management Chinese University of Hong Kong Hong Kong SAR Xuhui Zhang xuhui.zhang@stanford.edu Department of Management Science and Engineering Stanford University Stanford, CA 94305, USA Editor: Samory Kpotufe This paper shows that dropout training in generalized linear models is the minimax solution of a two-player, zero-sum game where an adversarial nature corrupts a statistician s covariates using a multiplicative nonparametric errors-in-variables model. In this game, nature s least favorable distribution is dropout noise, where nature independently deletes entries of the covariate vector with some fixed probability δ. This result implies that dropout training indeed provides out-of-sample expected loss guarantees for distributions that arise from multiplicative perturbations of in-sample data. The paper makes a concrete recommendation on how to select the tuning parameter δ. The paper also provides a novel, parallelizable, unbiased multi-level Monte Carlo algorithm to speed-up the implementation of dropout training. Our algorithm has a much smaller computational cost compared to the naive implementation of dropout, provided the number of data points is much smaller than the dimension of the covariate vector. Keywords: generalized linear models, generalization error, distributionally robust optimization, machine learning, multi-level Monte Carlo c 2023 Jos e Blanchet, Yang Kang, Jos e Luis Montiel Olea, Viet Anh Nguyen, Xuhui Zhang. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/21-0377.html. Blanchet, Kang, Olea, Nguyen and Zhang 1. Introduction Dropout training is an increasingly popular estimation method in machine learning.1 The general idea consists in ignoring some dimensions of the covariate vector at random while estimating the parameters of a statistical model. A common motivation for dropout train- ing is that the random feature selection implicitly performs model averaging, potentially improving prediction error.2 This paper contributes to the growing literature explaining how dropout training can improve a predictor s generalization error, e.g., Wager et al. (2013), Helmbold and Long (2015), Wei et al. (2020). Broadly speaking, a predictor s generalization error refers to its ability to perform well on new, previously unseen inputs not just those on which our model was trained. (Goodfellow et al., 2016, Section 5.2). As we explain below, our main result shows that dropout training improves generalization error over distributions that arise from multiplicative perturbations of in-sample covariates. To make this point, this paper studies dropout training in the context of generalized linear models. A generalized linear model for the scalar outcome variable Y Y R, given a d-dimensional vector of covariates X = (X1, . . . , Xd) Rd, is defined by the conditional f(Y |X, θ) h(Y, φ) exp Y β X Ψ(β X) /a(φ) , (1) where the model s parameters are θ (β , φ) ; see Mc Cullagh and Nelder (1989, Equa- tion 2.4). In our notation, h( , φ) is a real-valued function parameterized by φ and defined on the domain Y, a( ) is a positive function of φ, and Ψ( ) is the log-partition function which we assume to be defined on all the real line. Normal, logistic, and Poisson regression all have conditional densities of the form in Equation (1). Prediction in generalized linear models is typically based on the conditional expectation of Y , given X, implied by the likelihood in (1), evaluated at an estimated parameter vector bθ (for example, one common choice for bθ is the vector containing the maximum likelihood estimators for β and φ based on the available data).3 We define the generalization error of a predictor for Y as: EQ h ln f(Y |X, bθ ) i , where (X, Y ) Q. (2) 1. Section 7.12 of Goodfellow et al. (2016) provides a textbook treatment on dropout training. Bishop (1995) and Srivastava et al. (2014) are seminal references on this topic. 2. See Hinton et al. (2012) for a discussion about this point in the context of neural networks. See also Draper (1994) and Raftery et al. (1997) for classical results on the optimality of model averaging for prediction purposes. 3. Throughout the paper, we will typically use bθ, bβ, bφ to denote the estimators of parameters θ, β, φ, without necessarily making explicit reference to the data or the sample size. Dropout Training is Distributionally Robust Optimal The above expectation is computed by fixing the estimated parameter vector, bθ, and then drawing new covariates and outcomes according to the joint distribution Q. When Q dif- fers from the data s empirical distribution, we interpret (2) as a measure of out-of-sample performance (as the draws from Q can be thought of as new, and previously unseen, inputs). Our main result (Theorem 1) shows that, when Q corresponds to multiplicative perturba- tion of the covariates empirical distribution (in a sense we make precise), the generalization error of dropout training is no larger than the in-sample loss of dropout training averaged over dropout noise (Corollary 1). Therefore, our main result implies that dropout training generalizes well to previously unseen data distributions arising from in-sample covariates multiplicative perturbations. We establish our main result by showing that dropping out input features when training generalized linear models can be viewed as a Distributionally Robust Optimization (DRO) problem (Shapiro et al., 2014). A DRO problem is a two-player, zero-sum game between a decision maker (a statistician) and an adversary (nature). The statistician wishes to choose an action to minimize a given expected loss (e.g., squared loss in a typical linear regression setting or, more generally, the negative of the log-likelihood function). At the same time, nature intends this loss to be maximal. We consider a framework in which nature can harm the statistician by corrupting the available covariates using a multiplicative nonparametric errors-in-variables model, as in the classic work of Hwang (1986).4 Under mild assumptions, nature s least favorable distribution in this game turns out to be dropout noise, where nature independently deletes entries of the covariate vector with some fixed probability δ. Because dropout noise is least favorable, we can use the in-sample loss of dropout training averaged over the dropout noise to upper bound the generalization error of predictors computed via dropout training, under any probability distribution that arises via multiplicative perturbation of the covariates in the sample. One might argue that the significance and usefulness of Theorem 1 is limited, as the distributions over which dropout training generalizes well only cover perturbations around the data s empirical distribution. Indeed, in the quintessential definition of generalization error, the new examples are typically drawn from the true data-generating process, which we denote as P . Do our results have anything to say about the case in which the generalization error in (2) is evaluated at P ? We answer this question in the positive, provided that the dropout probability, δ, is chosen to be c/ n where n denotes the number of training examples. In particular, we show that the generalization error of dropout training based on examples drawn from the true data-generating process will be bounded above by the loss of dropout in the training sample (averaged over the dropout noise) with a probability 4. As far as we know, this simple action space for nature is a novel model for conceptually and quantitatively understanding dropout training. Blanchet, Kang, Olea, Nguyen and Zhang that depends on c (Theorem 2). Consequently, by choosing a target probability, say 95%, it is possible to provide a concrete recommendation for selecting c and, therefore, for δ. As we will explain later, our concrete recommendation for choosing the dropout probability in generalized linear models stands in contrast to ad hoc suggestions available for neural networks, in which an input unit is usually included with probability 0.8, and a hidden unit is included with probability 0.5 (Goodfellow et al., 2016, Chapter 7, p. 257). Finally, and to analyze the implications of our results numerically, we make an additional contribution by suggesting a new stochastic optimization implementation of dropout train- ing. The generalization error of the standard stochastic gradient descent implementation of dropout training here profits from the implicit regularization imposed by the stochastic gradient descent routine (Wei et al., 2020). Since our theoretical analysis makes no use of this implicit regularization, it is important to have an algorithm that does not introduce any further bias to the solution of dropout training. We borrow ideas from the multi-level Monte Carlo literature in particular from the work of Blanchet et al. (2019a) to suggest an unbiased (in a sense we will make precise) dropout training routine. Our algorithm is easily parallelizable and has a much smaller computational cost than naive dropout training methods when the number of features is large (Theorem 3). Our algorithm thus comple- ments the recent literature suggesting approaches to speed-up dropout training by either using a parallelized implementation of stochastic gradient descent (Zinkevich et al., 2010) or a fast dropout training based on Gaussian approximations (Wang and Manning, 2013). The rest of the paper is organized as follows. Section 2 explains dropout training in the context of generalized linear models. Section 3 presents a general description of the DRO framework used in this paper. Section 4 zeros in on the DRO problem by using the negative log-likelihood of generalized linear models to define a loss function for the statistician, and by allowing nature to harm the statistician via a multiplicative errors-in-variables model for the covariates. This section also presents our main theorem. Section 5 presents our approach to select the dropout probability, δ. Section 6 discusses different computational methods available for implementing dropout training (full integration, stochastic gradient descent, na ıve Monte Carlo integration) and suggests our unbiased multi-level Monte Carlo algorithm. Section 7 presents some simulations comparing our recommended selection of δ to cross-validation and our preferred implementation of dropout training over stochastic gradient descent. Finally, Section 8 discusses extensions of our results to a particular class of feed-forward neural networks with a single hidden layer. We show that dropout training of the hidden units in the hidden layer is distributionally robust and optimal. All of the proofs are collected in the Appendix. Dropout Training is Distributionally Robust Optimal 2. Dropout Training in Generalized Linear Models This section describes dropout training in the context of generalized linear models. As with some other recent papers in the literature, we view generalized linear models as a convenient, transparent, and relevant framework to better understand the theoretical and algorithmic properties of dropout training. For a given covariate vector xi, and a user-selected constant, δ [0, 1), define the d-dimensional random vector ξi = (ξi,1, . . . , ξi,d) {0, 1/(1 δ)}d, where each of the d entries of ξi is an independent draw from a scaled Bernoulli distribution with parameter 1 δ. This is, for j = 1, . . . , d: 0 with probability δ, (1 δ) 1 with probability (1 δ). (3) The distribution of ξi,j collapses to ξi,j = 1 with probability 1 when δ = 0. Let denote the binary operator defining element-wise multiplication between two vectors of the same dimension. Consider the covariate vector xi ξi (xi,1ξi,1, . . . , xi,dξi,d) . (4) Some entries of the new covariate vector are 0 (those for which ξi,j = 0), and the rest are equal to xi,j/(1 δ). In a slight abuse of notation, let Eδ denote the distribution of the random vector ξi, whose distribution is parameterized by δ. The estimators of (β, φ) obtained by dropout training correspond to any parameters (bβ(δ), bφ(δ)) that solve the problem inf β,φ 1 n i=1 Eδ [ ln f(yi|xi ξi, β, φ)] . (5) Note that the maximum likelihood estimator of (β, φ), denoted as (bβML, bφML), equals (bβ(0), bφ(0)). There is some empirical evidence that using intentionally corrupted features for training has the potential to improve the performance of machine learning algorithms (Maaten et al., 2013). Even if one is willing to accept that feature corruption is desirable for estimation, the choice of dropout noise in (3) seems arbitrary. Blanchet, Kang, Olea, Nguyen and Zhang To explain the empirical success of the use of dropout noise. Wager et al. (2013) and Helmbold and Long (2015) theoretically analyzed the dropout training criterion (5) as the composition of the original loss (e.g., Equation 5 with δ = 0) and a penalty term that they view as a regularizer. They then analyzed this dropout regularizer and compared its regularization effect with other regularizers (such as L1 and L2 penalties). Specifically, Wa- ger et al. (2013) proposed a convex quadratic approximation for the regularization penalty using first-order Taylor approximation in generalized linear models. While Helmbold and Long (2015) conducted a more explicit analysis in the context of logistic regression and suggested several non-standard properties of the dropout regularizer. Relative to these papers, our work focuses on providing a novel decision-theoretic inter- pretation of dropout training (in the population and the sample). We will argue there is a natural two-player, zero-sum game between a decision maker (statistician) and an adver- sary (nature) in which dropout training emerges naturally as a minimax solution. In this game, dropout noise is nature s least favorable distribution, and dropout training becomes the statistician s optimal action. The framework we use, from the stochastic optimization literature, is distributionally robust optimization. 3. Distributionally Robust Optimization Consider a general prediction problem where there is a multivariate predictor X Rd and a scalar outcome variable Y R. A distributionally robust optimization (DRO) problem is a simultaneous two-player zero-sum game between a decision maker (in our context, the statistician) and an adversary (nature).5 In this section, we describe the action space for each player, their strategies, and the payofffunction. 3.1 Actions and Payoff The statistician s action space consists of vectors θ Θ. The ranking of the statistician s actions is contingent on the realization of (X, Y ), and this is captured by a real-valued loss function ℓ(X, Y, θ). If the statistician knew the distribution of (X, Y ) which we denote by Q the statistician s preferred choice of θ would be the solution to inf θ Θ EQ [ℓ(X, Y, θ)] . (6) 5. A seminal reference is the robust inventory control problem of Scarf (1958). Recent references describing the use of distributionally robust stochastic programs (like those considered in this paper) include Delage and Ye (2010) and Shapiro (2017). Christensen and Connault (2019) used distributionally robust optimization to characterize the sensitivity of counterfactual analysis with respect to distributional assumptions in a class of structural econometric models. Dropout Training is Distributionally Robust Optimal Instead of assuming that the distribution Q is determined exogenously, we think of the distribution Q as being chosen by nature. Thus, nature s action space consists of a set of probability distributions denoted as U. We refer to this set as the distributional uncertainty set. If nature knew the action selected by the statistician, nature s preferred action would sup Q U EQ[ℓ(X, Y, θ)]. (7) 3.2 Strategies and Solution The choices of θ and Q are assumed to happen simultaneously. A statistician s strategy for this game consists of a choice of θ; nature s strategy for this game consists of a choice of Q. A Nash equilibrium for this game is a pair (θ , Q ) such that: a) given Q , the pa- rameter θ solves (6) and b) given θ , the distribution Q solves (7). The minimax solution for this game is a pair (θ, Q) that solves sup Q U EQ[ℓ(X, Y, θ)], (8a) while the maximin solution is based on the mathematical program sup Q U inf θ Θ EQ[ℓ(X, Y, θ)]. (8b) If Q solves (8b), we say that Q is nature s least favorable distribution. The mathematical program in (8a) is typically referred to as a DRO problem. 4. Dropout Training is Distributionally Robust Optimal We now specialize the general DRO framework of Section 3 by imposing two restrictions. First, we use the negative log-likelihood of generalized linear models (Mc Cullagh and Nelder, 1989) as a loss function for the statistician. Second, we define nature s uncertainty set (i.e., the possible data distributions that nature can take) using the multiplicative errors-in- variables model of Hwang (1986). 4.1 Statistician s Payoff We define the loss function for the statistician to be the negative of the logarithm of the likelihood in (1), that is, ℓ(X, Y, θ) = ln h(Y, φ) + (Ψ(β X) Y (β X))/a(φ), (9) Blanchet, Kang, Olea, Nguyen and Zhang where θ (β , φ) Θ Rd R+. Equation (9) defines the statistician s objective and its set of actions. 4.2 Nature s Distributional Uncertainty Set We now define the possible distributions that nature can choose. We start out by letting Q0 denote some benchmark or reference distribution over (X, Y ). This distribution will typically be the empirical distribution of the data, but we present our main result allowing for a general Q0. This distribution need not correspond to that induced by a generalized linear model. In other words, our framework allows for the statistician s model to be mis- Next, we define nature s action space by considering perturbations of Q0. Although there are different ways of doing this for example, by using either f-divergences (such as the Kullback Leibler divergence, as used by Nguyen et al. 2020 and the χ2 divergence, as done by Duchi and Namkoong 2019) or the optimal transport distance (such as the Wasserstein distance, as in Blanchet et al. 2019b) to define a neighborhood we herein use a nonparametric multiplicative errors-in-variables model as in Hwang (1986).6 The idea is to allow nature to independently introduce measurement error to the covari- ates using multiplicative noise. Let ξ (ξ1, . . . , ξd) be defined as a d-dimensional vector of random variables that are independent of (X, Y ). We perturb the distribution Q0 by considering the transformation (X, Y ) 7 (X1ξ1, . . . , Xdξd, Y ) . As a result, each covariate Xj is distorted, in a multiplicative fashion, by ξj. We often abbreviate (X1ξ1, . . . , Xdξd) by X ξ, where signifies element-wise multiplication. We restrict the distribution of ξ in the following way: First, for a parameter δ [0, 1), we define Qj(δ) to be the set of distributions for ξj that are supported on the interval [0, 1/(1 δ)] and that have mean equal to 1. More specifically, Qj(δ) Qj : Qj is a probability distribution on R, Qj([0, (1 δ) 1]) = 1, EQj[ξj] = 1 . This set of distributions, prescribed using support and first-order moment information, is popular in the DRO literature thanks to its simplicity and tractability (Wiesemann et al., 2014). From the perspective of an errors-in-variables model, these distributions are also 6. The different choices of action space in the DRO literature have been shown to enjoy regularization benefits. For example, Duchi and Namkoong (2019) showed that a DRO formulation with χ2 divergence leads to convex variance regularization. Blanchet et al. (2019b) showed that DRO with Wasserstein distance amounts to square-root LASSO in linear regressions. In contrast, the action space we introduce induces dropout regularization. Dropout Training is Distributionally Robust Optimal attractive because they preserve the expected value of the covariates, assuming that Xj and ξj are drawn independently. Consider now the joint random vector (X, Y, ξ) Rd R Rd. For a constant δ [0, 1) consider the joint distributions over (X, Y, ξ) defined by U(Q0, δ) = {Q0 Q1 . . . Qd : Qj Qj(δ) j = 1, . . . , d} , (11) where denotes the product measure (meaning that the joint distribution is the product of the independent marginals Qj, j = 0, . . . , d). Thus, in the game, we consider U(Q0, δ) as nature s action space or nature s distributionally uncertainty set. We will make only one assumption about the reference distribution: Q0 : Assumption 1 The distribution Q0 satisfies EQ[ℓ(X ξ, Y, θ)] < for any Q U(Q0, δ), any θ Θ, and any scalar δ [0, 1). This assumption implies a minimal regularity condition to guarantee that the expected loss is well defined for both the statistician and nature. Assumption 1 holds trivially when Q0 is the empirical distribution of the data, which is one of the main cases of interest in 4.3 Dropout Training is Distributionally Robust Optimal Theorem 1 Consider the two-player zero-sum game where the statistician has the loss function in (9) and nature has the action space in (11) for some reference distribution Q0 and a scalar δ [0, 1). If Assumption 1 holds, then for any θ Θ, sup Q U(Q0,δ) EQ [ℓ(X ξ, Y, θ)] (12) is equivalent to EQ [ℓ(X ξ, Y, θ)], (13) where Q = Q0 Q 1 . . . Q d , and Q j = (1 δ) 1 Bernoulli(1 δ) is a scaled Bernoulli distribution for any j = 1, . . . , d, i.e., under Q j 0 with probability δ, (1 δ) 1 with probability (1 δ). (14) In addition, let θ Θ be a minimizer of (13). Then (θ , Q ) constitutes a Nash equi- librium of the two-player zero-sum game defined by (9) and (11), and Q is nature s least favorable distribution. Blanchet, Kang, Olea, Nguyen and Zhang Proof See Appendix A.2. The first part of the theorem characterizes nature s worst-case perturbation of Q0. From the statistician s perspective, nature s worst-case perturbation of Q0 is given by the dropout noise Q in (14). Under this worst-case distribution, nature independently corrupts each entry of X = (X1, . . . , Xd) , either by dropping the j-th component (if ξj = 0), or by re- placing it by Xj/(1 δ). Dropout training which here refers to estimating the parameter θ after adding dropout noise to X thus becomes the statistician s preferred way of esti- mating the parameter θ when facing an adversarial nature. This gives a decision-theoretic foundation for the use of dropout training. Note that, in order to recover the objective function introduced in (5) (the sample average of the contaminated log-likelihood), it suffices to set the reference measure Q0 as the empirical distribution b Pn of {(xi, yi)}n i=1, which satisfies Assumption 1. We provide now some intuition about how dropout noise becomes nature s worst-case distribution. Algebra shows that, in light of Assumption 1, the expected loss under an arbitrary distribution Q is finite and can be written as EQ[ℓ(X ξ, Y, θ)] = EQ0 [ln h(Y, φ)] + EQ0 h EQ1 ... Qd[(Ψ((β X) ξ) Y ((β X) ξ))/a(φ)] i , where the first expectation is taken with respect to the reference distribution, and the second with respect to ξ. For fixed values of (X, Y, θ), we can define the function A(X,Y,θ)((β X) ξ) (Ψ((β X) ξ) Y ((β X) ξ))/a(φ). The key step to establishing Theorem 1 is to show that sup n EQ1 ... Qd[A(X,Y,θ)((β X) ξ)] : Qj Qj(δ) o = EQ 1 ... Q d [A(X,Y,θ)((β X) ξ)] for any θ. The proof of this equality crucially exploits the convexity of the function A(X,Y,θ) : R R, which we show to be a consequence of the convexity of the log-partition function Ψ( ).7 We make four important remarks about the equality in (15), and about the role that convexity plays in the optimality of dropout training. Remark 1. First, a natural question to ask is whether the convexity of the log-partition function Ψ imposes a significant restriction on the class of generalized linear models that 7. To derive our result, we first characterize the worst-case distribution for the expectation of a real-valued convex function (Lemma 1) and then we generalize this result to functions that depend on ξ only through linear combinations, as A(X,Y,θ)( ) (Lemma 2). Dropout Training is Distributionally Robust Optimal we are considering. It is known that the log-partition function of any generalized linear model with an open parameter space is convex; see Proposition 3.1 in Wainwright and Jordan (2008). This immediately implies that the convexity of Ψ should not be viewed as a significant restriction. Remark 2. Second, it is reasonable to inquire whether our results concerning the optimality of dropout training could be extended beyond generalized linear models. For instance, one could be interested in considering a model in which linear predictors are obtained using an objective function of the form EQ0[f(Y, β X)], where f(y, ) : R R, and (X, Y ) Q0. This formulation includes certain types of extremum estimators as well as single-index models estimated by nonlinear least squares. Theorem 4 in Appendix A.6.1 shows that, if dropout noise is nature s worst-case distribution among multiplicative perturbations of covariates; that is, if sup Q U(Q0,δ) EQ[f(Y, β X ξ)] = EQ [f(Y, β X ξ)] (16) for some δ (0, 1], all d 1, all β Rd and all reference distributions Q0, then f(y, ) must be a convex function on the real line for any given y. This means that convexity indeed plays a crucial role in the optimality of dropout training and, while it is possible to consider extensions outside the class of generalized linear models, such extensions will still have to impose some form of convexity (which may turn out to be very strong).8 In Appendix A.6.2, we also argue that, beyond generalized linear models, the convexity of the objective function and the definition of U are, in general, not sufficient to make dropout training minimax-optimal. Remark 3. Third, to further understand the extent to which convexity-like restrictions imply the optimality of dropout noise, we consider a generalization of the multiplicative errors-in-variables model that allows for correlated noise. We show that, in this case, there is a simple generalization of dropout (that we term block dropout ) that drops out blocks of variables at the same time. See Appendix A.6.3. Remark 4. Fourth, Theorem 1 does not imply that U(Q0, δ) is the largest class of distri- butions for which dropout training is distributionally robust optimal (or for which dropout noise is the worst-case distribution). In appendix A.6.4 we show that in the context of the linear regression model the dropout estimator can also be obtained by solving a dis- tributionally robust optimization problem over additive contamination models (provided 8. For instance, in Appendix A.6.1, we show that the only single-index model with an objective function of the form f(y, β x) = (y g(β x))2 for which dropout training is optimal is the linear regression model; that is, g(β x) = β x. Blanchet, Kang, Olea, Nguyen and Zhang the additive perturbations are independent of the data, have a mean of zero, and satisfy a certain bound on their second moments). Thus, the union of additive and multiplicative perturbations still gives dropout noise as the worst-case distribution, and makes the dropout training estimator distributionally robust optimal. How about the Nash Equilibrium of the two-player zero-sum game defined by Equa- tions (9) and (11)? The equality in Equation (15) clearly shows that Q is nature s best response for any θ Θ. If there is a vector θ that solves the dropout training problem in Equation (13), then this vector is the statistician best s response to nature s choice of Q . Consequently, (θ , Q ) is a Nash equilibrium. Finally, we discuss the extent to which Q can be referred to as nature s least favorable distribution, which has been defined as nature s solution to the maximin problem. It is well known that the maximin value of a game is always smaller than its minimax value:9 sup Q U(Q0,δ) inf θ Θ EQ[ℓ(X ξ, Y, θ)] inf sup Q U(Q0,δ) EQ[ℓ(X ξ, Y, θ)]. We have shown that the right-hand side of the display above equals the infimum of (13). Therefore, if there is a θ Θ that solves such program, then Q achieves the upper bound to the maximin value of the game. To see this, note that by definition of the left-hand side of the display above inf θ Θ EQ [ℓ(X ξ, Y, θ)] sup Q U(Q0,δ) inf θ Θ EQ[ℓ(X ξ, Y, θ)], while we also have inf θ Θ EQ [ℓ(X ξ, Y, θ)] = inf sup Q U(Q0,δ) EQ[ℓ(X ξ, Y, θ)], by Equation (13). This makes dropout noise nature s least favorable distribution. Now that we have established that dropout training gives the minimax solution of the DRO game, we discuss the implications of this result regarding the out-of-sample perfor- mance of dropout training. Suppose Q0 is the empirical measure b Pn supported on n training samples {(xi, yi)}n i=1. Let bθ(δ) denote the estimators of (β, φ) based on dropout training with dropout probability δ. The in-sample loss of dropout training, averaged over dropout 9. This follows from the fact that for any Q U(Q0, δ) : inf θ Θ EQ[ℓ(X ξ, Y, θ)] EQ[ℓ(X ξ, Y, θ)] sup Q U(Q0,δ) EQ[ℓ(X ξ, Y, θ)]. See also the discussion of the minimax theorem in Ferguson 1967, p. 81. Dropout Training is Distributionally Robust Optimal noise, is 1 n i=1 EQ [ℓ(X ξ, Y, bθ(δ))|X = xi, Y = yi], (17) which is equivalent to the objective function of dropout noise that we presented in equation (5), evaluated at the parameters estimated using dropout training; that is i=1 Eδ h ln f yi|xi ξi, bβ(δ), bφ(δ) i . A typical concern about estimation procedures is whether their performance in a specific sample translates to good performance out of sample. In our context, the out-of-sample performance of dropout training can be thought of as the expected loss that would arise for some other data distribution Q over (X, Y ) at the parameter estimated via dropout E Q[ℓ(X, Y, bθ(δ))]. The following is a direct corollary of Theorem 1: Corollary 1 Let bθ(δ) = (bβ(δ) , bφ(δ)) denote the dropout estimators of β and φ given dropout noise δ. Consider any distribution Q over (X, Y ) that can be obtained from b Pn by perturbing covariates with mean-one, independent multiplicative error ξj [0, (1 δ) 1]. That is, if Q corresponds to the distribution of a vector of the form ( X1ξ1, . . . , Xdξd, Y ) , where ( X1, . . . , Xd, Y ) b Pn and (ξ1, . . . ξd) is a vector of i.i.d. random variables (indepen- dent of the empirical distribution of the data), supported on [0, (1 δ) 1]. Then, E Q[ℓ(X, Y, bθ(δ))] 1 i=1 Eδ h ln f yi|xi ξi, bβ(δ), bφ(δ) i . (18) This means that the objective function used for dropout training will provide an upper bound for the out-of-sample loss associated with multiplicative perturbations of the training Finally, we note that Theorem 1 was stated for a scalar δ that is homogeneous across the multiplicative noise ξj. To model non-identical dropout noise, we can substitute the sets in Equations (10) and (11) by Qj(δj) for a collection of parameters (δ1, . . . , δd) [0, 1)d. In this case, the results of Theorem 1 hold with Q j = (1 δj) 1 Bernoulli(1 δj) for j = 1, . . . , d. Blanchet, Kang, Olea, Nguyen and Zhang 5. Statistical Guidance on Choosing δ Theorem 1 above showed that dropout training is distributionally robust optimal and that nature s least favorable distribution is dropout noise with probability δ [0, 1). This section suggests a strategy to pick this parameter to control the generalization error of dropout training. 5.1 Additional Notation Let bφn denote an arbitrary n-asymptotically normal estimator for the scale parameter φ. Such an estimator can be the maximum likelihood estimator, which is known to be consistent and asymptotically normal under mild regularity conditions on the joint distribution of (Xi, Yi) (Fahrmeir and Kaufmann, 1985). We also allow bφn to be the dropout training estimator of φ but with a dropout probability of c/ n. Finally, as we did before, we let bβ(δ) denote the dropout estimator of β under dropout probability δ. In this section, we assume that the observed data was generated by a generalized linear model. Let (β , φ ) denote the parameters that were used to generate the training data, and let P denote the corresponding joint distribution of covariates and outcomes under the true data generating process (hence, the training sample consists of n i.i.d. draws from A key quantity in this section is Ln(β, φ, δ) 1 i=1 Eδ [ ln f(yi|xi ξi, β, φ)] , (19) which equals the in-sample (or empirical) loss at parameters β and φ, but averaged over the dropout noise with dropout probability δ. To speak about the generalization error of dropout training, we define the population loss at (β, φ) as L(β, φ) EP [ ln f(Y |X, β, φ)] . (20) L(bβ(δn), bφn) (21) is the loss that will arise from evaluating the negative log-likelihood in Equation (1) at (bβ(δn), bφn) and then averaging over values of (X, Y ) drawn according to P . This corre- sponds to the definition of generalization error used in Equation (2), evaluated at P . If the negative log-likelihood is used as a measure of performance of the estimators (bβ(δn), bφn), then an upper bound on (21) can be thought of as an upper bound on the generalization error of (bβ(δn), bφn). Dropout Training is Distributionally Robust Optimal The main result in this section shows that the event Ln(bβ(δn), bφn, δn) L((bβ(δn), bφn)), (22) can be guaranteed to hold with probability at least 1 α, for an appropriate choice of the sequence δn. This means that we can upper bound the generalization error of the estimators (bβ(δn), bφn) with probability at least 1 α, as we explain next. 5.2 Additional Assumptions and Main Result of this Section Assumption 2 The log-partition function Ψ( ) has a bounded second derivative. Assumption 3 The second moment matrix EP [XX ] is finite, positive definite. Assumption 4 For some σ2: n Ln(β , bφn, 0) L(β , φ ) d N(0, σ2). We can now state this section s main result. Define the random variable µn(β, φ, δ) Ln(β, φ, δ) Ln(β, φ, 0), (23) which is nonnegative for any δ [0, 1) by Theorem 1. Theorem 2 Suppose that Assumptions 2, 3, 4 hold. Then for any sequence δn = c/ n, n Ln(bβ(δn), bφn, δn) L((bβ(δn), bφn) d N(µ (β , φ , c), σ2), where µ (β , φ , c) 0 is a linear function of c, strictly increasing for almost every (β , φ ), and equal to the probability limit of nµn β , bφn, δn , and µn( ) is defined as in Equation (23). Proof See Appendix A.3. The main message of Theorem 2 is that the probability of the event Ln(bβ(δn), bφn, δn) L((bβ(δn), bφn) Blanchet, Kang, Olea, Nguyen and Zhang can be approximated, as the sample size goes large, by the probability under a normal random variable with positive mean of the positive half of the real line. That is: P Ln(bβ(δn), bφn, δn) L((bβ(δn), bφn) = P N(µ (β , φ , c), σ2) 0 + o(1). Since µ is nonnegative, it is a strictly increasing function of c for almost every (β , φ ), then c can be chosen to guarantee that the probability of the first term in the right-hand side of the equality is as close to one as desired. This means that we can choose the dropout probability to guarantee that the generalization error associated with the estimator (bβ(δn), bφn) which we have denoted as L(bβ(δn), bφn) admits an upper bound (estimable based on information in the sample) with high probability. See Equation (29) for our explicit recommendation on how to choose δ = c/ n based on Theorem 2. Some elementary algebra can be used to illustrate the main argument behind the proof. It is convenient to start analyzing the difference n Ln(bβ(δn), bφn, δn) L(β , φ ) instead of n Ln(bβ(δn), bφn, δn) L((bβ(δn), bφn) . Algebra shows that n Ln(bβ(δn), bφn, δn) L(β , φ ) = n Ln(bβ(δn), bφn, δn) Ln(β , bφn, δn) (24) + nµn(β , bφn, δn) (25) + n Ln(β , bφn, 0) L(β , φ ) . (26) We show that the sum of these terms converges in distribution to a normal. The proof has three main steps that we explain below. Step 1. We start by showing that term (24) converges in probability to zero. To do this, we show that n(bβ(c/ n) β ) (27) is asymptotically normal and that the derivative of Ln( ) with respect to β (evaluated at β ) converges in probability to zero. One important observation is that mean of the asymptotic distribution of Equation (27) is nonzero, and depends linearly on c. Consequently, even though the dropout training estimator is asymptotically normal, its limiting distribution exhibits bias (and the norm of such bias increases linearly as a function of c2). This first step thus shows that choosing a larger dropout probability comes at the cost of a decrease in accuracy in the estimation of β . Our recommended procedure for choosing c will guarantee Dropout Training is Distributionally Robust Optimal that the generalization error of dropout training will be bounded with some prespecified probability, despite the decrease in estimation accuracy. Step 2. We then show that term (25) has a finite probability limit. This is the key step in the proof. Importantly, we can characterize this limit explicitly and show that nµn(β , bφn, δn) p c µ, ξ A EP [Ψ((X ξ) β )] d EP [Ψ(X β )] + EP [Y X ]β and A is the collection of all vectors in {0, 1}d for which there is only one zero. In Section 7, we provide an expression for this term in the linear regression model. Step 3. The term (26) above has, by Assumption 4, an asymptotically normal distribution with mean zero and variance σ2. Step 4. Finally, in order to establish Theorem 1 we show that n L((bβ(δn), bφn)) L(β , φ ) = o P (1), (28) which means that the expected loss evaluated at (bβ(δn), bφn) converges in probability to L(β , φ ) at a rate of at least n 1/2. It is important to mention that the DRO interpretation of dropout training can be leveraged to select the dropout parameter δ. For example, a possible approach is choosing δ so that the true data-generating process belongs to nature s choice set with some prespecified probability. This approach, which is often advocated in the literature on machine learning and robustness (Hansen and Sargent, 2008), often leads to a very pessimistic selection of δ simply because this criterion is not informed at all by the loss function defining the decision problem. Further, in our problem, it is impossible to apply this approach, given that the set of multiplicative perturbations of the empirical distribution will generally not cover the true data-generating process. Another approach involves using generalization bounds leading to finite sample guar- antees; see, for instance, a summary of this discussion in Section 6.2 of Rahimian and Mehrotra (2019). This method, while appealing, often requires either distributions with compact support or strong control on the tails of the underlying distributions. Also, often, the bounds depend on constants that may be too pessimistic or too difficult to compute. Finally, Blanchet et al. (2019b) recently introduced a method for the case in which nature s choice set is defined in terms of the Wasserstein distance around the empirical Blanchet, Kang, Olea, Nguyen and Zhang distribution. The idea therein is that for a fixed δ every distribution that belongs to nature s choice set corresponds to an optimal parameter choice for the statistician. Thus, one can collect each and every of the statistician s optimal choices associated with each distribution in nature s uncertainty set, and treat the resulting region as a confidence set for the true parameter. This confidence set grows bigger (in the sense of nested confidence regions) as δ increases. The goal is then to minimize δ subject to a desired level of coverage in the underlying parameter to estimate. This leads to a data-driven choice of δ explicitly linked to the statistician s decision problem. However, this approach is not feasible for our problem because, once again, regardless of the value of δ, the parameter choices for each of the multiplicative perturbations of the empirical distribution will generally fail to cover the true parameter. 5.3 Our recommendation for choosing δ. We advocate choosing the parameter δ to control how often the in-sample loss obtained from dropout training exceeds the population loss. The proof of Theorem 2 shows that µ (β , φ , c) is of the form c µ, where µ depends on (β , φ ). Consequently, as long as µ > 0, it is straightforward to pick c to guarantee a pre-specified coverage of the population loss: for any α (0, 1), if we pick c to be z1 α σ/µ, (29) where z1 α is the 1-α quantile of a standard normal, and µ is evaluated at consistent estimators of β and σ then the probability of the event (22) asymptotically approaches 6. An Algorithm for Dropout Training The goal of this section is to suggest an algorithm for solving the dropout training problem inf θ Θ EQ [ℓ(X ξ, Y, θ)], where Q = b Pn Q 1 . . . Q d and Q j , j = 1, . . . , d is the dropout noise distribution defined in (14). Notice that we here consider the specific case in which Q0 is set to the empirical measure b Pn supported on n training samples {(xi, yi)}n i=1. We will use θ n to denote the solution of the dropout training problem above. It will sometimes be convenient Dropout Training is Distributionally Robust Optimal to rewrite this dropout training problem as i=1 EQ [ℓ(X ξ, Y, θ) | X = xi, Y = yi], (30) which coincides with expression (5). Conditioning on the values of (xi, yi) makes it clear that the expectation is computed over the d-dimensional vector ξ. We now briefly describe three common approaches to implement dropout training and discuss some of its limitations. 6.1 Naive Dropout Training Because Q j places mass on only two points, namely 0 and (1 δ) 1, the support of the joint distribution Q 1 . . . Q d has cardinality 2d. Thus, a na ıve approach to solving the dropout training problem specified by Equation (30) is to expand the objective function as a sum with n 2d terms, then apply a tailored gradient descent algorithm to the resulting optimization problem. Computationally, however, this approach is too demanding because the number of individual terms in the objective function grows exponentially with d. 6.2 Dropout Training via Stochastic Gradient Descent Another method to solve the dropout training problem in Equation (13) is to use stochastic gradient descent (Robbins and Monro, 1951). This is tantamount to i) taking a draw of (xi, yi) according to its empirical distribution, ii) independently taking a draw of ξi using the distribution in Equation (3), and iii) computing the stochastic gradient descent update ln f(yi|xi ξi, β, φ). Given a current estimate bθ, we compute an unbiased estimate of the gradient to the objective function of (13), and move in the direction of the negative gradient by a step of suitable size. Since Q is discrete, the expectation under Q can be written as a finite sum, and, by differentiating under the expectation, we have θEQ [ℓ(X ξ, Y, bθ)] = EQ h θℓ(X ξ, Y, bθ) i . (31) The standard SGD algorithm uses a na ıve Monte Carlo estimator as an estimate of the gradient (31), that is, at iteration k N with incumbent solution bθk, θEQ [ℓ(X ξ, Y, bθk)] θℓ(xk ξk, yk, bθk), where (xk, yk, ξk) is an independent draw from Q . Blanchet, Kang, Olea, Nguyen and Zhang One drawback of using SGD for our problem is that it is not easily parallelizable, and thus its implementation can be quite slow. Moreover, under the strong convexity assump- tion of the loss function ℓ, SGD exhibits only linear convergence (Nemirovski et al., 2009, Section 2.1). In contrast, gradient descent (GD) offers exponential convergence (Boyd and Vandenberghe, 2004, Section 9.3.1). 6.3 Na ıve Monte Carlo Approximation for Dropout Training Consider solving dropout training problem (30) using a na ıve Monte Carlo approximation. Instead of using 2d terms to compute EQ [ℓ(X ξ, Y, θ) | X = xi, Y = yi], we approximate this expectation by taking a large number of K i.i.d. draws {ξk i }K k=1, ξk i Rd, according to the distribution Q 1 . . . Q d . When d is large, this approximation is computationally cheaper than the na ıve dropout training procedure described above, provided that K 2d. Thus, the na ıve Monte Carlo approximation of the dropout training problem is min θ Θ 1 n k=1 ℓ(xi ξk i , yi, θ) where the random vectors ξk i are sampled independently over both k and i using the distribution Q 1 . . . Q d . Relative to the solution of the dropout training problem which we denoted as θ n the minimizer of approximation (32) is consistent and asymptotically normal as K . This follows by standard arguments; for example, those in Shapiro et al. (2014, Section 5.1). There are, however, two problems that arise when using problem (32) as a surrogate for the dropout training problem. First, the solution to approximation (32) is a biased estimator for θ n . This means that, if we average the solution over the K n different values of ξk i , the average solution need not equal θ n . Second, implementing approximation (32) requires a choice of K and, to the best of our knowledge, there is no off-the-shelf procedure for picking 6.4 Unbiased Multi-level Monte Carlo Approximation for Dropout Training To address these two issues, we apply the recent techniques suggested by Blanchet et al. (2019a) that we refer to as unbiased multi-level Monte Carlo approximations. Multi-level Monte Carlo methods (Giles, 2008, 2015) are a set of techniques for approximating the ex- Dropout Training is Distributionally Robust Optimal pectation of random variables. The adjective multi-level emphasizes the fact that random samples of different levels of accuracy are used in the approximation. Before presenting the detailed algorithm, we provide a heuristic description. To this end, let bθ n (K) denote the level K solution of the problem in (32); that is, the solution based on K draws. Define the random variable K bθ n (K) bθ n (K 1) and, for simplicity, assume that bθ n (0) is a vector of zeros. Under suitable regularity condi- K=1 E[ K] = lim K E[bθ n (K)] = θ n . Consider now picking K randomly from some discrete distribution supported on the natural numbers. Let p( ) denote the probability mass function of such distribution and consider a Monte Carlo approximation scheme in which after drawing K we sample K n different random vectors ξk i Rd according to Q 1 . . . Q d . The estimator has two sources of randomness. These are, firstly, the random choice of K and, secondly, the random draws ξk i . Averaging over both yields K=1 E[Z(K )|K = K] p(K) = K=1 (E[ K]/p(K)) p(K) = θ n . Thus, by taking into account the randomness in the selection of K, we have managed to provide a rule for deciding the number of draws (specifically, our recommendation is to pick K at random), and, at the same time, we have removed the bias of na ıve Monte Carlo approximations. One possible concern with our suggested implementation is that the expected computa- tional cost of Z(K ) could be infinitely large. Fortunately, this issue can be easily resolved by an appropriate choice of the distribution p( ). To see this, define the computational cost simply as the number of random draws that are required to obtain Z(K ). In the construc- tion we have described above, we need K n draws for the construction of the estimator. Thus, the average cost is Blanchet, Kang, Olea, Nguyen and Zhang which, under mild integrability conditions on p( ), will be finite.10 We now present the algorithm that will be used to solve the dropout training problem. To ensure that the estimator Z(K ) has a finite variance, instead of defining K as the difference between the level K and K 1 solutions to the approximation problem (32) in the above heuristic arguments, we use solutions for a sample of size 2K+1 and with its odd and even sub-samples of size 2K. 6.5 Algorithm for the Unbiased Multilevel Monte Carlo We present a parallelized version using L processors, which works even when L = 1. Parallel computing reduces the variance of the estimator, so we suggest using as many processors as are available per run. Fix an integer m0 N such that 2m0+1 2d. For each processor l = 1, . . . , L we consider the following steps. i) Take a random (integer) draw, m l , from a geometric distribution with parameter ii) Given m l , take 2K l +1 i.i.d. draws from the d-dimensional vector ξi Q 1 . . . Q d , K l m0 + m l . Repeat this step independently for each i = 1, . . . , n. iii) Solve problem (32) using the first 2m0 i.i.d. draws of ξi for each i. Let θl,m0 denote a iv) Denote by bθ n (2K l +1), bθO n (2K l ), and bθE n (2K l ) any solution to the following optimiza- tion problems (all of which are based on sample average approximations as the Monte 10. For example, if p( ) is selected as a geometric distribution with parameter r, the expected computational cost will be n(1 r)/r. 11. To see why we require that r > 1/2, notice that, if the computational cost of evaluating Z(K ) (as in the heuristic description above) increases exponentially in K and takes the form C 2K, the expected computational cost will be X K=1 Cr(2(1 r))K = Cr(1/2(1 r)), provided 2(1 r) < 1, or equivalently, r > 1/2. As we show in the proof of Theorem 3, constraining the variance requires then imposing r < 3/4. Ultimately, optimizing the product of computational cost and variance leads to the optimal selection r = 1 2 3/2. Dropout Training is Distributionally Robust Optimal Carlo approximation 32): bθ n (2K l +1) arg min θ Θ 1 n k=1 ℓ(xi ξk i , yi, θ) bθO n (2K l ) arg min θ Θ 1 n k=1 ℓ(xi ξ2k 1 i , yi, θ) bθE n (2K l ) arg min θ Θ 1 n k=1 ℓ(xi ξ2k i , yi, θ) Intuitively, bθO n and bθE n denote the solutions to problem (32) but using a sample of size 2K l with only odd indices and even indices, respectively. v) Define K l bθ n (2K l +1) 1 2(bθO n (2K l ) + bθE n (2K l )) Z(K l ) = K l r(1 r)K l m0 + θl,m0. Our recommended estimator is l=1 Z(K l ). We now show that the suggested algorithm gives an estimator with desirable properties. We do so under the following regularity assumptions. Assumption 5 Suppose that the parameter space Θ is compact. Suppose in addition that the optimal solution θ n to the dropout training problem in (30) is (globally) unique. Assumption 6 Let bθ n (K) denote the solution of the problem in (32) based on K draws. Suppose that as K , E[ K 1 2 (bθ n (K) θ n ) 4 2) = O(1), where the expectation is taken over the i.i.d. dropout noise distribution used to generate ξk i . Assumption 7 Assume that, for each (X, Y, ξ), ℓ(X ξ, Y, ) is thrice continuously differ- entiable over Θ and that the Hessian matrix θθEQ [ℓ(X ξ, Y, θ n )] is non-singular. Blanchet, Kang, Olea, Nguyen and Zhang Theorem 3 Under Assumption 5, E[Z(K l )] = θ n . The number of random draws required to compute Z(K l ) is n 2K l +1 and thus the expected computational complexity for producing Z(K l ) equals n(2m0+1)r 2r 1 < n(2m0+1) n2d. Suppose, in addition, that bθ n (K) is almost surely in the interior of Θ for large enough K. If Assumptions 6 and 7 hold, and r < 3/4, then Var(Z(K l )) < . Proof See Appendix A.4. Our suggested algorithm has finite expected computational complexity that does not grow exponentially with the dimensionality d, thus, every time we need to obtain bθ n (2K l +1), we can do so by gradient descent. Combined with parallelization, the unbiased multi-level Monte Carlo approach produces an unbiased estimator with a variance that can be made arbitrarily small if L is large enough, provided that the regularity assumptions that give Var(Z(K l )) < hold. 7. Numerical Experiments We conduct numerical experiments in this section to compare our preferred implementation of dropout training to stochastic gradient descent, as well as our recommended selection of δ to cross-validation. The benefits of our suggested unbiased multi-level Monte Carlo algorithm are analyzed using high-dimensional regression, whereas our selection of δ is analyzed using a low-dimensional regression model. 7.1 Advantage of the Unbiased Multi-level Monte Carlo Estimator We present a simple numerical experiment to illustrate the advantage of using the unbiased multi-level Monte Carlo estimator suggested in Section 6.4. We consider the linear regression problem with known variance and we focus on solving the dropout training problem with our recommended δ chosen according to Proposition 2. Our simulation setting considers a linear regression model with a covariate vector having dimensionality d = 100 and sample size n = 50. We pick a known regression coefficient β0 Rd being a vector with all entries equal to 1. With fixed coefficients, we assume the covariate vector follows an independent Gaussian, as well as for the regression noise. More specifically, we can get our 50 observations (xi, yi) via sampling xi N(0, Id), i = 1, . . . , n, Dropout Training is Distributionally Robust Optimal sampling yi R conditional on xi, where yi is given by the linear assumption and εi are i.i.d. random noise following N(0, 102), for i = 1, . . . , n. Our simulation first considers a high-dimension setting (relatively low ratio between sample size per dimension n/d = 0.5) with high noise-to-signal ratio (variability on residual noise is high compared to the variability on xi). If we set Q0 to be the empirical distribution of {(xi, yi)n i=1}, the dropout training problem in the linear regression model is min β Rd EQ β (X ξ) Y 2 . Corollary 2 in Appendix A.5 shows that, in the linear regression model, the dropout training problem can be written as min β Rd 1 n (Y Xβ) (Y Xβ) + δ 1 δβ Λβ , where Y = [y1, y2, . . . , yn] , X = [x1, x2, . . . , xn] and Λ is the diagonal matrix with its diagonal elements given by the diagonals of X X. Moreover, there is a closed-form solution for the dropout training problem and it is given by the ridge regression formula: β n = X X + δ 1 δΛ 1 X Y. We choose the dropout probability δ following Proposition 2. More specifically, Propo- sition 2 suggests the choice δ = c/ n where c = z1 α σ/µ. For linear regression with known variance, it is straightforward to compute j=1 EP [X2 j ](β j )2, σ2 = Var P 1 2 log(2πφ ) + (Y (β ) X)2 Choosing α = 0.1 and note that β = β0, φ = 102, we have δ 0.26. Since neither our suggested multi-level Monte Carlo algorithm nor standard SGD (as defined in Section 6.2) uses closed-form formulae for their implementation, we analyze the extent to which these procedures can approximate the parameter β n. The two algorithms we compare are: Blanchet, Kang, Olea, Nguyen and Zhang Figure 1: l2 difference Standard SGD algorithm with a learning rate 0.0001 and initialization at the origin. Note that we take batched SGD instead of the single-sample SGD introduced in Section 6.2. Multi-level Monte Carlo algorithm with a geometric rate r = 0.6 and a burn-in period m0 = 5. Note that in each parallel run, we use gradient descent (GD) with 0.01 learning rate and initialization at origin for steps iii) and iv) in Section 6.4. We run our simulation on a cluster with two Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40 GHz processors (with 10 cores each), and a total memory of 128 GB. We fix 60 seconds as a wall-clock time , so that we terminate the two algorithms after 60 seconds.12 We run 1000 independent experiments. For each run, we calculate and report the average parameter estimation divergence to β n and 1-standard deviation error bar for the divergence. We consider different number of parallelizations (i.e., L in Section 6.4) from 400 to 2400. We cap the run at 2400 due to the saturation of divergence after 2000 parallelizations. Figure 1 shows the l2 divergence from the true β n of the two algorithms for varying L, while Figure 2 and Figure 3 show the l and l1 divergence, respectively. We observed that our unbiased estimator outperforms the standard SGD algorithm once the number of parallel iterations exceeds some moderate threshold ( 1000 here). We provide supporting evidence in Appendix A.7 to argue our choice of the learning rate, initialization, and wall- clock time, where our proposed algorithm is robust to any reasonable choices. 12. The parameters for the SGD algorithm are appropriately tuned to achieve good convergence within 60 s (see Appendix A.7 for the tuning procedure). We do not claim that this choice of parameters is optimal. Dropout Training is Distributionally Robust Optimal Figure 2: l difference Figure 3: l1 difference Blanchet, Kang, Olea, Nguyen and Zhang 7.2 Coverage of the True Loss of Dropout Training Here, we validate that our recommended selection of δ guarantees that the in-sample loss of dropout training is covering the true loss with arbitrarily high probability as prescribed by Proposition 2. We use the same linear regression model with dimensionality d = 10 and n {103, 104} training samples. We choose different quantiles of the normal as in Proposition 2. We also include 10-fold cross-validation and ordinary least squares for comparison; see Table 1, where we estimate the frequency of coverage over 1000 independent runs. The main message is that our suggested choice of δ guarantees that the in-sample loss of dropout training exceeds the true, unknown, population loss with probability 1 α. When using standard OLS or choosing δ by cross-validation, the in-sample loss is smaller than the population loss with probability close to 1/2, which implies that these methods are unsatisfactory in terms of frequency of coverage. α = 0.2 α = 0.1 α = 0.05 10-fold CV plain OLS n = 103 0.77 0.01 0.88 0.01 0.94 0.01 0.52 0.02 0.40 0.01 n = 104 0.79 0.01 0.90 0.01 0.94 0.01 0.49 0.02 0.47 0.02 Table 1: Frequency of in-sample loss covering the true population loss. Our recommended selection of δ = c/ n with c = z1 ασ/µ has a theoretical 1 α coverage probability. 8. Extensions In this section, we discuss the extent to which the decision-theoretic support for dropout training carries over to neural networks. The main idea is to use a GLM model where the natural parameter is no longer a linear function of the covariates, but instead a neural network. We note that dropout training in neural networks typically drops neuron activa- tions in both the input layer and the hidden layer. Dropout only at the input layer (data augmentation) is more closely related to our analysis in the context of GLM models and is also considered in the literature, for example by Devries and Taylor (2017) and Park et al. (2019) for vision and speech recognition tasks. In the remaining part of this section, we consider feed-forward neural network with one hidden layer, and consider dropout noise in both the input and hidden layers. We also note that there are already concrete recommendations for how to select the dropout probability in neural networks. For example, Goodfellow et al. (2016, p. 257) says, The probability of sampling a mask value of one (causing a unit to be included) is a hyperparameter fixed before training begins. It is not a function of the current value of Dropout Training is Distributionally Robust Optimal the model parameters or the input example. Typically, an input unit is included with a probability 0.8, and a hidden unit is included with a probability 0.5. We then run forward propagation, back-propagation, and the learning update as usual. It is not entirely clear to us what the theoretical support is for this recommendation. We think that our results for choosing δ in GLM models (which depends on the sample size, and on the estimated values of the parameters of the model) provide, at the very least, a principled alternative to select the dropout probability of features in the hidden layer that precedes the output layer. Our simulations for GLM models (Table 1) suggest that this strategy indeed gives good bounds on generalization error. 8.1 One-hidden-layer Feed-Forward Neural Networks Suppose the scalar response variable Y is generated by the conditional density f(Y |X, θ, φ) h(Y, φ) exp ((Y Ωθ(X)) Ψ(Ωθ(X))) /a(φ)) , (33) where Ωθ(X) is a neural network with parameters θ and X Rd. This is a simple extension of the regression model that has been used recently to study deep neural networks see Schmidt-Hieber (2020) in which the conditional density is Gaussian. In this section, we will assume that Ωθ(X) is a neural network with a single hidden layer, a differentiable activation (squashing) function, and a linear output function. A function h : R [0, 1] is a squashing function if it is non-decreasing and if lim r h(r) = 1, lim r h(r) = 0. For further detail, see Hornik et al. (1989, Definition 2.3). Although these types of networks which will be formally described below are restric- tive compared to the modern deep learning architectures, they can approximate any Borel measurable function from a finite-dimensional space to another, provided the hidden units in the hidden layer are large (Hornik et al., 1989). Consider a neural network with K units in the hidden layer, each using input weights wk Rd, k = 1, . . . , K. Denote the activation function in the hidden layer as h( ). Assume a linear output function with a vector of weights β RK. Thus, the network under consideration is defined by the function: Ωθ(X) β1h(w 1 X) + . . . + βKh(w KX) = β H(X), Blanchet, Kang, Olea, Nguyen and Zhang where H(X) = (h(w 1 X), . . . , h(w KX)) . The neural network is parameterized by θ (β , w 1 , . . . , w k ) . Under this model, the distribution of Y |X is a GLM model with co- variates H(X). 8.1.1 Statistician s Objective Function We will endow the statistician with the loss function given by the negative of the conditional log-likelihood for the model in Equation (33). 8.1.2 Nature s Uncertainty Set We allow nature to introduce additional noise to the statistician s model. We do this in two steps. First, we allow nature to distort the distribution of X using a multiplicative noise denoted as ξ(1) Rd. This is exactly analogous to what we did in the GLM model, where nature was allowed to pick a distribution for the covariates of the form (X ξ(1)). We allow nature to contaminate the input layer with independent and multiplicative noise. Second, we also allow nature to contaminate each of the hidden units with multiplicative noise ξ(2) RK. That is, nature is also allowed to pick a vector ξ(2) = (ξ(2)1, . . . , ξ(2)K) , independently of ξ(1) Rd, to distort the each of the K units in the hidden layer as H(X) ξ(2) (h(w 1 X)ξ(2)1, . . . , h(w KX)ξ(2)K) . Our choice of a one-hidden neural network was simply for expositional simplicity, but the analysis would be the same with a feed-forward neural network with L hidden layers. 8.1.3 Minimax Solution The minimax solution of the DRO game is given by inf θ sup Q EQ [ ln f(Y |H(X ξ1) ξ2, β, φ)] , (34) where Q now refers to the joint distribution of (X, Y, ξ(1), ξ(2)) and f(Y |X, β, φ) is the GLM density defined in (1). We continue working with the assumption that ξ (ξ(1) , ξ(2) ) has independent marginals and that it is independent of (X, Y ). We would like to solve for the worst-case distributions of the random vectors ξ(1) and ξ(2), assuming that both of these satisfy the restrictions analogous to Equation (11). The solution for the distribution of ξ(2) can be obtained as a corollary to Theorem 1, as it suffices to define X H(X ξ(1)) Dropout Training is Distributionally Robust Optimal and to view (34) as the DRO problem in a linear regression model, in which the data is ( X, Y ) and ξ(2) RK is simply the multiplicative noise that transforms the covariates into The worst-case choice of ξ(1), the multiplicative error for the inputs, is more difficult to characterize and we were not able to find general results for it. Below, we provide a heuristic argument suggesting that dropout noise might approximate the worst-case choice when the output layer is a Gaussian linear model. Let ξ(1)j denote the j-th coordinate of ξ(1). Suppose that the distribution of this random variable places most of its mass on the interval [1 ϵ, 1 + ϵ].13 This allows us to linearize the output of each of the hidden units around the output corresponding to unperturbed inputs as h(w k (X ξ(1))) = h(w k (X (ξ(1) 1)) + w k X) h(w k X) + h(w k X) (wk X) (ξ(1) 1) . In the notation above, 1 denotes the d-dimensional vector of ones. For the sake of exposition, ignore the approximation error in the linearization above. If we fix (X, Y, ξ(2)), then the worst-case choice for the distribution of ξ(1), denoted by Q(1), maximizes k=1 βk ξ(2)k j=1 wk,j Xj (ξ(1)j 1) among all distributions with independent marginals for which EQ(1)[ξ(1)j] = 1 for all j = 1, . . . , d. It can then be shown that such a maximization problem is equivalent to maximizing k=1 βk ξ(2)k h(w k X) j=1 wk,j Xj (ξ(1)j 1) which, in turn, can be written as a (ξ(1) 1) 2 for an appropriate choice of a vector a Rd that depends only on (β, ξ(2), h, h, w, X). Lemma 2 in Appendix A.2 shows that the solution to this problem is dropout noise. 13. This is compatible with dropout noise for which δ is very close to zero. Blanchet, Kang, Olea, Nguyen and Zhang 9. Concluding Remarks This paper examines dropout training, an increasingly popular estimation method in ma- chine learning. Dropout training is a fundamental part of modern machine learning tech- niques for training very deep networks (Goodfellow et al., 2016). Our main result (Theorem 1) established a novel decision-theoretic foundation for the use of dropout training. We showed that this method, when applied to generalized linear models, can be viewed as the minimax solution to an adversarial two-player, zero-sum game between a statistician and nature. The framework used in this paper is known in the stochastic optimization literature (Shapiro et al., 2014) as a distributionally robust optimization (DRO) problem. Our minimax result showed, by construction, that dropout training indeed provides out-of-sample performance guarantees for distributions that arise from multiplicative per- turbations of the in-sample data. Our result thus justified, explicitly, the ability of dropout training to enhance a predictor s out-of-sample performance, which is one of the reasons often invoked to promote the dropout method. In addition to our theoretical result, we also suggested a new strategy to select the dropout probability and a new stochastic optimization implementation of dropout training. For the latter, we borrowed ideas from the multi-level Monte Carlo literature in particular from Blanchet et al. (2019a) to suggest an unbiased dropout training routine that is easily parallelizable and that has a smaller computational cost than na ıve dropout training meth- ods when the number of features is large (Theorem 3). Crucially, we showed that under some regularity conditions, our estimator has finite variance (which means that there are also theoretical, and not just practical, gains from parallelization). The connection between dropout training and the multiplicative errors-in-variables model established in this paper is novel. We think this connection is potentially interesting because multiplicative errors have found a range of applications in empirical work across various dis- ciplines, from economics to epidemiology. For example, Alan et al. (2009) uses it to account for measurement error in consumption data when estimating the elasticity of intertemporal substitution via Euler equations. Pierce et al. (1992) and Lyles and Kupper (1997) use it to relate health outcomes to the exposure of a chemical toxicant that is observed with error. Moreover, due to privacy considerations, statistical agencies such as the U.S. Census Bureau sometimes mask data using multiplicative noise, as discussed by Kim and Winkler (2003) and Nayak et al. (2011). Examples of data sets that contain variables masked with multiplicative noise include the Commodity Flow Survey Data (2017) and the Survey of Business Owners (2012) both from the U.S. Census Bureau and the U.S. Energy Infor- mation Administration Residential Energy Consumption survey. Applications of dropout training in these contexts could be an interesting area for future work. Dropout Training is Distributionally Robust Optimal We also discussed the extent to which our theoretical results extended to Neural Net- works (in particular, to the universal approximators in Hornik et al. 1989 consisting of a single-hidden layer and a squashing activation function). Our results showed that Theorem 1 can be used to establish the optimality of dropout training to estimate the parameters of the last hidden layer in general feed-forward neural networks, where the output layer takes the form of a generalized linear model. We hope that our analysis serves as a foundation to understand the benefits of dropout training in neural networks. Acknowledgments We would like to thank Matias Cattaneo, Max Farrell, Michael Leung, Ulrich M uller, Mark Peletier, Hashem Pesaran, Ashesh Rambachan, Roger Moon, Frank Schorfheide, Stefan Wager, and participants at the Statistics Seminar series at Columbia University for helpful comments and suggestions. Jos e Blanchet acknowledges support from NSF grants 1820942, 1838576, 1915967, 2118199, 2229012, 2312204 and the Chinese Merchant Bank. Material in this paper is based upon work supported by the Air Force Office of Scientific Research under award number FA9550-20-1-0397. Viet Anh Nguyen acknowledges the support from the CUHK s Improvement on Competitiveness in Hiring New Faculties Funding Scheme. Blanchet, Kang, Olea, Nguyen and Zhang Appendix A. This appendix collections all the proofs of main results, as well as additional theoretical and empirical results. A.1 Probability Limit of the Dropout Training Estimator for β Proposition 1 Suppose that Assumptions 2 and 3 hold. Let the data {(xi, yi)}n i=1 consist of n i.i.d. draws from a distribution P . Then for any sequence δn δ [0, 1) as n , bβ(δn) converges in probability to β (δ) arg minβ EP [Eδ [ ln f(Y |X ξ, β, φ)]] , (36) where the minimizer in (36) is unique and does not depend on φ. Proof The dropout estimator of β maximizes i=1 yi(β xi) Eδn[Ψ(β (xi ξ))]. In a slight abuse of notation, let ξδ denote a realization of dropout noise parameterized by δ. Then it is possible to re-write the objective function as a weighted average of the functions Qn,ξδn(β) 1 i=1 yi(β xi) Ψ(β (xi ξδn)). It will be convenient to define the limiting objective function to be Q(β) EP [Y (β X)] EP [Eδ[Ψ(β (X ξ))]], which, by Assumptions 1 and 2, is finite and strictly concave. The population objective function is then the average (over dropout noise) of Qξδ(β) EP [Y (β X)] EP [Ψ(β (X ξδ))]. It is straightforward to show that β (δ) in Equation (36) denotes the unique maximizer of The proof of the proposition follows from standard arguments in the theory of extremum estimators. In particular, it suffices to verify the conditions of Theorem 2.7 in Newey and Mc Fadden (1994). Dropout Training is Distributionally Robust Optimal Condition i) in Newey and Mc Fadden (1994) requires Q(β) to be uniquely maximized at β (δ). This holds because Assumptions 1 and 2 imply that Q(β) is strictly concave. Condition ii) in Newey and Mc Fadden (1994) requires β (δ) to be an element in the interior of a strictly convex set, which holds because, in the GLM models under consider- ation, the parameter space is Rd. Furthermore, Qn(β) is trivially concave by Assumption Condition iii) requires Qn(β) to converge in probability to Q(β) for every β. For this purpose, it suffices to show that Qn,ξδn(β) converges in probability to Qξδ(β) for each fixed β, and for a sequence ξδn and ξδ that have zeros and non-zeros in exactly the same entries. Assumptions 1 and 2 imply EP [Y (β X)] < for all β. Thus, using the Law of Large Numbers for i.i.d. sequences, i=1 Yi(β Xi) p EP [Y (β X)]. Finally, Assumptions 1 and 2 imply that the triangular array Zn,i = Ψ(β (Xi ξδn)), 1 i n, satisfies the conditions for the Law of Large Numbers for triangular arrays (Durrett, 2019, Theorem 2.2.11), and, consequently, i=1 Ψ(β (Xi ξδn)) p EP h Ψ(β (X ξδ)) i . This completes the proof. A.2 Proof of Theorem 1 The proof of Theorem 1 relies on the following two preparatory results. Lemma 1 (Extremal expectation of a univariate convex function) For any < a < b < + , let ζ be a random variable in [a, b] with mean µ [a, b]. For any convex, continous function f : [a, b] R, the distribution of ζ that maximizes E[f(ζ)] among all distributions over [a, b] with a given mean µ [a, b] is a scaled and shifted Bernoulli distri- bution, i.e., a with probability (b µ)/(b a), b with probability (µ a)/(b a). (37) Blanchet, Kang, Olea, Nguyen and Zhang Proof Let Q denote the probability measure induced by the random variable in (37). By EQ [f(ζ)] = b µ b af(a) + µ a Suppose first that µ = a. In this case, Jensen s inequality implies that for any other probability measure Q over [a, b] with mean µ = a, EQ[f(ζ)] f(EQ[ζ]) = f(a) = EQ [f(ζ)]. An analogous result holds if µ = b. Consider then the case in which µ (a, b). For an arbitrary probability measure Q over [a, b] with mean µ (a, b), we have [a,b] f(ζ)d Q = Z [a,b] f ab ζ b af(a) + ζ a b a f(b) d Q, where the inequality follows from the convexity of f. By the linearity of the integral operator and the fact that R [a,b] ζd Q = µ, we find [a,b] f(ζ)d Q b µ b af(a) + µ a Because the probability measure Q was chosen arbitrarily, this implies that the distribution of ζ in Equation (37) maximizes the expectation of f(ζ). Lemma 2 Fix a vector of tuning parameters δ (0, 1)d. Let Qj(δj) be defined as in (10). Suppose that A is a convex and continuous function on R. For any θ Rd, we have sup n EQ1 ... Qd[A(θ ξ)] : Qj Qj(δj) o = EQ 1 ... Q d [A(θ ξ)], where Q j is a scaled Bernoulli distribution of the form Q j = (1 δj) 1 Bernoulli((1 δj)) for each j = 1, . . . , d. Proof First, note that Q j Qj(δj) for each j, and thus Q 1 . . . Q d is a feasible solution to the maximization problem. It suffices to show that, for any set of feasible measures Qj Qj(δj), j = 1, . . . , d, we have EQ1 ... Qd[A(θ ξ)] EQ 1 ... Q d [A(θ ξ)]. Dropout Training is Distributionally Robust Optimal Towards this end, pick any k {1, . . . , d}. By Fubini s theorem, we can write EQ1 ... Qd[A(θ ξ)] = EQ1 ... Qk 1 Qk+1 ... Qd EQk[A(θ ξ)]. For any fixed value (ξ1, . . . , ξk 1, ξk+1, . . . , ξd) the function ξk 7 A(P j =k θjξj + θkξk) is convex in the variable ξk over the interval [0, (1 δk) 1]. Thus by Lemma 1, j =k θjξj +θkξk)] EQ k [A( X j =k θjξj +θkξk)] for any fixed (ξ1, . . . , ξk 1, ξk+1, . . . , ξd). Thus by the monotonicity of the expectation operator, EQ1 ... Qd[A(θ ξ)] EQ1 ... Qk 1 Qk+1 ... Qd EQ k [A(θ ξ)] = EQ1 ... Qk 1 Q k Qk+1 ... Qd[A(θ ξ)]. By cycling through all possible values of k {1, . . . , d} we conclude that EQ1 ... Qd[A(θ ξ)] EQ 1 ... Q d [A(θ ξ)]. Therefore, the postulated claim holds. We are now ready to prove Theorem 1. Proof Note that for Q U(Q0, δ), Assumption ?? implies EQ[ℓ(X ξ, Y, θ)] is finite for any θ Θ and any scalar δ [0, 1). Therefore, from Fubini s theorem and the definition of the loss function: EQ[ℓ(X ξ, Y, θ)] = EQ0 [EQ1 ... Qd[ℓ(X ξ, Y, θ)]] = EQ0 h EQ1 ... Qd[ ln h(Y, φ) + (Ψ(β (X ξ)) Y (β (X ξ)))/a(φ)] i = EQ0 [ln h(Y, φ)] + EQ0 h EQ1 ... Qd[(Ψ(β (X ξ)) Y (β (X ξ)))/a(φ)] i . It can then be shown that, for any β, X and ξ: β (X ξ) = (β X) ξ. Thus, we can fix the values of (X, Y, θ) and define the function A(X,Y,θ)((β X) ξ) (Ψ(β (X ξ)) Y β (X ξ))/a(φ). Blanchet, Kang, Olea, Nguyen and Zhang Note that A(X,Y,θ) satisfies the condition of Lemma 2. Therefore sup n EQ1 ... Qd[A(X,Y,θ)((β X) ξ)] : Qj Qj(δj) o = EQ 1 ... Q d [A(X,Y,θ)((β X) ξ)], for any (X, Y, θ), which completes the proof. A.3 Proof of Theorem 2 Proof We write n(Ln(bβ(δn), bφn, δn) L(β , φ )) as the sum of the following three terms n Ln(bβ(δn), bφn, δn) Ln(β , bφn, δn) , (38a) n Ln(β , bφn, δn) Ln(β , bφn, 0) , (38b) n Ln(β , bφn, 0) L(β , φ , 0) . (38c) By Assumption 4, the last term converges in distribution to a normal random variable, so we only need to analyze terms (38a) and (38b). We establish the proof following the four steps outlined in the main body of the paper. Step 1: The same arguments as in Theorem 3.1 in Newey and Mc Fadden (1994) we can show that for any sequence δn = c/ n n(bβ(δn) β ) d Σ(β ) 1Nd( c µ, a(φ )Σ(β )), Σ(β) EP [ Ψ(X β)XX ], ξ A EP [ Ψ((X ξ) β )(X ξ)] (d 1)EP [Y X] + Σ(β )β . The set A above is defined as {ξ {0, 1}d : exactly one entry of ξ is zero}. The argument is essentially the same as in every proof of asymptotic normality for extremum (or M- estimators), with the only difference being that, because of the dropout noise, the score term is asymptotically normal with a nonzero mean. In fact, βLn(β, bφn, δn) βLn(β, bφn, 0)+ 1 i=1 Eδn[(Xi ξ) Ψ(β (Xi ξ))] Xi Ψ(X i β) Dropout Training is Distributionally Robust Optimal βLn(β, bφn, 0) 1 i=1 Xi(Yi Ψ(X i β)). Recognizing the term βLn(β, bφn, 0) as the negative of the score function in the GLM model and doing some algebra, it is possible to show that βLn(β, bφn, δn) is op(1). Therefore (38a) Step 2: For the term in (38b), note first that it is nonnegative. Also: Ln(β , bφn, δn) Ln(β , bφn, 0) equals i=1 Eδn[Ψ((Xi ξ) β )] Ψ(X i β ) The term in parenthesis has a finite mean equal to n EP Eδn[Ψ((X ξ) β )] EP [Ψ(X β )]. (39) It can be shown by verifying the conditions for the Law of Large Numbers for triangular arrays (Theorem 2.2.11 in Durrett 2019) that n(Ln(β , δn) Ln(β , 0) a(bφn) 1 n) p 0. Moreover, Assumptions 1 and 2 imply n n p , where ξ A EP [Ψ((X ξ) β )] d EP [Ψ(X β )] + EP [ Ψ(X β )X β ] Step 3: Since, by Assumption, the term in (34c) is asymptotically normal, then we have shown that n(Ln(bβ(δn), bφn, δn) L(β , φ )) d N( /a(φ ), σ2). Step 4: Finally, we show that n L((bβ(δn), bφn)) L(β , φ ) = OP (1). We have defined Ln(β, φ, δ) 1 i=1 Eδ [ ln f(yi|xi ξi, β, φ)] , L(β, φ) EP [ ln f(Y |X, β, φ)] . Blanchet, Kang, Olea, Nguyen and Zhang Let (β , φ ) be a stationary point of L in the interior of the parameter space, so that β L(β , φ ) = 0, φL(β , φ ) = 0. Further, let bφn denote an arbitrary n-consistent, asymptotically normal estimator of φ . Under our assumptions, L continuously differentiable. Then, by Taylor expansion, L(bβ(δn), bφn) L(β , φ ) = β L( β, φ)(bβ(δn) β ) + φL( β, φ)(bφn φ ), where ( β, φ) lies on the line between (bβ(δn), bφn) and (β , φ ). We have shown that n(bβ(δn) β ) d Σ(β ) 1Nd( c µ, a(φ )Σ(β )). Also we have that β L( β, φ) 0, φL( β, φ) 0 in probability, since ( β, φ) (β , φ ) in probability and (β , φ ) is a stationary point of L. Thus n(L(bβ(δn), bφn) L(β , φ )) 0, in probability by Slutsky s theorem. Moreover, because L is twice continuously differen- tiable, then n(L(bβ(δn), bφ) L(β , φ )) = Op(1). A.4 Proof of Theorem 3 Proof By definition Z(K l ) = K l r(1 r)m l + θl,m0, where K l is a discrete random variable with probability mass function: p(K l ) = r(1 r)K l m0, and supported on the integers larger than m0. Dropout Training is Distributionally Robust Optimal We first show that the estimator Z(K l ) is unbiased (as we average over both K l and ξk i ). Algebra shows that E[Z(K l )] = K=m0 E[Z(K l )|K l = K]p(K) " K l p(K l ) + θl,m0 p(K) + θl,m0 K=m0 E bθ n (2K+1) 1 2(bθO n (2K) + bθE n (2K)) E[bθO n (2m0)] + E[bθE n (2m0)] + E[θl,m0] + lim K E[bθ n (2K+1)]. The expectations in the last line are all finite because Θ is compact. In addition, since the draws are i.i.d. and θl,m0 is the solution to the problem (32) when 2m0 draws are used we E[bθO n (2m 0 )] + E[bθE n (2m 0 )] + E[θl,m0] = 0. Moreover, the sequence of random variables {bθ n (2K+1)} is uniformly integrable, because Θ is a compact subset of a finite-dimensional Euclidean space. Finally, we know that bθ n (2K+1) p θ n as K . The uniform integrability of the sequence of estimators then implies lim K E[bθ n (2K+1)] = E lim K bθ n (2K+1) = θ n , see Theorem 6.2 in Das Gupta (2008). We conclude that E[Z(K l )] = lim K E[bθ n (2K+1)] = θ n . Now we show that the expected computational cost of Z(K l ) is finite. In order to com- pute Z(K) for a given K we need n 2K+1 random draws. Thus, the expected computational Blanchet, Kang, Olea, Nguyen and Zhang cost of Z(K l ) is K=m0 n2K+1r(1 r)K m0 = n (2m0+1) r K=m0 2K m0(1 r)K m0 = n (2m0+1) r K=m0 (2(1 r))K m0. The term above converges to n (2m0+1) r 1 2(1 r) = n (2m0+1) r provided that 2(1 r) < 1, which holds because we have chosen r > 1/2. For the proof of finite variance, we intend to show that E h K K i = O(2 2K) (40) as K . Equation (40) guarantees that every processor generates an estimator Z(K l ) with finite variance. Since K l is a discrete random variable with probability mass function p(K l ) = r(1 r)K m0, E[Z(K l ) Z(K l )] = K=m0 E h Z(K l ) Z(K l )|K l = K i p(K) p(K) + θl,m0 p(K) + θl,m0 K=m0 E K K p(K)2 K=m0 E h θ l,m0θl,m0 i p(K) p(K) + sup θ Θ θ 2 2p(K) 1 22m022(K m0)p(K) + sup θ Θ θ 2 2p(K) 1 r4m0 1 (4(1 r))K m0 Dropout Training is Distributionally Robust Optimal The geometric sum in the last expression is finite because we have assumed that r < 3 To show Equation (40), we do a Taylor expansion of the first-order conditions of the problem (32) around θ n . The Karush Kuhn Tucker optimality condition for the level-2K solution bθ n (2K) of the problem in expression (32) implies k=1 θℓ(xi ξk i , yi, bθ n (2K)) It follows by Taylor expansion and Assumption 4 that k=1 θℓ(xi ξk i , yi, θ n ) k=1 θθℓ(xi ξk i , yi, θ n ) bθ n (2K) θ n k=1 θℓ(xi ξk i , yi, θ n ) i=1 θθEQ h ℓ(X ξ, Y, θ n )|X = xi, Y = yi i ] bθ n (2K) θ n +RK + RK,θ, (41) k=1 θθℓ(xi ξk i , yi, θ n ) θθEQ h ℓ(X ξ, Y, θ n )|X = xi, Y = yi i bθ n (2K) θ n , i=1 sup θ Θ,ξ θθθℓ(xi ξ, yi, θ) 2 bθ n (2K) θ n 2 2 C3 bθ n (2K) θ n 2 by Assumption 4. Thus, by Assumption 3, we have E[R K,θRK,θ] = O(2 2K) Blanchet, Kang, Olea, Nguyen and Zhang as K . Moreover, by the multivariate version of Theorem 2 in Bahr (1965) which follows from the Cram er Wold theorem, we have that k=1 θθℓ(xi ξk i , yi, θ n ) θθEQ h ℓ(X ξ, Y, θ n )|X = xi, Y = yi i is O(2 2K). We can express R KRK as RK 2. The Cauchy Schwarz inequality implies k=1 θθℓ(xi ξk i , yi, θ n ) θθEQ h ℓ(X ξ, Y, θ n )|X = xi, Y = yi i bθ n (2K) θ n 2 By H older s inequality we have k=1 θθℓ(xi ξk i , yi, θ n ) θθEQ h ℓ(X ξ, Y, θ n )|X = xi, Y = yi i E bθ n (2K) θ n 4 Finally, consider the solutions bθ n (2K l +1), bθO n (2K l ), bθE n (2K l ) conditional on K l = K. De- note the remainder terms in Equation (41) corresponding to the level 2K+1 solution bθ n (2K+1) as R K+1, R K+1,θ. Similarly, denote the remainder terms in Equation (41) corresponding to the level-2K solution bθO n (2K) (and, respectively, bθE n (2K)) as RO K, RO K,θ ( RE K, RE K,θ). By the construction of bθO n (2K), bθE n (2K) using odd and even indices, we have, from Equation (41) i=1 θθEQ [ℓ(X ξ, Y, θ n )|X = xi, Y = yi] bθ n (2K+1) 1 2(bθO n (2K) + bθE n (2K)) 2(RO K + RE K) + R K+1,θ 1 2(RO K,θ + RE K,θ). Dropout Training is Distributionally Robust Optimal By Assumption 4, i=1 θθEQ [ℓ(X ξ, Y, θ n )|X = xi, Y = yi] = n θθEQ [ℓ(X ξ, Y, θ n )] is invertible. Thus, we have shown that K bθ n (2K+1) 1 2(bθO n (2K) + bθE n (2K)) = n θθEQ [ℓ(X ξ, Y, θ n )] 1 R K+1 1 2(RO K + RE K) + R K+1,θ 1 2(RO K,θ + RE K,θ) . Since each of the terms on the right-hand side has been shown to be O(2 2K), we conclude that E[ K K] = O(2 2K). A.5 Dropout Training in Linear Regression Here, diag(M) denotes the diagonal matrix formed by the diagonal elements of M. Corollary 2 (Linear regression with φ = 1) For linear regression with ℓ(x, y, β) = (β x y)2, we have min β Rd max Q U(b Pn,δ) EQ h β (X ξ) Y 2i = min β Rd EQ h β (X ξ) Y 2i , where Q = b Pn Q 1 . . . Q d and Q j = (1 δ) 1 Bernoulli(1 δ) for each j = 1, . . . , d. min β Rd EQ h β (X ξ) Y 2i = min β Rd 1 n (Y Xβ) (Y Xβ) + δ 1 δβ Λβ , (42) where Λ = diag(X X) which implies that the dropout training estimator equals bβ(δ) = X X + δ 1 δ diag(X X) 1 X Y. Finally, if EP [XX ] is a diagonal matrix with strictly positive entries then bβ(δ) p (1 δ)EP [XX ] 1EP [XY ]. Proof The first part of the corollary follows directly from (12) and (13) in our main theorem. The second part of the corollary follows from Proposition 1. According to this Blanchet, Kang, Olea, Nguyen and Zhang proposition, the limit of bβ(δ) is β (δ) = EP [XX ] + (δ/1 δ) diag(EP [XX ]) 1 EP [Y X]. (43) Thus, if EP [XX ] is a diagonal matrix, we obtained the desired limit. One interesting implication of Equation (43) is that β (0) which gives the probability limit of the maximum likelihood estimator generally differs from β (δ) when δ = 0, which is the probability limit of the dropout estimator.14 In particular, the estimator in Equation (43) differs from the best linear predictor of y using x as long as EP [Y X] = 0 and δ = 0. This estimator differs from Ridge regression in that diag(EP [XX ]) replaces the identity matrix (and this simple adjustment makes the estimator scale equivariant). A.6 Additional Theoretical Results A.6.1 Convexity of f is necessary for the optimality of dropout noise Theorem 4 Let f : R2 R be a function such that f(y, ) : R R admits bounded and continuous second-order derivatives for any given y R. If, for some δ (0, 1], all d 1, all β Rd and all reference distributions Q0 (where (X, Y ) Q0), it holds that sup Q U(Q0,δ) EQ[f(Y, β X ξ)] = EQ [f(Y, β X ξ)], (44) then f(y, ) is a convex function on the real line for any given y. Proof Since (44) holds for any reference distribution Q0, we consider the case where Q0 is given by the Dirac measure on the point (x, y), where x Rd, y R. Since dropout is the solution to problem (44), we have that the dropout noise on the last coordinate, viz., Q d = (1 δ) 1 Bernoulli(1 δ), solves max Qd Qd(δ) EQd h EQ 1 Q d 1 h f(y, β x ξ) ii . φ(ξd) = EQ 1 Q d 1 j=1 βjxjξj + βdxdξd) 14. Relatedly, Farrell et al. (2021) who study deep neural networks and their use in semiparametric inference report that their numerical exploration of dropout increased bias and interval length compared to nonregularized models. Dropout Training is Distributionally Robust Optimal where φ is implicitly indexed by β, x and y. By duality in semi-infinite linear programming (see Theorem 1 and Equation 4 in Isii 1962), there exists an affine function L( ) such that φ(ξd) L(ξd) ξd [0, 1 1 δ] with equality at ξd = 0 or 1 1 δ. Let ξd = (1 α) 1 1 δ for α [0, 1], we have that φ((1 α) 1 1 δ) L((1 α) 1 1 δ) = L(α 0 + (1 α) 1 1 δ) = αL(0) + (1 α)L( 1 1 δ) = αφ(0) + (1 α)φ( 1 1 δ). Since β, x, y are arbitrary, this is equivalent to saying that, for all (z1, . . . , zd 1), z, y and (i1,...,id 1) {0,1}d 1 p(i1, . . . , id 1)f(y, j=1 zjij + (1 α)z) (i1,...,id 1) {0,1}d 1 p(i1, . . . , id 1)f(y, (i1,...,id 1) {0,1}d 1 p(i1, . . . , id 1)f(y, j=1 zjij + z), where p(i1, . . . , id 1) is the probability mass function for independent Bernoulli trials i1, . . . , id 1. Rearranging the inequality, we have (i1,...,id 1) {0,1}d 1 p(i1, . . . , id 1) j=1 zjij + (1 α)z) f(y, (i1,...,id 1) {0,1}d 1 p(i1, . . . , id 1) j=1 zjij + z) f(y, By second-order Taylor expansion around z = 0, we have j=1 zjij + (1 α)z) f(y, j=1 zjij) = fx(y, j=1 zjij)(1 α)z + O(1)(1 α)2z2, Blanchet, Kang, Olea, Nguyen and Zhang j=1 zjij + z) f(y, j=1 zjij) = fx(y, j=1 zjij)z + 1 j=1 zjij) + o(1) where fx, fxx denote partial derivatives of f with respect to the second argument. Therefore, we have the former inequality becomes (by the cancellation of terms) (i1,...,id 1) {0,1}d 1 p(i1, . . . , id 1)O(1)(1 α) (i1,...,id 1) {0,1}d 1 p(i1, . . . , id 1) j=1 zjij) + o(1) Thus letting α 1, and then letting z 0, we obtain (i1,...,id 1) {0,1}d 1 p(i1, . . . , id 1)fxx(y, j=1 zjij) 0, for all zj R, j. Now we consider z1 = = zd 1 = c R and denote Sd = Pd 1 j=1 ij. Thus Sd follows the binomial distribution B(d 1, p) where p = (1 δ). We rewrite the previous inequality ESd B(d 1,p)[fxx(y, c Sd)] 0. Let us consider c = c (d 1)p, and rewrite c Sd = c (d 1)p(Sd p(d 1)) + c. By the standard Central Limit Theorem, d 1 N(0, p(1 p)), where denotes convergence in law. Therefore, we have that c Sd converges to the Dirac measure at c as d . Since fxx(y, ) is bounded and continuous, we conclude that fxx(y, c) 0 for arbitrary c R. The convexity of f(y, ) then follows. Dropout Training is Distributionally Robust Optimal Corollary 3 Consider the single index models with f(y, β x) = (y g(β x))2. Then sup Q U(Q0,δ) EQ[(Y g(β X ξ))2] = EQ [(Y g(β X ξ))2] for all δ (0, 1], d 1, all β Rd and all reference distribution Q0 (where (X, Y ) Q0) if and only if g is a linear function. Proof The if part is given by Theorem 1 of the paper, as f(y, u) = (y g(u))2 is convex if g is linear. For the only if part, we know that fxx(y, u) = 2yg (u) + 2g(u)g (u) + 2(g (u))2 0 Suppose that g (u0) = 0 for some u0 R, then, by choosing a sufficiently large (in absolute value) y, we have fxx(y, u0) < 0, a contradiction. Thus g (u) = 0 for all u, which implies that g must be linear. A.6.2 Convexity is not sufficient for the optimality of dropout noise Consider a slight generalization of the set U used in Theorem 1 by defining Qj(δ) = {Qj : Qj([0, (1 δ) 1]) = 1, EQj[ξj] = 1, Qj([0, ϵ]) = δ} U(Q0, δ) = {Q0 Q1 . . . Qd : Qj Qj(δ)}. The only difference between U(Q0, δ) and our previous construction is that we force the random variables Qj to have positive mass on the interval [0, ϵ]. Consider the problem sup Q U(Q0,δ) EQ[f Y, β (X ξ) ] where f(y, ) : R R is a convex function for every y. Just as we did in the proof of Theorem 1, we fix ξ2, . . . , ξd, and try to solve sup Q1 Q1(δ) EQ0 Q1 Y, β1X1ξ1 + j=2 βj Xjξj We argue that the optimal solution Q1 in general depends on the values of ξ2, . . . , ξd and, more importantly, Q1 may fail to include zero in its support. This last point implies that Blanchet, Kang, Olea, Nguyen and Zhang Q1 cannot be interpreted as dropout noise (as there is no sense in which the first variable is ever dropped out). We make this point by showing a simple example. Consider Q0 is the Dirac measure on (x, y) with y > 0 and consider f(y, z) = (y z+)2 (i.e., square loss with the so-called Re Lu activation function). Now, f(y, β1x1ξ1 + Pd j=2 βjxjξj) = ((β1x1ξ1 + Pd j=2 βjxjξj)+ y)2 = β2 1x2 1((ξ1 + z)+ y)2 where we assume β1x1 > 0. Thus, the problem is equivalent to sup Q1 Q1(δ) [((ξ1 + z)+ y)2]. If z ( 0.99, ϵ), then by Isii (1962, Theorem 1 ), the problem has the dual inf a,b,c a + b + cδ s.t. ((ξ1 + z)+ y)2 a + bξ1 + c1{ξ1 [0,ϵ]} ξ1 0, 1 1 δ and moreover the optimal solution Q 1 is supported in the set ξ1 0, 1 1 δ : ((ξ1 + z)+ y)2 = a + b ξ1 + c 1{ξ1 [0,ϵ]} where a , b , c constitute the solution to the dual problem. We claim that 0 is not in the support of Q 1. Otherwise, ξ1 = 0 need to satisfy ((ξ1 + z)+ y)2 = a + b ξ1 + c 1{ξ1 [0,ϵ]}. Since the left-hand side is constant for ξ1 in a neighborhood of 0, we have that b 0. Since 0.99 < z < ϵ, we have that, for ξ1 > 0.99, ((ξ1 + z)+ y)2 < y2 while a + b ξ1 + c 1{ξ1 [0,ϵ]} > y2. Hence, the support of Q 1 is a subset of [0, 0.99], contradicting the requirement that Q 1 has mean 1. We conclude that, while convexity is somewhat necessary for dropout to be optimal, it need not be sufficient. A.6.3 Block Dropout Finally, we generalize the notion of dropout so that the noise distributions Q1, . . . , Qd are arbitrarily correlated. Thus, we define U(Q0, δ) = {Q0 Q1:d : Q1:d([0, (1 δ) 1]d) = 1, EQ1:d[ξ] = (1, . . . , 1) } and consider the problem sup Q U(Q0,δ) EQ[f(Y, β X ξ)]. Dropout Training is Distributionally Robust Optimal We give an example where this is a meaningful generalization of the notion of dropout. Consider f(y, z) = (y z)2 (i.e., square loss), also that Q0 is a Dirac measure on (x, y). Then the problem sup Q U(Q0,δ) EQ1:d i=1 xiβiξi) is equivalent to sup Q U(Q0,δ) EQ1:d By Theorem 1 in Isii (1962), the problem has dual inf a0,...,ad i=0 ai s.t. i=1 aiξi ξ [0, (1 δ) 1]d and, moreover, the optimal solution Q 1:d is supported in the set ξ [0, (1 δ) 1]d : where a 0, . . . , a d are the optimal dual variables. Suppose that xiβi = 0 for all i. Then, since Pd i=1 xiβiξi 2 is convex quadratic, any ξ in the support points set (45) must be in the boundary of the cube [0, (1 δ) 1]d. Moreover, since we have the mean constraint EQ 1:d[ξ] = (1, . . . , 1) , there exists ξ in (45) with blocks of components comprising zeros. The above arguments extend to any Q0 and f as long as the function EQ0[f(Y, β X ξ)] is a strictly convex function of ξ. Solving precisely the support of the noise requires solving for the dual variables a i , which is a semi-infinite linear programming (finite-dimensional lin- ear objective with infinitely many constraints). The numerical methods in the literature ap- proximate the semi-infinite problem with a sequence of finite programming problems, which are then solved by applying appropriate linear or nonlinear programming algorithms Het- tich and Kortanek (1993). Since the computation of the noise relies on the value of β, we suspect the overall computational burden will be large (as one first needs to solve for the worst-case block dropout noise, and then solve for the optimal β). A.6.4 Additive Perturbations in the Linear Regression Model Let ξ (ξ1, . . . , ξd) be defined as a d-dimensional vector of random variables that are independent of (X, Y ). We now perturb the distribution Q0 by considering the transforma- Blanchet, Kang, Olea, Nguyen and Zhang (X, Y ) 7 (X1 + ξ1, . . . , Xd + ξd, Y ) . As a result, each covariate Xj is distorted in an additive fashion by ξj. We abbreviate (X1 + ξ1, . . . , Xd + ξd) by X + ξ. We restrict the distribution of ξ in the following way. First, for a parameter λ [0, ), we define Qj(λ) to be the set of distributions for ξj that have mean 0 and variance EQ0[x2 j]. More specifically, Qj(δ) Qj : Qj is a probability distribution on R, EQj[ξj] = 0, EQj[ξ2 j ] λEQ0[x2 j] . Consider now the joint random vector (X, Y, ξ) Rd R Rd. For a constant λ [0, ) consider the joint distributions over (X, Y, ξ) defined by U(Q0, λ) = {Q0 Q1 . . . Qd : Qj Qj(λ) j = 1, . . . , d} , where denotes the product measure (meaning that the joint distribution is the product of the independent marginals Qj, j = 0, . . . , d). For the linear regression model, for any Q U(b Pn, λ) we have EQ h (Y (X + ξ) β)2i = Eb Pn (Y Xβ)2 + j=1 β2 j EQj[ξ2 j ], Eb Pn (Y Xβ)2 + j=1 β2 j λEb Pn[x2 j], h (Y Xβ) (Y Xβ) + λβ Λβ i , where Λ = diag(X X). Therefore, sup Q U(b Pn,λ) EQ h (Y (X + ξ) β)2i = 1 h (Y Xβ) (Y Xβ) + λβ Λβ i . Corollary 2 implies that, for the linear regression model, the distributionally robust es- timator to additive perturbations equals the dropout estimator with dropout probability λ/(1 + λ). This means that the dropout estimator remains distributionally robust over the set of additive and multiplicative perturbations of the empirical distribution: U(b Pn, δ) U(b Pn, δ/(1 δ)). Dropout Training is Distributionally Robust Optimal A.7 Additional Numerical Results Here, we outline an empirical rationale behind our parameter choices. A.7.1 Learning Rate We first fix an all-zeros initialization scheme, and vary the learning rate. We summarize the average parameter divergence and 1-standard deviation error for 20 repetitions of the SGD algorithm in Table 2. We can observe that the learning rate 0.0001 shows a clear advantage. We plot the average parameter divergence and 1-standard deviation error for the SGD trajectory up to the 180 s wall-clock time in Figure 4, which demonstrates that the divergence saturates with the selected learning rate (also see Figure 5 for detail between the 30 s and 180 s wall-clock times). A.7.2 Initialization Next, we fix the learning rate to be 0.0001, and consider different initialization schemes. We note that the mean value (resp., absolute value) of elements in β n is 0.3947 (resp., 0.6977). Table 3 shows the average parameter divergence and the 1-standard deviation from 20 repetitions of the SGD algorithm. We see that the initialization at origin is a fair A.7.3 Naive Monte Carlo with a fixed K We compare to the na ıve Monte Carlo implementation with a fixed K, where we do gra- dient descent on objective (32). The gradient-descent learning rate is searched over the grid {10 i, i = 0, 1, . . .}. We summarize the average parameter divergence and 1-standard deviation error for 20 repetitions of the approach (with the best learning rate) in Table 4. Note that, for small K, the objective (32) has a high bias, while for large K, there are fewer gradient descent steps completed within 60 s due to heavy computational burdens. Learning rate 0.1 0.01 0.001 bβSGD β n 1.1026 0.1705 0.2717 0.0403 0.0827 0.0133 Learning rate 0.0001 0.00001 0.000001 bβSGD β n 0.0301 0.0025 0.6702 0.1082 1.7202 0.0044 Table 2: Comparison for different learning rates, with fixed zero initializations. A.7.4 Wall-Clock Time We document the numerical results for 120 s/180 s wall-clock time; see Figures 6 8 for the case of 120 s, and Figures 9 11 for the case of 180 s. We see that the proposed unbiased Blanchet, Kang, Olea, Nguyen and Zhang Figure 4: SGD trajectory up to 180 s wall-clock time Figure 5: SGD trajectory between 30 s and 180 s wall-clock time approach outperforms the standard SGD when the number of parallel iterations reaches above some threshold. Initializations all zeros all 0.2 s all 1 s bβSGD β n 0.0301 0.0025 0.0317 0.0047 0.0614 0.0196 Initializations i.i.d. N(0, 1) i.i.d. N(0, 10) i.i.d. N(0, 102) bβSGD β n 0.0376 0.0067 0.1006 0.0469 0.3208 0.1432 Table 3: Comparison for different initialization schemes with fixed learning rate 0.0001. Dropout Training is Distributionally Robust Optimal Figure 6: l2 difference for 120 s wall-clock time Figure 7: l difference for 120 s wall-clock time Figure 8: l1 difference for 120 s wall-clock time Blanchet, Kang, Olea, Nguyen and Zhang Figure 9: l2 difference for 180 s wall-clock time Figure 10: l difference for 180 s wall-clock time Figure 11: l1 difference for 180 s wall-clock time Dropout Training is Distributionally Robust Optimal K 25 210 215 bβnaive MC β n 0.7370 0.1181 0.1281 0.0231 0.7417 0.0046 Table 4: Comparison with na ıve Monte Carlo with a fixed K. S Alan, O Attanasio, and M Browning. Estimating Euler equations with noisy data: two exact GMM estimators. Journal of Applied Econometrics, 24(2):309 324, 2009. B Von Bahr. On the convergence of moments in the central limit theorem. Annals of Mathematical Statistics, 36(3):808 818, 06 1965. C M Bishop. Training with noise is equivalent to Tikhonov regularization. Neural Compu- tation, 7(1):108 116, 1995. J Blanchet, P Glynn, and Y Pei. Unbiased multilevel Monte Carlo: Stochastic opti- mization, steady-state simulation, quantiles, and other applications. ar Xiv preprint ar Xiv:1904.09929, 2019a. J Blanchet, Y Kang, and K Murthy. Robust Wasserstein profile inference and applications to machine learning. Journal of Applied Probability, 56:830 857, 2019b. S Boyd and L Vandenberghe. Convex Optimization. Cambridge University Press, 2004. T Christensen and B Connault. Counterfactual sensitivity and robustness. ar Xiv preprint ar Xiv:1904.00989, 2019. A Das Gupta. Asymptotic Theory of Statistics and Probability. Springer Verlag, 2008. E Delage and Y Ye. Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations Research, 58(3):595 612, 2010. T Devries and G W Taylor. Improved regularization of convolutional neural networks with cutout. Co RR, abs/1708.04552, 2017. D Draper. Assessment and propagation of model uncertainty. Journal of the Royal Statistical Society, Series B, 56, 1994. J Duchi and H Namkoong. Variance-based regularization with convex objectives. Journal of Machine Learning Research, 20(1):2450 2504, 2019. R Durrett. Probability: Theory and Examples. Cambridge University Press, 2019. Blanchet, Kang, Olea, Nguyen and Zhang L Fahrmeir and H Kaufmann. Consistency and asymptotic normality of the maximum likelihood estimator in generalized linear models. The Annals of Statistics, pages 342 M H Farrell, T Liang, and S Misra. Deep neural networks for estimation and inference. Econometrica, 89(1):181 213, 2021. T Ferguson. Mathematical Statistics: A Decision Theoretic Approach, volume 7. Academic Press New York, 1967. M B Giles. Multilevel Monte Carlo path simulation. Operations Research, 56(3):607 617, M B Giles. Multilevel Monte Carlo methods. Acta Numerica, 24:259, 2015. I Goodfellow, Y Bengio, and A Courville. Deep Learning. MIT Press, 2016. L P Hansen and T J Sargent. Robustness. Princeton University Press, 2008. D P Helmbold and P M Long. On the inductive bias of dropout. The Journal of Machine Learning Research, 16(1):3403 3454, 2015. R Hettich and K Kortanek. Semi-infinite programming: Theory, methods, and applications. SIAM Review, 35(3):380 429, 1993. G E Hinton, N Srivastava, A Krizhevsky, I Sutskever, and R R Salakhutdinov. Improv- ing neural networks by preventing co-adaptation of feature detectors. ar Xiv preprint ar Xiv:1207.0580, 2012. K Hornik, M Stinchcombe, and H White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359 366, 1989. J T Hwang. Multiplicative errors-in-variables models with applications to recent data re- leased by the US Department of Energy. Journal of the American Statistical Association, 81(395):680 688, 1986. K Isii. On sharpness of Tchebycheff-type inequalities. Annals of the Institute of Statistical Mathematics, 14(1):185 197, 1962. J Kim and W Winkler. Multiplicative noise for masking continuous data. Statistics, 1:9, R H Lyles and L L Kupper. A detailed evaluation of adjustment methods for multiplicative measurement error in linear regression with applications in occupational epidemiology. Biometrics, pages 1008 1025, 1997. Dropout Training is Distributionally Robust Optimal L Maaten, M Chen, S Tyree, and K Weinberger. Learning with marginalized corrupted features. In International Conference on Machine Learning, pages 410 418, 2013. P Mc Cullagh and J A Nelder. Generalized Linear Models. Chapman & Hall, 1989. T K Nayak, B Sinha, and L Zayatz. Statistical properties of multiplicative noise masking for confidentiality protection. Journal of Official Statistics, 27(3):527, 2011. A Nemirovski, A Juditsky, G Lan, and A Shapiro. Robust stochastic approximation ap- proach to stochastic programming. SIAM Journal on Optimization, 19(4):1574 1609, W K Newey and D Mc Fadden. Large sample estimation and hypothesis testing. Handbook of Econometrics, 4:2111 2245, 1994. V A Nguyen, X Zhang, J Blanchet, and A Georghiou. Distributionally robust parametric maximum likelihood estimation. In Advances in Neural Information Processing Systems D S Park, W Chan, Y Zhang, C-C Chiu, B Zoph, E D Cubuk, and Q V Le. Spec Augment: A simple data augmentation method for automatic speech recognition. In Proc. Interspeech 2019, pages 2613 2617, 2019. D A Pierce, D O Stram, M Vaeth, and D W Schafer. The errors-in-variables problem: considerations provided by radiation dose-response analyses of the a-bomb survivor data. Journal of the American Statistical Association, 87(418):351 359, 1992. A E Raftery, D Madigan, and J A Hoeting. Bayesian model averaging for linear regression models. Journal of the American Statistical Association, 92(437):179 191, 1997. H Rahimian and S Mehrotra. Distributionally robust optimization: A review. ar Xiv preprint ar Xiv:1908.05659, 2019. H Robbins and S Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400 407, 1951. H Scarf. A min max solution of an inventory problem. Studies in the Mathematical Theory of Inventory and Production, 10:201 209, 1958. J Schmidt-Hieber. Nonparametric regression using deep neural networks with Re LU acti- vation function. The Annals of Statistics, 48(4):1875 1897, 2020. A Shapiro. Distributionally robust stochastic programming. SIAM Journal on Optimization, 27(4):2258 2275, 2017. Blanchet, Kang, Olea, Nguyen and Zhang A Shapiro, D Dentcheva, and A Ruszczy nski. Lectures on Stochastic Programming: Model- ing and Theory. SIAM, 2014. N Srivastava, G Hinton, A Krizhevsky, I Sutskever, and R Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929 1958, 2014. S Wager, S Wang, and P S Liang. Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems 26, pages 351 359. 2013. M J Wainwright and M I Jordan. Graphical Models, Exponential Families, and Variational Inference. Now Publishers Inc, 2008. S Wang and C Manning. Fast dropout training. In International Conference on Machine Learning, pages 118 126, 2013. C Wei, S Kakade, and T Ma. The implicit and explicit regularization effects of dropout. In International Conference of Machine Learning, 2020. W Wiesemann, D Kuhn, and M Sim. Distributionally robust convex optimization. Opera- tions Research, 62(6):1358 1376, 2014. M Zinkevich, M Weimer, L Li, and A J Smola. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 2595 2603, 2010.