# dropout_training_is_distributionally_robust_optimal__6d657887.pdf

Journal of Machine Learning Research 24 (2023) 1-60 Submitted 4/21; Revised 4/23; Published 6/23

Dropout Training is Distributionally Robust Optimal

Jos e Blanchet jose.blanchet@stanford.edu Department of Management Science and Engineering Stanford University Stanford, CA 94305, USA

Yang Kang yangkang@stat.columbia.edu Department of Statistics Columbia University New York, NY 10027, USA

Jos e Luis Montiel Olea jlo67@cornell.edu Department of Economics Cornell University Ithaca, NY 14850, USA

Viet Anh Nguyen nguyen@se.cuhk.edu.hk Department of Systems Engineering and Engineering Management Chinese University of Hong Kong Hong Kong SAR

Xuhui Zhang xuhui.zhang@stanford.edu Department of Management Science and Engineering Stanford University Stanford, CA 94305, USA

Editor: Samory Kpotufe

This paper shows that dropout training in generalized linear models is the minimax solution of a two-player, zero-sum game where an adversarial nature corrupts a statistician s covariates using a multiplicative nonparametric errors-in-variables model. In this game, nature s least favorable distribution is dropout noise, where nature independently deletes entries of the covariate vector with some ﬁxed probability δ. This result implies that dropout training indeed provides out-of-sample expected loss guarantees for distributions that arise from multiplicative perturbations of in-sample data. The paper makes a concrete recommendation on how to select the tuning parameter δ. The paper also provides a novel, parallelizable, unbiased multi-level Monte Carlo algorithm to speed-up the implementation of dropout training. Our algorithm has a much smaller computational cost compared to the naive implementation of dropout, provided the number of data points is much smaller than the dimension of the covariate vector.

Keywords: generalized linear models, generalization error, distributionally robust optimization, machine learning, multi-level Monte Carlo

c 2023 Jos e Blanchet, Yang Kang, Jos e Luis Montiel Olea, Viet Anh Nguyen, Xuhui Zhang.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/21-0377.html.

Blanchet, Kang, Olea, Nguyen and Zhang

1. Introduction

Dropout training is an increasingly popular estimation method in machine learning.1 The

general idea consists in ignoring some dimensions of the covariate vector at random while

estimating the parameters of a statistical model. A common motivation for dropout train-

ing is that the random feature selection implicitly performs model averaging, potentially

improving prediction error.2

This paper contributes to the growing literature explaining how dropout training can

improve a predictor s generalization error, e.g., Wager et al. (2013), Helmbold and Long

(2015), Wei et al. (2020). Broadly speaking, a predictor s generalization error refers to

its ability to perform well on new, previously unseen inputs not just those on which our

model was trained. (Goodfellow et al., 2016, Section 5.2). As we explain below, our main

result shows that dropout training improves generalization error over distributions that

arise from multiplicative perturbations of in-sample covariates.

To make this point, this paper studies dropout training in the context of generalized

linear models. A generalized linear model for the scalar outcome variable Y Y R, given

a d-dimensional vector of covariates X = (X1, . . . , Xd) Rd, is deﬁned by the conditional

f(Y |X, θ) h(Y, φ) exp Y β X Ψ(β X) /a(φ) , (1)

where the model s parameters are θ (β , φ) ; see Mc Cullagh and Nelder (1989, Equa-

tion 2.4). In our notation, h( , φ) is a real-valued function parameterized by φ and deﬁned

on the domain Y, a( ) is a positive function of φ, and Ψ( ) is the log-partition function

which we assume to be deﬁned on all the real line. Normal, logistic, and Poisson regression

all have conditional densities of the form in Equation (1).

Prediction in generalized linear models is typically based on the conditional expectation

of Y , given X, implied by the likelihood in (1), evaluated at an estimated parameter vector bθ (for example, one common choice for bθ is the vector containing the maximum likelihood

estimators for β and φ based on the available data).3 We deﬁne the generalization error of

a predictor for Y as:

EQ h ln f(Y |X, bθ ) i , where (X, Y ) Q. (2)

1. Section 7.12 of Goodfellow et al. (2016) provides a textbook treatment on dropout training. Bishop (1995) and Srivastava et al. (2014) are seminal references on this topic. 2. See Hinton et al. (2012) for a discussion about this point in the context of neural networks. See also Draper (1994) and Raftery et al. (1997) for classical results on the optimality of model averaging for prediction purposes. 3. Throughout the paper, we will typically use bθ, bβ, bφ to denote the estimators of parameters θ, β, φ, without necessarily making explicit reference to the data or the sample size.

Dropout Training is Distributionally Robust Optimal

The above expectation is computed by ﬁxing the estimated parameter vector, bθ, and then

drawing new covariates and outcomes according to the joint distribution Q. When Q dif-

fers from the data s empirical distribution, we interpret (2) as a measure of out-of-sample

performance (as the draws from Q can be thought of as new, and previously unseen, inputs).

Our main result (Theorem 1) shows that, when Q corresponds to multiplicative perturba-

tion of the covariates empirical distribution (in a sense we make precise), the generalization

error of dropout training is no larger than the in-sample loss of dropout training averaged

over dropout noise (Corollary 1). Therefore, our main result implies that dropout training

generalizes well to previously unseen data distributions arising from in-sample covariates

multiplicative perturbations.

We establish our main result by showing that dropping out input features when training

generalized linear models can be viewed as a Distributionally Robust Optimization (DRO)

problem (Shapiro et al., 2014). A DRO problem is a two-player, zero-sum game between

a decision maker (a statistician) and an adversary (nature). The statistician wishes to

choose an action to minimize a given expected loss (e.g., squared loss in a typical linear

regression setting or, more generally, the negative of the log-likelihood function). At the

same time, nature intends this loss to be maximal. We consider a framework in which

nature can harm the statistician by corrupting the available covariates using a multiplicative

nonparametric errors-in-variables model, as in the classic work of Hwang (1986).4 Under

mild assumptions, nature s least favorable distribution in this game turns out to be dropout

noise, where nature independently deletes entries of the covariate vector with some ﬁxed

probability δ. Because dropout noise is least favorable, we can use the in-sample loss of

dropout training averaged over the dropout noise to upper bound the generalization

error of predictors computed via dropout training, under any probability distribution that

arises via multiplicative perturbation of the covariates in the sample.

One might argue that the signiﬁcance and usefulness of Theorem 1 is limited, as the

distributions over which dropout training generalizes well only cover perturbations around

the data s empirical distribution. Indeed, in the quintessential deﬁnition of generalization

error, the new examples are typically drawn from the true data-generating process, which we

denote as P . Do our results have anything to say about the case in which the generalization

error in (2) is evaluated at P ? We answer this question in the positive, provided that

the dropout probability, δ, is chosen to be c/ n where n denotes the number of training

examples. In particular, we show that the generalization error of dropout training based

on examples drawn from the true data-generating process will be bounded above by the

loss of dropout in the training sample (averaged over the dropout noise) with a probability

4. As far as we know, this simple action space for nature is a novel model for conceptually and quantitatively understanding dropout training.

Blanchet, Kang, Olea, Nguyen and Zhang

that depends on c (Theorem 2). Consequently, by choosing a target probability, say 95%,

it is possible to provide a concrete recommendation for selecting c and, therefore, for δ. As

we will explain later, our concrete recommendation for choosing the dropout probability

in generalized linear models stands in contrast to ad hoc suggestions available for neural

networks, in which an input unit is usually included with probability 0.8, and a hidden unit

is included with probability 0.5 (Goodfellow et al., 2016, Chapter 7, p. 257).

Finally, and to analyze the implications of our results numerically, we make an additional

contribution by suggesting a new stochastic optimization implementation of dropout train-

ing. The generalization error of the standard stochastic gradient descent implementation

of dropout training here proﬁts from the implicit regularization imposed by the stochastic

gradient descent routine (Wei et al., 2020). Since our theoretical analysis makes no use of

this implicit regularization, it is important to have an algorithm that does not introduce

any further bias to the solution of dropout training. We borrow ideas from the multi-level

Monte Carlo literature in particular from the work of Blanchet et al. (2019a) to suggest

an unbiased (in a sense we will make precise) dropout training routine. Our algorithm is

easily parallelizable and has a much smaller computational cost than naive dropout training

methods when the number of features is large (Theorem 3). Our algorithm thus comple-

ments the recent literature suggesting approaches to speed-up dropout training by either

using a parallelized implementation of stochastic gradient descent (Zinkevich et al., 2010)

or a fast dropout training based on Gaussian approximations (Wang and Manning, 2013).

The rest of the paper is organized as follows. Section 2 explains dropout training in the

context of generalized linear models. Section 3 presents a general description of the DRO

framework used in this paper. Section 4 zeros in on the DRO problem by using the negative

log-likelihood of generalized linear models to deﬁne a loss function for the statistician, and

by allowing nature to harm the statistician via a multiplicative errors-in-variables model

for the covariates. This section also presents our main theorem. Section 5 presents our

approach to select the dropout probability, δ. Section 6 discusses diﬀerent computational

methods available for implementing dropout training (full integration, stochastic gradient

descent, na ıve Monte Carlo integration) and suggests our unbiased multi-level Monte Carlo

algorithm. Section 7 presents some simulations comparing our recommended selection of

δ to cross-validation and our preferred implementation of dropout training over stochastic

gradient descent. Finally, Section 8 discusses extensions of our results to a particular class

of feed-forward neural networks with a single hidden layer. We show that dropout training

of the hidden units in the hidden layer is distributionally robust and optimal. All of the

proofs are collected in the Appendix.

Dropout Training is Distributionally Robust Optimal

2. Dropout Training in Generalized Linear Models

This section describes dropout training in the context of generalized linear models. As

with some other recent papers in the literature, we view generalized linear models as a

convenient, transparent, and relevant framework to better understand the theoretical and

algorithmic properties of dropout training.

For a given covariate vector xi, and a user-selected constant, δ [0, 1), deﬁne the

d-dimensional random vector

ξi = (ξi,1, . . . , ξi,d) {0, 1/(1 δ)}d,

where each of the d entries of ξi is an independent draw from a scaled Bernoulli distribution

with parameter 1 δ. This is, for j = 1, . . . , d:

0 with probability δ,

(1 δ) 1 with probability (1 δ). (3)

The distribution of ξi,j collapses to ξi,j = 1 with probability 1 when δ = 0.

Let denote the binary operator deﬁning element-wise multiplication between two

vectors of the same dimension. Consider the covariate vector

xi ξi (xi,1ξi,1, . . . , xi,dξi,d) . (4)

Some entries of the new covariate vector are 0 (those for which ξi,j = 0), and the rest are

equal to xi,j/(1 δ).

In a slight abuse of notation, let Eδ denote the distribution of the random vector ξi,

whose distribution is parameterized by δ. The estimators of (β, φ) obtained by dropout

training correspond to any parameters (bβ(δ), bφ(δ)) that solve the problem

inf β,φ 1 n

i=1 Eδ [ ln f(yi|xi ξi, β, φ)] . (5)

Note that the maximum likelihood estimator of (β, φ), denoted as (bβML, bφML), equals

(bβ(0), bφ(0)).

There is some empirical evidence that using intentionally corrupted features for training

has the potential to improve the performance of machine learning algorithms (Maaten et al.,

2013). Even if one is willing to accept that feature corruption is desirable for estimation,

the choice of dropout noise in (3) seems arbitrary.

Blanchet, Kang, Olea, Nguyen and Zhang

To explain the empirical success of the use of dropout noise. Wager et al. (2013) and

Helmbold and Long (2015) theoretically analyzed the dropout training criterion (5) as the

composition of the original loss (e.g., Equation 5 with δ = 0) and a penalty term that

they view as a regularizer. They then analyzed this dropout regularizer and compared its

regularization eﬀect with other regularizers (such as L1 and L2 penalties). Speciﬁcally, Wa-

ger et al. (2013) proposed a convex quadratic approximation for the regularization penalty

using ﬁrst-order Taylor approximation in generalized linear models. While Helmbold and

Long (2015) conducted a more explicit analysis in the context of logistic regression and

suggested several non-standard properties of the dropout regularizer.

Relative to these papers, our work focuses on providing a novel decision-theoretic inter-

pretation of dropout training (in the population and the sample). We will argue there is

a natural two-player, zero-sum game between a decision maker (statistician) and an adver-

sary (nature) in which dropout training emerges naturally as a minimax solution. In this

game, dropout noise is nature s least favorable distribution, and dropout training becomes

the statistician s optimal action. The framework we use, from the stochastic optimization

literature, is distributionally robust optimization.

3. Distributionally Robust Optimization

Consider a general prediction problem where there is a multivariate predictor X Rd and

a scalar outcome variable Y R. A distributionally robust optimization (DRO) problem

is a simultaneous two-player zero-sum game between a decision maker (in our context, the

statistician) and an adversary (nature).5 In this section, we describe the action space for

each player, their strategies, and the payoﬀfunction.

3.1 Actions and Payoﬀ

The statistician s action space consists of vectors θ Θ. The ranking of the statistician s

actions is contingent on the realization of (X, Y ), and this is captured by a real-valued loss

function ℓ(X, Y, θ). If the statistician knew the distribution of (X, Y ) which we denote by

Q the statistician s preferred choice of θ would be the solution to

inf θ Θ EQ [ℓ(X, Y, θ)] . (6)

5. A seminal reference is the robust inventory control problem of Scarf (1958). Recent references describing the use of distributionally robust stochastic programs (like those considered in this paper) include Delage and Ye (2010) and Shapiro (2017). Christensen and Connault (2019) used distributionally robust optimization to characterize the sensitivity of counterfactual analysis with respect to distributional assumptions in a class of structural econometric models.

Dropout Training is Distributionally Robust Optimal

Instead of assuming that the distribution Q is determined exogenously, we think of the

distribution Q as being chosen by nature. Thus, nature s action space consists of a set of

probability distributions denoted as U. We refer to this set as the distributional uncertainty

set. If nature knew the action selected by the statistician, nature s preferred action would

sup Q U EQ[ℓ(X, Y, θ)]. (7)

3.2 Strategies and Solution

The choices of θ and Q are assumed to happen simultaneously. A statistician s strategy for

this game consists of a choice of θ; nature s strategy for this game consists of a choice of Q.

A Nash equilibrium for this game is a pair (θ , Q ) such that: a) given Q , the pa-

rameter θ solves (6) and b) given θ , the distribution Q solves (7).

The minimax solution for this game is a pair (θ, Q) that solves

sup Q U EQ[ℓ(X, Y, θ)], (8a)

while the maximin solution is based on the mathematical program

sup Q U inf θ Θ

EQ[ℓ(X, Y, θ)]. (8b)

If Q solves (8b), we say that Q is nature s least favorable distribution. The mathematical

program in (8a) is typically referred to as a DRO problem.

4. Dropout Training is Distributionally Robust Optimal

We now specialize the general DRO framework of Section 3 by imposing two restrictions.

First, we use the negative log-likelihood of generalized linear models (Mc Cullagh and Nelder,

1989) as a loss function for the statistician. Second, we deﬁne nature s uncertainty set (i.e.,

the possible data distributions that nature can take) using the multiplicative errors-in-

variables model of Hwang (1986).

4.1 Statistician s Payoﬀ

We deﬁne the loss function for the statistician to be the negative of the logarithm of the

likelihood in (1), that is,

ℓ(X, Y, θ) = ln h(Y, φ) + (Ψ(β X) Y (β X))/a(φ), (9)

Blanchet, Kang, Olea, Nguyen and Zhang

where θ (β , φ) Θ Rd R+. Equation (9) deﬁnes the statistician s objective and

its set of actions.

4.2 Nature s Distributional Uncertainty Set

We now deﬁne the possible distributions that nature can choose. We start out by letting

Q0 denote some benchmark or reference distribution over (X, Y ). This distribution will

typically be the empirical distribution of the data, but we present our main result allowing

for a general Q0. This distribution need not correspond to that induced by a generalized

linear model. In other words, our framework allows for the statistician s model to be mis-

Next, we deﬁne nature s action space by considering perturbations of Q0. Although

there are diﬀerent ways of doing this for example, by using either f-divergences (such

as the Kullback Leibler divergence, as used by Nguyen et al. 2020 and the χ2 divergence,

as done by Duchi and Namkoong 2019) or the optimal transport distance (such as the

Wasserstein distance, as in Blanchet et al. 2019b) to deﬁne a neighborhood we herein use

a nonparametric multiplicative errors-in-variables model as in Hwang (1986).6

The idea is to allow nature to independently introduce measurement error to the covari-

ates using multiplicative noise. Let ξ (ξ1, . . . , ξd) be deﬁned as a d-dimensional vector

of random variables that are independent of (X, Y ). We perturb the distribution Q0 by

considering the transformation

(X, Y ) 7 (X1ξ1, . . . , Xdξd, Y ) .

As a result, each covariate Xj is distorted, in a multiplicative fashion, by ξj. We often

abbreviate (X1ξ1, . . . , Xdξd) by X ξ, where signiﬁes element-wise multiplication.

We restrict the distribution of ξ in the following way: First, for a parameter δ [0, 1),

we deﬁne Qj(δ) to be the set of distributions for ξj that are supported on the interval

[0, 1/(1 δ)] and that have mean equal to 1. More speciﬁcally,

Qj(δ) Qj : Qj is a probability distribution on R, Qj([0, (1 δ) 1]) = 1, EQj[ξj] = 1 .

This set of distributions, prescribed using support and ﬁrst-order moment information, is

popular in the DRO literature thanks to its simplicity and tractability (Wiesemann et al.,

2014). From the perspective of an errors-in-variables model, these distributions are also

6. The diﬀerent choices of action space in the DRO literature have been shown to enjoy regularization beneﬁts. For example, Duchi and Namkoong (2019) showed that a DRO formulation with χ2 divergence leads to convex variance regularization. Blanchet et al. (2019b) showed that DRO with Wasserstein distance amounts to square-root LASSO in linear regressions. In contrast, the action space we introduce induces dropout regularization.

Dropout Training is Distributionally Robust Optimal

attractive because they preserve the expected value of the covariates, assuming that Xj and ξj are drawn independently.

Consider now the joint random vector (X, Y, ξ) Rd R Rd. For a constant δ [0, 1)

consider the joint distributions over (X, Y, ξ) deﬁned by

U(Q0, δ) = {Q0 Q1 . . . Qd : Qj Qj(δ) j = 1, . . . , d} , (11)

where denotes the product measure (meaning that the joint distribution is the product

of the independent marginals Qj, j = 0, . . . , d). Thus, in the game, we consider U(Q0, δ) as

nature s action space or nature s distributionally uncertainty set.

We will make only one assumption about the reference distribution: Q0 :

Assumption 1 The distribution Q0 satisﬁes EQ[ℓ(X ξ, Y, θ)] < for any Q U(Q0, δ),

any θ Θ, and any scalar δ [0, 1).

This assumption implies a minimal regularity condition to guarantee that the expected

loss is well deﬁned for both the statistician and nature. Assumption 1 holds trivially when

Q0 is the empirical distribution of the data, which is one of the main cases of interest in

4.3 Dropout Training is Distributionally Robust Optimal

Theorem 1 Consider the two-player zero-sum game where the statistician has the loss

function in (9) and nature has the action space in (11) for some reference distribution Q0 and a scalar δ [0, 1). If Assumption 1 holds, then for any θ Θ,

sup Q U(Q0,δ) EQ [ℓ(X ξ, Y, θ)] (12)

is equivalent to

EQ [ℓ(X ξ, Y, θ)], (13)

where Q = Q0 Q 1 . . . Q d , and Q j = (1 δ) 1 Bernoulli(1 δ) is a scaled Bernoulli

distribution for any j = 1, . . . , d, i.e., under Q j

0 with probability δ,

(1 δ) 1 with probability (1 δ). (14)

In addition, let θ Θ be a minimizer of (13). Then (θ , Q ) constitutes a Nash equi-

librium of the two-player zero-sum game deﬁned by (9) and (11), and Q is nature s least

favorable distribution.

Blanchet, Kang, Olea, Nguyen and Zhang

Proof See Appendix A.2.

The ﬁrst part of the theorem characterizes nature s worst-case perturbation of Q0. From

the statistician s perspective, nature s worst-case perturbation of Q0 is given by the dropout

noise Q in (14). Under this worst-case distribution, nature independently corrupts each

entry of X = (X1, . . . , Xd) , either by dropping the j-th component (if ξj = 0), or by re-

placing it by Xj/(1 δ). Dropout training which here refers to estimating the parameter

θ after adding dropout noise to X thus becomes the statistician s preferred way of esti-

mating the parameter θ when facing an adversarial nature. This gives a decision-theoretic

foundation for the use of dropout training.

Note that, in order to recover the objective function introduced in (5) (the sample

average of the contaminated log-likelihood), it suﬃces to set the reference measure Q0

as the empirical distribution b Pn of {(xi, yi)}n i=1, which satisﬁes Assumption 1.

We provide now some intuition about how dropout noise becomes nature s worst-case

distribution. Algebra shows that, in light of Assumption 1, the expected loss under an

arbitrary distribution Q is ﬁnite and can be written as

EQ[ℓ(X ξ, Y, θ)] = EQ0 [ln h(Y, φ)]

+ EQ0 h EQ1 ... Qd[(Ψ((β X) ξ) Y ((β X) ξ))/a(φ)] i ,

where the ﬁrst expectation is taken with respect to the reference distribution, and the

second with respect to ξ. For ﬁxed values of (X, Y, θ), we can deﬁne the function

A(X,Y,θ)((β X) ξ) (Ψ((β X) ξ) Y ((β X) ξ))/a(φ).

The key step to establishing Theorem 1 is to show that

sup n EQ1 ... Qd[A(X,Y,θ)((β X) ξ)] : Qj Qj(δ) o = EQ 1 ... Q d [A(X,Y,θ)((β X) ξ)]

for any θ. The proof of this equality crucially exploits the convexity of the function A(X,Y,θ) :

R R, which we show to be a consequence of the convexity of the log-partition function

Ψ( ).7 We make four important remarks about the equality in (15), and about the role that

convexity plays in the optimality of dropout training.

Remark 1. First, a natural question to ask is whether the convexity of the log-partition

function Ψ imposes a signiﬁcant restriction on the class of generalized linear models that

7. To derive our result, we ﬁrst characterize the worst-case distribution for the expectation of a real-valued convex function (Lemma 1) and then we generalize this result to functions that depend on ξ only through linear combinations, as A(X,Y,θ)( ) (Lemma 2).

Dropout Training is Distributionally Robust Optimal

we are considering. It is known that the log-partition function of any generalized linear

model with an open parameter space is convex; see Proposition 3.1 in Wainwright and

Jordan (2008). This immediately implies that the convexity of Ψ should not be viewed as

a signiﬁcant restriction.

Remark 2. Second, it is reasonable to inquire whether our results concerning the optimality

of dropout training could be extended beyond generalized linear models. For instance, one

could be interested in considering a model in which linear predictors are obtained using an

objective function of the form

EQ0[f(Y, β X)],

where f(y, ) : R R, and (X, Y ) Q0. This formulation includes certain types of

extremum estimators as well as single-index models estimated by nonlinear least squares.

Theorem 4 in Appendix A.6.1 shows that, if dropout noise is nature s worst-case distribution

among multiplicative perturbations of covariates; that is, if

sup Q U(Q0,δ) EQ[f(Y, β X ξ)] = EQ [f(Y, β X ξ)] (16)

for some δ (0, 1], all d 1, all β Rd and all reference distributions Q0, then f(y, )

must be a convex function on the real line for any given y. This means that convexity

indeed plays a crucial role in the optimality of dropout training and, while it is possible

to consider extensions outside the class of generalized linear models, such extensions will

still have to impose some form of convexity (which may turn out to be very strong).8 In

Appendix A.6.2, we also argue that, beyond generalized linear models, the convexity of the

objective function and the deﬁnition of U are, in general, not suﬃcient to make dropout

training minimax-optimal.

Remark 3. Third, to further understand the extent to which convexity-like restrictions

imply the optimality of dropout noise, we consider a generalization of the multiplicative

errors-in-variables model that allows for correlated noise. We show that, in this case, there

is a simple generalization of dropout (that we term block dropout ) that drops out blocks

of variables at the same time. See Appendix A.6.3.

Remark 4. Fourth, Theorem 1 does not imply that U(Q0, δ) is the largest class of distri-

butions for which dropout training is distributionally robust optimal (or for which dropout

noise is the worst-case distribution). In appendix A.6.4 we show that in the context of

the linear regression model the dropout estimator can also be obtained by solving a dis-

tributionally robust optimization problem over additive contamination models (provided

8. For instance, in Appendix A.6.1, we show that the only single-index model with an objective function of the form f(y, β x) = (y g(β x))2 for which dropout training is optimal is the linear regression model; that is, g(β x) = β x.

Blanchet, Kang, Olea, Nguyen and Zhang

the additive perturbations are independent of the data, have a mean of zero, and satisfy a

certain bound on their second moments). Thus, the union of additive and multiplicative

perturbations still gives dropout noise as the worst-case distribution, and makes the dropout

training estimator distributionally robust optimal.

How about the Nash Equilibrium of the two-player zero-sum game deﬁned by Equa-

tions (9) and (11)? The equality in Equation (15) clearly shows that Q is nature s best

response for any θ Θ. If there is a vector θ that solves the dropout training problem in

Equation (13), then this vector is the statistician best s response to nature s choice of Q .

Consequently, (θ , Q ) is a Nash equilibrium.

Finally, we discuss the extent to which Q can be referred to as nature s least favorable

distribution, which has been deﬁned as nature s solution to the maximin problem. It is well

known that the maximin value of a game is always smaller than its minimax value:9

sup Q U(Q0,δ) inf θ Θ

EQ[ℓ(X ξ, Y, θ)] inf

sup Q U(Q0,δ) EQ[ℓ(X ξ, Y, θ)].

We have shown that the right-hand side of the display above equals the inﬁmum of (13).

Therefore, if there is a θ Θ that solves such program, then Q achieves the upper bound

to the maximin value of the game. To see this, note that by deﬁnition of the left-hand side

of the display above

inf θ Θ EQ [ℓ(X ξ, Y, θ)] sup Q U(Q0,δ) inf θ Θ

EQ[ℓ(X ξ, Y, θ)],

while we also have

inf θ Θ EQ [ℓ(X ξ, Y, θ)] = inf

sup Q U(Q0,δ) EQ[ℓ(X ξ, Y, θ)],

by Equation (13). This makes dropout noise nature s least favorable distribution.

Now that we have established that dropout training gives the minimax solution of the

DRO game, we discuss the implications of this result regarding the out-of-sample perfor-

mance of dropout training. Suppose Q0 is the empirical measure b Pn supported on n training

samples {(xi, yi)}n i=1. Let bθ(δ) denote the estimators of (β, φ) based on dropout training

with dropout probability δ. The in-sample loss of dropout training, averaged over dropout

9. This follows from the fact that for any Q U(Q0, δ) :

inf θ Θ EQ[ℓ(X ξ, Y, θ)] EQ[ℓ(X ξ, Y, θ)] sup Q U(Q0,δ) EQ[ℓ(X ξ, Y, θ)].

See also the discussion of the minimax theorem in Ferguson 1967, p. 81.

Dropout Training is Distributionally Robust Optimal

noise, is 1 n

i=1 EQ [ℓ(X ξ, Y, bθ(δ))|X = xi, Y = yi], (17)

which is equivalent to the objective function of dropout noise that we presented in equation

(5), evaluated at the parameters estimated using dropout training; that is

i=1 Eδ h ln f yi|xi ξi, bβ(δ), bφ(δ) i .

A typical concern about estimation procedures is whether their performance in a speciﬁc

sample translates to good performance out of sample. In our context, the out-of-sample

performance of dropout training can be thought of as the expected loss that would arise

for some other data distribution Q over (X, Y ) at the parameter estimated via dropout

E Q[ℓ(X, Y, bθ(δ))].

The following is a direct corollary of Theorem 1:

Corollary 1 Let bθ(δ) = (bβ(δ) , bφ(δ)) denote the dropout estimators of β and φ given

dropout noise δ. Consider any distribution Q over (X, Y ) that can be obtained from b Pn by

perturbing covariates with mean-one, independent multiplicative error ξj [0, (1 δ) 1].

That is, if Q corresponds to the distribution of a vector of the form

( X1ξ1, . . . , Xdξd, Y ) ,

where ( X1, . . . , Xd, Y ) b Pn and (ξ1, . . . ξd) is a vector of i.i.d. random variables (indepen-

dent of the empirical distribution of the data), supported on [0, (1 δ) 1]. Then,

E Q[ℓ(X, Y, bθ(δ))] 1

i=1 Eδ h ln f yi|xi ξi, bβ(δ), bφ(δ) i . (18)

This means that the objective function used for dropout training will provide an upper

bound for the out-of-sample loss associated with multiplicative perturbations of the training

Finally, we note that Theorem 1 was stated for a scalar δ that is homogeneous across

the multiplicative noise ξj. To model non-identical dropout noise, we can substitute the sets

in Equations (10) and (11) by Qj(δj) for a collection of parameters (δ1, . . . , δd) [0, 1)d.

In this case, the results of Theorem 1 hold with Q j = (1 δj) 1 Bernoulli(1 δj) for

j = 1, . . . , d.

Blanchet, Kang, Olea, Nguyen and Zhang

5. Statistical Guidance on Choosing δ

Theorem 1 above showed that dropout training is distributionally robust optimal and that

nature s least favorable distribution is dropout noise with probability δ [0, 1). This

section suggests a strategy to pick this parameter to control the generalization error of

dropout training.

5.1 Additional Notation

Let bφn denote an arbitrary n-asymptotically normal estimator for the scale parameter φ.

Such an estimator can be the maximum likelihood estimator, which is known to be consistent

and asymptotically normal under mild regularity conditions on the joint distribution of

(Xi, Yi) (Fahrmeir and Kaufmann, 1985). We also allow bφn to be the dropout training

estimator of φ but with a dropout probability of c/ n. Finally, as we did before, we let bβ(δ) denote the dropout estimator of β under dropout probability δ.

In this section, we assume that the observed data was generated by a generalized linear

model. Let (β , φ ) denote the parameters that were used to generate the training data,

and let P denote the corresponding joint distribution of covariates and outcomes under

the true data generating process (hence, the training sample consists of n i.i.d. draws from

A key quantity in this section is

Ln(β, φ, δ) 1

i=1 Eδ [ ln f(yi|xi ξi, β, φ)] , (19)

which equals the in-sample (or empirical) loss at parameters β and φ, but averaged over

the dropout noise with dropout probability δ.

To speak about the generalization error of dropout training, we deﬁne the population

loss at (β, φ) as

L(β, φ) EP [ ln f(Y |X, β, φ)] . (20)

L(bβ(δn), bφn) (21)

is the loss that will arise from evaluating the negative log-likelihood in Equation (1) at

(bβ(δn), bφn) and then averaging over values of (X, Y ) drawn according to P . This corre-

sponds to the deﬁnition of generalization error used in Equation (2), evaluated at P . If the

negative log-likelihood is used as a measure of performance of the estimators (bβ(δn), bφn),

then an upper bound on (21) can be thought of as an upper bound on the generalization

error of (bβ(δn), bφn).

Dropout Training is Distributionally Robust Optimal

The main result in this section shows that the event

Ln(bβ(δn), bφn, δn) L((bβ(δn), bφn)), (22)

can be guaranteed to hold with probability at least 1 α, for an appropriate choice of the

sequence δn. This means that we can upper bound the generalization error of the estimators

(bβ(δn), bφn) with probability at least 1 α, as we explain next.

5.2 Additional Assumptions and Main Result of this Section

Assumption 2 The log-partition function Ψ( ) has a bounded second derivative.

Assumption 3 The second moment matrix EP [XX ] is ﬁnite, positive deﬁnite.

Assumption 4 For some σ2:

n Ln(β , bφn, 0) L(β , φ ) d N(0, σ2).

We can now state this section s main result. Deﬁne the random variable

µn(β, φ, δ) Ln(β, φ, δ) Ln(β, φ, 0), (23)

which is nonnegative for any δ [0, 1) by Theorem 1.

Theorem 2 Suppose that Assumptions 2, 3, 4 hold. Then for any sequence δn = c/ n,

n Ln(bβ(δn), bφn, δn) L((bβ(δn), bφn) d N(µ (β , φ , c), σ2),

where µ (β , φ , c) 0 is a linear function of c, strictly increasing for almost every (β , φ ),

and equal to the probability limit of

nµn β , bφn, δn ,

and µn( ) is deﬁned as in Equation (23).

Proof See Appendix A.3.

The main message of Theorem 2 is that the probability of the event

Ln(bβ(δn), bφn, δn) L((bβ(δn), bφn)

Blanchet, Kang, Olea, Nguyen and Zhang

can be approximated, as the sample size goes large, by the probability under a normal

random variable with positive mean of the positive half of the real line. That is:

P Ln(bβ(δn), bφn, δn) L((bβ(δn), bφn) = P N(µ (β , φ , c), σ2) 0 + o(1).

Since µ is nonnegative, it is a strictly increasing function of c for almost every (β , φ ),

then c can be chosen to guarantee that the probability of the ﬁrst term in the right-hand

side of the equality is as close to one as desired. This means that we can choose the

dropout probability to guarantee that the generalization error associated with the estimator

(bβ(δn), bφn) which we have denoted as L(bβ(δn), bφn) admits an upper bound (estimable

based on information in the sample) with high probability. See Equation (29) for our

explicit recommendation on how to choose δ = c/ n based on Theorem 2.

Some elementary algebra can be used to illustrate the main argument behind the proof.

It is convenient to start analyzing the diﬀerence

n Ln(bβ(δn), bφn, δn) L(β , φ )

instead of n Ln(bβ(δn), bφn, δn) L((bβ(δn), bφn) .

Algebra shows that

n Ln(bβ(δn), bφn, δn) L(β , φ ) = n Ln(bβ(δn), bφn, δn) Ln(β , bφn, δn) (24)

+ nµn(β , bφn, δn) (25)

+ n Ln(β , bφn, 0) L(β , φ ) . (26)

We show that the sum of these terms converges in distribution to a normal. The proof has

three main steps that we explain below.

Step 1. We start by showing that term (24) converges in probability to zero. To do this,

we show that n(bβ(c/ n) β ) (27)

is asymptotically normal and that the derivative of Ln( ) with respect to β (evaluated at β )

converges in probability to zero. One important observation is that mean of the asymptotic

distribution of Equation (27) is nonzero, and depends linearly on c. Consequently, even

though the dropout training estimator is asymptotically normal, its limiting distribution

exhibits bias (and the norm of such bias increases linearly as a function of c2). This ﬁrst

step thus shows that choosing a larger dropout probability comes at the cost of a decrease in

accuracy in the estimation of β . Our recommended procedure for choosing c will guarantee

Dropout Training is Distributionally Robust Optimal

that the generalization error of dropout training will be bounded with some prespeciﬁed

probability, despite the decrease in estimation accuracy.

Step 2. We then show that term (25) has a ﬁnite probability limit. This is the key step in

the proof. Importantly, we can characterize this limit explicitly and show that

nµn(β , bφn, δn) p c µ,

ξ A EP [Ψ((X ξ) β )]

d EP [Ψ(X β )] + EP [Y X ]β

and A is the collection of all vectors in {0, 1}d for which there is only one zero. In Section

7, we provide an expression for this term in the linear regression model.

Step 3. The term (26) above has, by Assumption 4, an asymptotically normal distribution

with mean zero and variance σ2.

Step 4. Finally, in order to establish Theorem 1 we show that

n L((bβ(δn), bφn)) L(β , φ ) = o P (1), (28)

which means that the expected loss evaluated at (bβ(δn), bφn) converges in probability to

L(β , φ ) at a rate of at least n 1/2.

It is important to mention that the DRO interpretation of dropout training can be

leveraged to select the dropout parameter δ. For example, a possible approach is choosing δ

so that the true data-generating process belongs to nature s choice set with some prespeciﬁed

probability. This approach, which is often advocated in the literature on machine learning

and robustness (Hansen and Sargent, 2008), often leads to a very pessimistic selection of δ

simply because this criterion is not informed at all by the loss function deﬁning the decision

problem. Further, in our problem, it is impossible to apply this approach, given that the

set of multiplicative perturbations of the empirical distribution will generally not cover the

true data-generating process.

Another approach involves using generalization bounds leading to ﬁnite sample guar-

antees; see, for instance, a summary of this discussion in Section 6.2 of Rahimian and

Mehrotra (2019). This method, while appealing, often requires either distributions with

compact support or strong control on the tails of the underlying distributions. Also, often,

the bounds depend on constants that may be too pessimistic or too diﬃcult to compute.

Finally, Blanchet et al. (2019b) recently introduced a method for the case in which

nature s choice set is deﬁned in terms of the Wasserstein distance around the empirical

Blanchet, Kang, Olea, Nguyen and Zhang

distribution. The idea therein is that for a ﬁxed δ every distribution that belongs to

nature s choice set corresponds to an optimal parameter choice for the statistician. Thus,

one can collect each and every of the statistician s optimal choices associated with each

distribution in nature s uncertainty set, and treat the resulting region as a conﬁdence set

for the true parameter. This conﬁdence set grows bigger (in the sense of nested conﬁdence

regions) as δ increases. The goal is then to minimize δ subject to a desired level of coverage

in the underlying parameter to estimate. This leads to a data-driven choice of δ explicitly

linked to the statistician s decision problem. However, this approach is not feasible for our

problem because, once again, regardless of the value of δ, the parameter choices for each of

the multiplicative perturbations of the empirical distribution will generally fail to cover the

true parameter.

5.3 Our recommendation for choosing δ.

We advocate choosing the parameter δ to control how often the in-sample loss obtained

from dropout training exceeds the population loss. The proof of Theorem 2 shows that

µ (β , φ , c) is of the form c µ, where µ depends on (β , φ ). Consequently, as long as

µ > 0, it is straightforward to pick c to guarantee a pre-speciﬁed coverage of the population

loss: for any α (0, 1), if we pick c to be

z1 α σ/µ, (29)

where z1 α is the 1-α quantile of a standard normal, and µ is evaluated at consistent

estimators of β and σ then the probability of the event (22) asymptotically approaches

6. An Algorithm for Dropout Training

The goal of this section is to suggest an algorithm for solving the dropout training problem

inf θ Θ EQ [ℓ(X ξ, Y, θ)],

where Q = b Pn Q 1 . . . Q d and Q j , j = 1, . . . , d is the dropout noise distribution

deﬁned in (14). Notice that we here consider the speciﬁc case in which Q0 is set to the

empirical measure b Pn supported on n training samples {(xi, yi)}n i=1. We will use θ n to

denote the solution of the dropout training problem above. It will sometimes be convenient

Dropout Training is Distributionally Robust Optimal

to rewrite this dropout training problem as

i=1 EQ [ℓ(X ξ, Y, θ) | X = xi, Y = yi], (30)

which coincides with expression (5). Conditioning on the values of (xi, yi) makes it clear

that the expectation is computed over the d-dimensional vector ξ. We now brieﬂy describe

three common approaches to implement dropout training and discuss some of its limitations.

6.1 Naive Dropout Training

Because Q j places mass on only two points, namely 0 and (1 δ) 1, the support of the

joint distribution Q 1 . . . Q d has cardinality 2d. Thus, a na ıve approach to solving the

dropout training problem speciﬁed by Equation (30) is to expand the objective function as

a sum with n 2d terms, then apply a tailored gradient descent algorithm to the resulting

optimization problem. Computationally, however, this approach is too demanding because

the number of individual terms in the objective function grows exponentially with d.

6.2 Dropout Training via Stochastic Gradient Descent

Another method to solve the dropout training problem in Equation (13) is to use stochastic

gradient descent (Robbins and Monro, 1951). This is tantamount to i) taking a draw of

(xi, yi) according to its empirical distribution, ii) independently taking a draw of ξi using

the distribution in Equation (3), and iii) computing the stochastic gradient descent update

ln f(yi|xi ξi, β, φ).

Given a current estimate bθ, we compute an unbiased estimate of the gradient to the

objective function of (13), and move in the direction of the negative gradient by a step of

suitable size. Since Q is discrete, the expectation under Q can be written as a ﬁnite

sum, and, by diﬀerentiating under the expectation, we have

θEQ [ℓ(X ξ, Y, bθ)] = EQ h θℓ(X ξ, Y, bθ) i . (31)

The standard SGD algorithm uses a na ıve Monte Carlo estimator as an estimate of the

gradient (31), that is, at iteration k N with incumbent solution bθk,

θEQ [ℓ(X ξ, Y, bθk)] θℓ(xk ξk, yk, bθk),

where (xk, yk, ξk) is an independent draw from Q .

Blanchet, Kang, Olea, Nguyen and Zhang

One drawback of using SGD for our problem is that it is not easily parallelizable, and

thus its implementation can be quite slow. Moreover, under the strong convexity assump-

tion of the loss function ℓ, SGD exhibits only linear convergence (Nemirovski et al., 2009,

Section 2.1). In contrast, gradient descent (GD) oﬀers exponential convergence (Boyd and

Vandenberghe, 2004, Section 9.3.1).

6.3 Na ıve Monte Carlo Approximation for Dropout Training

Consider solving dropout training problem (30) using a na ıve Monte Carlo approximation.

Instead of using 2d terms to compute

EQ [ℓ(X ξ, Y, θ) | X = xi, Y = yi],

we approximate this expectation by taking a large number of K i.i.d. draws {ξk i }K k=1,

ξk i Rd, according to the distribution Q 1 . . . Q d . When d is large, this approximation

is computationally cheaper than the na ıve dropout training procedure described above,

provided that K 2d.

Thus, the na ıve Monte Carlo approximation of the dropout training problem is

min θ Θ 1 n

k=1 ℓ(xi ξk i , yi, θ)

where the random vectors ξk i are sampled independently over both k and i using the

distribution Q 1 . . . Q d .

Relative to the solution of the dropout training problem which we denoted as θ n the

minimizer of approximation (32) is consistent and asymptotically normal as K . This

follows by standard arguments; for example, those in Shapiro et al. (2014, Section 5.1).

There are, however, two problems that arise when using problem (32) as a surrogate for the

dropout training problem. First, the solution to approximation (32) is a biased estimator

for θ n . This means that, if we average the solution over the K n diﬀerent values of ξk i , the

average solution need not equal θ n . Second, implementing approximation (32) requires a

choice of K and, to the best of our knowledge, there is no oﬀ-the-shelf procedure for picking

6.4 Unbiased Multi-level Monte Carlo Approximation for Dropout Training

To address these two issues, we apply the recent techniques suggested by Blanchet et al.

(2019a) that we refer to as unbiased multi-level Monte Carlo approximations. Multi-level

Monte Carlo methods (Giles, 2008, 2015) are a set of techniques for approximating the ex-

Dropout Training is Distributionally Robust Optimal

pectation of random variables. The adjective multi-level emphasizes the fact that random

samples of diﬀerent levels of accuracy are used in the approximation. Before presenting the

detailed algorithm, we provide a heuristic description. To this end, let bθ n (K) denote the

level K solution of the problem in (32); that is, the solution based on K draws. Deﬁne the

random variable

K bθ n (K) bθ n (K 1)

and, for simplicity, assume that bθ n (0) is a vector of zeros. Under suitable regularity condi-

K=1 E[ K] = lim K E[bθ n (K)] = θ n .

Consider now picking K randomly from some discrete distribution supported on the natural

numbers. Let p( ) denote the probability mass function of such distribution and consider a

Monte Carlo approximation scheme in which after drawing K we sample K n diﬀerent

random vectors ξk i Rd according to Q 1 . . . Q d . The estimator

has two sources of randomness. These are, ﬁrstly, the random choice of K and, secondly,

the random draws ξk i . Averaging over both yields

K=1 E[Z(K )|K = K] p(K) =

K=1 (E[ K]/p(K)) p(K) = θ n .

Thus, by taking into account the randomness in the selection of K, we have managed to

provide a rule for deciding the number of draws (speciﬁcally, our recommendation is to pick

K at random), and, at the same time, we have removed the bias of na ıve Monte Carlo

approximations.

One possible concern with our suggested implementation is that the expected computa-

tional cost of Z(K ) could be inﬁnitely large. Fortunately, this issue can be easily resolved

by an appropriate choice of the distribution p( ). To see this, deﬁne the computational cost

simply as the number of random draws that are required to obtain Z(K ). In the construc-

tion we have described above, we need K n draws for the construction of the estimator.

Thus, the average cost is

Blanchet, Kang, Olea, Nguyen and Zhang

which, under mild integrability conditions on p( ), will be ﬁnite.10

We now present the algorithm that will be used to solve the dropout training problem.

To ensure that the estimator Z(K ) has a ﬁnite variance, instead of deﬁning K as the

diﬀerence between the level K and K 1 solutions to the approximation problem (32) in

the above heuristic arguments, we use solutions for a sample of size 2K+1 and with its odd

and even sub-samples of size 2K.

6.5 Algorithm for the Unbiased Multilevel Monte Carlo

We present a parallelized version using L processors, which works even when L = 1. Parallel

computing reduces the variance of the estimator, so we suggest using as many processors

as are available per run.

Fix an integer m0 N such that 2m0+1 2d. For each processor l = 1, . . . , L we

consider the following steps.

i) Take a random (integer) draw, m l , from a geometric distribution with parameter

ii) Given m l , take 2K l +1 i.i.d. draws from the d-dimensional vector ξi Q 1 . . . Q d ,

K l m0 + m l .

Repeat this step independently for each i = 1, . . . , n.

iii) Solve problem (32) using the ﬁrst 2m0 i.i.d. draws of ξi for each i. Let θl,m0 denote a

iv) Denote by bθ n (2K l +1), bθO n (2K l ), and bθE n (2K l ) any solution to the following optimiza-

tion problems (all of which are based on sample average approximations as the Monte

10. For example, if p( ) is selected as a geometric distribution with parameter r, the expected computational cost will be n(1 r)/r. 11. To see why we require that r > 1/2, notice that, if the computational cost of evaluating Z(K ) (as in the heuristic description above) increases exponentially in K and takes the form C 2K, the expected computational cost will be X

K=1 Cr(2(1 r))K = Cr(1/2(1 r)),

provided 2(1 r) < 1, or equivalently, r > 1/2. As we show in the proof of Theorem 3, constraining the variance requires then imposing r < 3/4. Ultimately, optimizing the product of computational cost and variance leads to the optimal selection r = 1 2 3/2.

Dropout Training is Distributionally Robust Optimal

Carlo approximation 32):

bθ n (2K l +1) arg min θ Θ 1 n

k=1 ℓ(xi ξk i , yi, θ)

bθO n (2K l ) arg min θ Θ 1 n

k=1 ℓ(xi ξ2k 1 i , yi, θ)

bθE n (2K l ) arg min θ Θ 1 n

k=1 ℓ(xi ξ2k i , yi, θ)

Intuitively, bθO n and bθE n denote the solutions to problem (32) but using a sample of size

2K l with only odd indices and even indices, respectively.

v) Deﬁne K l bθ n (2K l +1) 1

2(bθO n (2K l ) + bθE n (2K l ))

Z(K l ) = K l r(1 r)K l m0 + θl,m0.

Our recommended estimator is

l=1 Z(K l ).

We now show that the suggested algorithm gives an estimator with desirable properties.

We do so under the following regularity assumptions.

Assumption 5 Suppose that the parameter space Θ is compact. Suppose in addition that

the optimal solution θ n to the dropout training problem in (30) is (globally) unique.

Assumption 6 Let bθ n (K) denote the solution of the problem in (32) based on K draws.

Suppose that as K ,

E[ K 1 2 (bθ n (K) θ n ) 4 2) = O(1),

where the expectation is taken over the i.i.d. dropout noise distribution used to generate ξk i .

Assumption 7 Assume that, for each (X, Y, ξ), ℓ(X ξ, Y, ) is thrice continuously diﬀer-

entiable over Θ and that the Hessian matrix θθEQ [ℓ(X ξ, Y, θ n )] is non-singular.

Blanchet, Kang, Olea, Nguyen and Zhang

Theorem 3 Under Assumption 5, E[Z(K l )] = θ n . The number of random draws required

to compute Z(K l ) is n 2K l +1 and thus the expected computational complexity for producing

Z(K l ) equals n(2m0+1)r

2r 1 < n(2m0+1) n2d.

Suppose, in addition, that bθ n (K) is almost surely in the interior of Θ for large enough K.

If Assumptions 6 and 7 hold, and r < 3/4, then Var(Z(K l )) < .

Proof See Appendix A.4.

Our suggested algorithm has ﬁnite expected computational complexity that does not

grow exponentially with the dimensionality d, thus, every time we need to obtain bθ n (2K l +1),

we can do so by gradient descent. Combined with parallelization, the unbiased multi-level

Monte Carlo approach produces an unbiased estimator with a variance that can be made

arbitrarily small if L is large enough, provided that the regularity assumptions that give

Var(Z(K l )) < hold.

7. Numerical Experiments

We conduct numerical experiments in this section to compare our preferred implementation

of dropout training to stochastic gradient descent, as well as our recommended selection

of δ to cross-validation. The beneﬁts of our suggested unbiased multi-level Monte Carlo

algorithm are analyzed using high-dimensional regression, whereas our selection of δ is

analyzed using a low-dimensional regression model.

7.1 Advantage of the Unbiased Multi-level Monte Carlo Estimator

We present a simple numerical experiment to illustrate the advantage of using the unbiased

multi-level Monte Carlo estimator suggested in Section 6.4. We consider the linear regression

problem with known variance and we focus on solving the dropout training problem with

our recommended δ chosen according to Proposition 2.

Our simulation setting considers a linear regression model with a covariate vector having

dimensionality d = 100 and sample size n = 50. We pick a known regression coeﬃcient

β0 Rd being a vector with all entries equal to 1. With ﬁxed coeﬃcients, we assume the

covariate vector follows an independent Gaussian, as well as for the regression noise. More

speciﬁcally, we can get our 50 observations (xi, yi) via

sampling xi N(0, Id), i = 1, . . . , n,

Dropout Training is Distributionally Robust Optimal

sampling yi R conditional on xi, where yi is given by the linear assumption and εi are i.i.d. random noise following N(0, 102), for i = 1, . . . , n.

Our simulation ﬁrst considers a high-dimension setting (relatively low ratio between sample

size per dimension n/d = 0.5) with high noise-to-signal ratio (variability on residual noise

is high compared to the variability on xi).

If we set Q0 to be the empirical distribution of {(xi, yi)n i=1}, the dropout training problem

in the linear regression model is

min β Rd EQ β (X ξ) Y 2 .

Corollary 2 in Appendix A.5 shows that, in the linear regression model, the dropout training

problem can be written as

min β Rd 1 n

(Y Xβ) (Y Xβ) + δ 1 δβ Λβ ,

where Y = [y1, y2, . . . , yn] , X = [x1, x2, . . . , xn] and Λ is the diagonal matrix with its

diagonal elements given by the diagonals of X X. Moreover, there is a closed-form solution

for the dropout training problem and it is given by the ridge regression formula:

β n = X X + δ 1 δΛ 1 X Y.

We choose the dropout probability δ following Proposition 2. More speciﬁcally, Propo-

sition 2 suggests the choice δ = c/ n where c = z1 α σ/µ. For linear regression with

known variance, it is straightforward to compute

j=1 EP [X2 j ](β j )2,

σ2 = Var P 1

2 log(2πφ ) + (Y (β ) X)2

Choosing α = 0.1 and note that β = β0, φ = 102, we have δ 0.26.

Since neither our suggested multi-level Monte Carlo algorithm nor standard SGD (as

deﬁned in Section 6.2) uses closed-form formulae for their implementation, we analyze the

extent to which these procedures can approximate the parameter β n. The two algorithms

we compare are:

Blanchet, Kang, Olea, Nguyen and Zhang

Figure 1: l2 diﬀerence

Standard SGD algorithm with a learning rate 0.0001 and initialization at the origin.

Note that we take batched SGD instead of the single-sample SGD introduced in

Section 6.2.

Multi-level Monte Carlo algorithm with a geometric rate r = 0.6 and a burn-in period

m0 = 5. Note that in each parallel run, we use gradient descent (GD) with 0.01

learning rate and initialization at origin for steps iii) and iv) in Section 6.4.

We run our simulation on a cluster with two Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40 GHz

processors (with 10 cores each), and a total memory of 128 GB. We ﬁx 60 seconds as a

wall-clock time , so that we terminate the two algorithms after 60 seconds.12 We run 1000

independent experiments. For each run, we calculate and report the average parameter

estimation divergence to β n and 1-standard deviation error bar for the divergence. We

consider diﬀerent number of parallelizations (i.e., L in Section 6.4) from 400 to 2400. We

cap the run at 2400 due to the saturation of divergence after 2000 parallelizations.

Figure 1 shows the l2 divergence from the true β n of the two algorithms for varying

L, while Figure 2 and Figure 3 show the l and l1 divergence, respectively. We observed

that our unbiased estimator outperforms the standard SGD algorithm once the number of

parallel iterations exceeds some moderate threshold ( 1000 here). We provide supporting

evidence in Appendix A.7 to argue our choice of the learning rate, initialization, and wall-

clock time, where our proposed algorithm is robust to any reasonable choices.

12. The parameters for the SGD algorithm are appropriately tuned to achieve good convergence within 60 s (see Appendix A.7 for the tuning procedure). We do not claim that this choice of parameters is optimal.

Dropout Training is Distributionally Robust Optimal

Figure 2: l diﬀerence

Figure 3: l1 diﬀerence

Blanchet, Kang, Olea, Nguyen and Zhang

7.2 Coverage of the True Loss of Dropout Training

Here, we validate that our recommended selection of δ guarantees that the in-sample loss

of dropout training is covering the true loss with arbitrarily high probability as prescribed

by Proposition 2.

We use the same linear regression model with dimensionality d = 10 and n {103, 104}

training samples. We choose diﬀerent quantiles of the normal as in Proposition 2. We also

include 10-fold cross-validation and ordinary least squares for comparison; see Table 1, where

we estimate the frequency of coverage over 1000 independent runs. The main message is

that our suggested choice of δ guarantees that the in-sample loss of dropout training exceeds

the true, unknown, population loss with probability 1 α. When using standard OLS or

choosing δ by cross-validation, the in-sample loss is smaller than the population loss with

probability close to 1/2, which implies that these methods are unsatisfactory in terms of

frequency of coverage.

α = 0.2 α = 0.1 α = 0.05 10-fold CV plain OLS n = 103 0.77 0.01 0.88 0.01 0.94 0.01 0.52 0.02 0.40 0.01 n = 104 0.79 0.01 0.90 0.01 0.94 0.01 0.49 0.02 0.47 0.02

Table 1: Frequency of in-sample loss covering the true population loss. Our recommended selection of δ = c/ n with c = z1 ασ/µ has a theoretical 1 α coverage probability.

8. Extensions

In this section, we discuss the extent to which the decision-theoretic support for dropout

training carries over to neural networks. The main idea is to use a GLM model where

the natural parameter is no longer a linear function of the covariates, but instead a neural

network. We note that dropout training in neural networks typically drops neuron activa-

tions in both the input layer and the hidden layer. Dropout only at the input layer (data

augmentation) is more closely related to our analysis in the context of GLM models and is

also considered in the literature, for example by Devries and Taylor (2017) and Park et al.

(2019) for vision and speech recognition tasks. In the remaining part of this section, we

consider feed-forward neural network with one hidden layer, and consider dropout noise in

both the input and hidden layers.

We also note that there are already concrete recommendations for how to select the

dropout probability in neural networks. For example, Goodfellow et al. (2016, p. 257)

says, The probability of sampling a mask value of one (causing a unit to be included) is

a hyperparameter ﬁxed before training begins. It is not a function of the current value of

Dropout Training is Distributionally Robust Optimal

the model parameters or the input example. Typically, an input unit is included with a

probability 0.8, and a hidden unit is included with a probability 0.5. We then run forward

propagation, back-propagation, and the learning update as usual. It is not entirely clear

to us what the theoretical support is for this recommendation. We think that our results

for choosing δ in GLM models (which depends on the sample size, and on the estimated

values of the parameters of the model) provide, at the very least, a principled alternative

to select the dropout probability of features in the hidden layer that precedes the output

layer. Our simulations for GLM models (Table 1) suggest that this strategy indeed gives

good bounds on generalization error.

8.1 One-hidden-layer Feed-Forward Neural Networks

Suppose the scalar response variable Y is generated by the conditional density

f(Y |X, θ, φ) h(Y, φ) exp ((Y Ωθ(X)) Ψ(Ωθ(X))) /a(φ)) , (33)

where Ωθ(X) is a neural network with parameters θ and X Rd. This is a simple extension

of the regression model that has been used recently to study deep neural networks see

Schmidt-Hieber (2020) in which the conditional density is Gaussian.

In this section, we will assume that Ωθ(X) is a neural network with a single hidden layer,

a diﬀerentiable activation (squashing) function, and a linear output function. A function

h : R [0, 1] is a squashing function if it is non-decreasing and if

lim r h(r) = 1, lim r h(r) = 0.

For further detail, see Hornik et al. (1989, Deﬁnition 2.3).

Although these types of networks which will be formally described below are restric-

tive compared to the modern deep learning architectures, they can approximate any Borel

measurable function from a ﬁnite-dimensional space to another, provided the hidden units

in the hidden layer are large (Hornik et al., 1989).

Consider a neural network with K units in the hidden layer, each using input weights

wk Rd, k = 1, . . . , K. Denote the activation function in the hidden layer as h( ). Assume

a linear output function with a vector of weights β RK. Thus, the network under

consideration is deﬁned by the function:

Ωθ(X) β1h(w 1 X) + . . . + βKh(w KX) = β H(X),

Blanchet, Kang, Olea, Nguyen and Zhang

where H(X) = (h(w 1 X), . . . , h(w KX)) . The neural network is parameterized by θ

(β , w 1 , . . . , w k ) . Under this model, the distribution of Y |X is a GLM model with co-

variates H(X).

8.1.1 Statistician s Objective Function

We will endow the statistician with the loss function given by the negative of the conditional

log-likelihood for the model in Equation (33).

8.1.2 Nature s Uncertainty Set

We allow nature to introduce additional noise to the statistician s model. We do this in

two steps. First, we allow nature to distort the distribution of X using a multiplicative

noise denoted as ξ(1) Rd. This is exactly analogous to what we did in the GLM model,

where nature was allowed to pick a distribution for the covariates of the form (X ξ(1)).

We allow nature to contaminate the input layer with independent and multiplicative noise.

Second, we also allow nature to contaminate each of the hidden units with multiplicative

noise ξ(2) RK. That is, nature is also allowed to pick a vector ξ(2) = (ξ(2)1, . . . , ξ(2)K) ,

independently of ξ(1) Rd, to distort the each of the K units in the hidden layer as

H(X) ξ(2) (h(w 1 X)ξ(2)1, . . . , h(w KX)ξ(2)K) .

Our choice of a one-hidden neural network was simply for expositional simplicity, but

the analysis would be the same with a feed-forward neural network with L hidden layers.

8.1.3 Minimax Solution

The minimax solution of the DRO game is given by

inf θ sup Q EQ [ ln f(Y |H(X ξ1) ξ2, β, φ)] , (34)

where Q now refers to the joint distribution of (X, Y, ξ(1), ξ(2)) and f(Y |X, β, φ) is the GLM

density deﬁned in (1). We continue working with the assumption that ξ (ξ(1) , ξ(2) )

has independent marginals and that it is independent of (X, Y ).

We would like to solve for the worst-case distributions of the random vectors ξ(1) and

ξ(2), assuming that both of these satisfy the restrictions analogous to Equation (11). The

solution for the distribution of ξ(2) can be obtained as a corollary to Theorem 1, as it

suﬃces to deﬁne X H(X ξ(1))

Dropout Training is Distributionally Robust Optimal

and to view (34) as the DRO problem in a linear regression model, in which the data is

( X, Y ) and ξ(2) RK is simply the multiplicative noise that transforms the covariates into

The worst-case choice of ξ(1), the multiplicative error for the inputs, is more diﬃcult

to characterize and we were not able to ﬁnd general results for it. Below, we provide a

heuristic argument suggesting that dropout noise might approximate the worst-case choice

when the output layer is a Gaussian linear model. Let ξ(1)j denote the j-th coordinate of

ξ(1). Suppose that the distribution of this random variable places most of its mass on the

interval [1 ϵ, 1 + ϵ].13 This allows us to linearize the output of each of the hidden units

around the output corresponding to unperturbed inputs as

h(w k (X ξ(1))) = h(w k (X (ξ(1) 1)) + w k X)

h(w k X) + h(w k X) (wk X) (ξ(1) 1) .

In the notation above, 1 denotes the d-dimensional vector of ones. For the sake of exposition,

ignore the approximation error in the linearization above. If we ﬁx (X, Y, ξ(2)), then the

worst-case choice for the distribution of ξ(1), denoted by Q(1), maximizes

k=1 βk ξ(2)k

j=1 wk,j Xj (ξ(1)j 1)

among all distributions with independent marginals for which EQ(1)[ξ(1)j] = 1 for all j =

1, . . . , d. It can then be shown that such a maximization problem is equivalent to maximizing

k=1 βk ξ(2)k h(w k X)

j=1 wk,j Xj (ξ(1)j 1)

which, in turn, can be written as

a (ξ(1) 1) 2

for an appropriate choice of a vector a Rd that depends only on (β, ξ(2), h, h, w, X).

Lemma 2 in Appendix A.2 shows that the solution to this problem is dropout noise.

13. This is compatible with dropout noise for which δ is very close to zero.

Blanchet, Kang, Olea, Nguyen and Zhang

9. Concluding Remarks

This paper examines dropout training, an increasingly popular estimation method in ma-

chine learning. Dropout training is a fundamental part of modern machine learning tech-

niques for training very deep networks (Goodfellow et al., 2016).

Our main result (Theorem 1) established a novel decision-theoretic foundation for the

use of dropout training. We showed that this method, when applied to generalized linear

models, can be viewed as the minimax solution to an adversarial two-player, zero-sum

game between a statistician and nature. The framework used in this paper is known in

the stochastic optimization literature (Shapiro et al., 2014) as a distributionally robust

optimization (DRO) problem.

Our minimax result showed, by construction, that dropout training indeed provides

out-of-sample performance guarantees for distributions that arise from multiplicative per-

turbations of the in-sample data. Our result thus justiﬁed, explicitly, the ability of dropout

training to enhance a predictor s out-of-sample performance, which is one of the reasons

often invoked to promote the dropout method.

In addition to our theoretical result, we also suggested a new strategy to select the

dropout probability and a new stochastic optimization implementation of dropout training.

For the latter, we borrowed ideas from the multi-level Monte Carlo literature in particular

from Blanchet et al. (2019a) to suggest an unbiased dropout training routine that is easily

parallelizable and that has a smaller computational cost than na ıve dropout training meth-

ods when the number of features is large (Theorem 3). Crucially, we showed that under

some regularity conditions, our estimator has ﬁnite variance (which means that there are

also theoretical, and not just practical, gains from parallelization).

The connection between dropout training and the multiplicative errors-in-variables model

established in this paper is novel. We think this connection is potentially interesting because

multiplicative errors have found a range of applications in empirical work across various dis-

ciplines, from economics to epidemiology. For example, Alan et al. (2009) uses it to account

for measurement error in consumption data when estimating the elasticity of intertemporal

substitution via Euler equations. Pierce et al. (1992) and Lyles and Kupper (1997) use

it to relate health outcomes to the exposure of a chemical toxicant that is observed with

error. Moreover, due to privacy considerations, statistical agencies such as the U.S. Census

Bureau sometimes mask data using multiplicative noise, as discussed by Kim and Winkler

(2003) and Nayak et al. (2011). Examples of data sets that contain variables masked with

multiplicative noise include the Commodity Flow Survey Data (2017) and the Survey of

Business Owners (2012) both from the U.S. Census Bureau and the U.S. Energy Infor-

mation Administration Residential Energy Consumption survey. Applications of dropout

training in these contexts could be an interesting area for future work.

Dropout Training is Distributionally Robust Optimal

We also discussed the extent to which our theoretical results extended to Neural Net-

works (in particular, to the universal approximators in Hornik et al. 1989 consisting of a

single-hidden layer and a squashing activation function). Our results showed that Theorem

1 can be used to establish the optimality of dropout training to estimate the parameters of

the last hidden layer in general feed-forward neural networks, where the output layer takes

the form of a generalized linear model. We hope that our analysis serves as a foundation to

understand the beneﬁts of dropout training in neural networks.

Acknowledgments

We would like to thank Matias Cattaneo, Max Farrell, Michael Leung, Ulrich M uller, Mark

Peletier, Hashem Pesaran, Ashesh Rambachan, Roger Moon, Frank Schorfheide, Stefan

Wager, and participants at the Statistics Seminar series at Columbia University for helpful

comments and suggestions. Jos e Blanchet acknowledges support from NSF grants 1820942,

1838576, 1915967, 2118199, 2229012, 2312204 and the Chinese Merchant Bank. Material

in this paper is based upon work supported by the Air Force Oﬃce of Scientiﬁc Research

under award number FA9550-20-1-0397. Viet Anh Nguyen acknowledges the support from

the CUHK s Improvement on Competitiveness in Hiring New Faculties Funding Scheme.

Blanchet, Kang, Olea, Nguyen and Zhang

Appendix A.

This appendix collections all the proofs of main results, as well as additional theoretical

and empirical results.

A.1 Probability Limit of the Dropout Training Estimator for β

Proposition 1 Suppose that Assumptions 2 and 3 hold. Let the data {(xi, yi)}n i=1 consist

of n i.i.d. draws from a distribution P . Then for any sequence δn δ [0, 1) as n , bβ(δn) converges in probability to

β (δ) arg minβ EP [Eδ [ ln f(Y |X ξ, β, φ)]] , (36)

where the minimizer in (36) is unique and does not depend on φ.

Proof The dropout estimator of β maximizes

i=1 yi(β xi) Eδn[Ψ(β (xi ξ))].

In a slight abuse of notation, let ξδ denote a realization of dropout noise parameterized by δ.

Then it is possible to re-write the objective function as a weighted average of the functions

Qn,ξδn(β) 1

i=1 yi(β xi) Ψ(β (xi ξδn)).

It will be convenient to deﬁne the limiting objective function to be

Q(β) EP [Y (β X)] EP [Eδ[Ψ(β (X ξ))]],

which, by Assumptions 1 and 2, is ﬁnite and strictly concave. The population objective

function is then the average (over dropout noise) of

Qξδ(β) EP [Y (β X)] EP [Ψ(β (X ξδ))].

It is straightforward to show that β (δ) in Equation (36) denotes the unique maximizer of

The proof of the proposition follows from standard arguments in the theory of extremum

estimators. In particular, it suﬃces to verify the conditions of Theorem 2.7 in Newey and

Mc Fadden (1994).

Dropout Training is Distributionally Robust Optimal

Condition i) in Newey and Mc Fadden (1994) requires Q(β) to be uniquely maximized

at β (δ). This holds because Assumptions 1 and 2 imply that Q(β) is strictly concave.

Condition ii) in Newey and Mc Fadden (1994) requires β (δ) to be an element in the

interior of a strictly convex set, which holds because, in the GLM models under consider-

ation, the parameter space is Rd. Furthermore, Qn(β) is trivially concave by Assumption

Condition iii) requires Qn(β) to converge in probability to Q(β) for every β. For this

purpose, it suﬃces to show that Qn,ξδn(β) converges in probability to Qξδ(β) for each ﬁxed

β, and for a sequence ξδn and ξδ that have zeros and non-zeros in exactly the same entries.

Assumptions 1 and 2 imply EP [Y (β X)] < for all β. Thus, using the Law of Large

Numbers for i.i.d. sequences,

i=1 Yi(β Xi) p EP [Y (β X)].

Finally, Assumptions 1 and 2 imply that the triangular array

Zn,i = Ψ(β (Xi ξδn)), 1 i n,

satisﬁes the conditions for the Law of Large Numbers for triangular arrays (Durrett, 2019,

Theorem 2.2.11), and, consequently,

i=1 Ψ(β (Xi ξδn)) p EP h Ψ(β (X ξδ)) i .

This completes the proof.

A.2 Proof of Theorem 1

The proof of Theorem 1 relies on the following two preparatory results.

Lemma 1 (Extremal expectation of a univariate convex function) For any <

a < b < + , let ζ be a random variable in [a, b] with mean µ [a, b]. For any convex,

continous function f : [a, b] R, the distribution of ζ that maximizes E[f(ζ)] among all

distributions over [a, b] with a given mean µ [a, b] is a scaled and shifted Bernoulli distri-

bution, i.e.,

a with probability (b µ)/(b a),

b with probability (µ a)/(b a). (37)

Blanchet, Kang, Olea, Nguyen and Zhang

Proof Let Q denote the probability measure induced by the random variable in (37). By

EQ [f(ζ)] = b µ

b af(a) + µ a

Suppose ﬁrst that µ = a. In this case, Jensen s inequality implies that for any other

probability measure Q over [a, b] with mean µ = a,

EQ[f(ζ)] f(EQ[ζ]) = f(a) = EQ [f(ζ)].

An analogous result holds if µ = b.

Consider then the case in which µ (a, b). For an arbitrary probability measure Q over

[a, b] with mean µ (a, b), we have

[a,b] f(ζ)d Q = Z

[a,b] f ab ζ

b af(a) + ζ a

b a f(b) d Q,

where the inequality follows from the convexity of f. By the linearity of the integral operator

and the fact that R

[a,b] ζd Q = µ, we ﬁnd

[a,b] f(ζ)d Q b µ

b af(a) + µ a

Because the probability measure Q was chosen arbitrarily, this implies that the distribution

of ζ in Equation (37) maximizes the expectation of f(ζ).

Lemma 2 Fix a vector of tuning parameters δ (0, 1)d. Let Qj(δj) be deﬁned as in (10).

Suppose that A is a convex and continuous function on R. For any θ Rd, we have

sup n EQ1 ... Qd[A(θ ξ)] : Qj Qj(δj) o = EQ 1 ... Q d [A(θ ξ)],

where Q j is a scaled Bernoulli distribution of the form Q j = (1 δj) 1 Bernoulli((1 δj))

for each j = 1, . . . , d.

Proof First, note that Q j Qj(δj) for each j, and thus Q 1 . . . Q d is a feasible

solution to the maximization problem. It suﬃces to show that, for any set of feasible

measures Qj Qj(δj), j = 1, . . . , d, we have

EQ1 ... Qd[A(θ ξ)] EQ 1 ... Q d [A(θ ξ)].

Dropout Training is Distributionally Robust Optimal

Towards this end, pick any k {1, . . . , d}. By Fubini s theorem, we can write

EQ1 ... Qd[A(θ ξ)] = EQ1 ... Qk 1 Qk+1 ... Qd EQk[A(θ ξ)].

For any ﬁxed value (ξ1, . . . , ξk 1, ξk+1, . . . , ξd) the function ξk 7 A(P j =k θjξj + θkξk) is

convex in the variable ξk over the interval [0, (1 δk) 1]. Thus by Lemma 1,

j =k θjξj +θkξk)] EQ k [A( X

j =k θjξj +θkξk)] for any ﬁxed (ξ1, . . . , ξk 1, ξk+1, . . . , ξd).

Thus by the monotonicity of the expectation operator,

EQ1 ... Qd[A(θ ξ)] EQ1 ... Qk 1 Qk+1 ... Qd EQ k [A(θ ξ)]

= EQ1 ... Qk 1 Q k Qk+1 ... Qd[A(θ ξ)].

By cycling through all possible values of k {1, . . . , d} we conclude that

EQ1 ... Qd[A(θ ξ)] EQ 1 ... Q d [A(θ ξ)].

Therefore, the postulated claim holds.

We are now ready to prove Theorem 1.

Proof Note that for Q U(Q0, δ), Assumption ?? implies EQ[ℓ(X ξ, Y, θ)] is ﬁnite for

any θ Θ and any scalar δ [0, 1). Therefore, from Fubini s theorem and the deﬁnition of

the loss function:

EQ[ℓ(X ξ, Y, θ)] = EQ0 [EQ1 ... Qd[ℓ(X ξ, Y, θ)]]

= EQ0 h EQ1 ... Qd[ ln h(Y, φ) + (Ψ(β (X ξ)) Y (β (X ξ)))/a(φ)] i

= EQ0 [ln h(Y, φ)]

+ EQ0 h EQ1 ... Qd[(Ψ(β (X ξ)) Y (β (X ξ)))/a(φ)] i .

It can then be shown that, for any β, X and ξ:

β (X ξ) = (β X) ξ.

Thus, we can ﬁx the values of (X, Y, θ) and deﬁne the function

A(X,Y,θ)((β X) ξ) (Ψ(β (X ξ)) Y β (X ξ))/a(φ).

Blanchet, Kang, Olea, Nguyen and Zhang

Note that A(X,Y,θ) satisﬁes the condition of Lemma 2. Therefore

sup n EQ1 ... Qd[A(X,Y,θ)((β X) ξ)] : Qj Qj(δj) o = EQ 1 ... Q d [A(X,Y,θ)((β X) ξ)],

for any (X, Y, θ), which completes the proof.

A.3 Proof of Theorem 2

Proof We write n(Ln(bβ(δn), bφn, δn) L(β , φ )) as the sum of the following three terms

n Ln(bβ(δn), bφn, δn) Ln(β , bφn, δn) , (38a)

n Ln(β , bφn, δn) Ln(β , bφn, 0) , (38b)

n Ln(β , bφn, 0) L(β , φ , 0) . (38c)

By Assumption 4, the last term converges in distribution to a normal random variable, so

we only need to analyze terms (38a) and (38b). We establish the proof following the four

steps outlined in the main body of the paper.

Step 1: The same arguments as in Theorem 3.1 in Newey and Mc Fadden (1994) we can

show that for any sequence δn = c/ n

n(bβ(δn) β ) d Σ(β ) 1Nd( c µ, a(φ )Σ(β )),

Σ(β) EP [ Ψ(X β)XX ],

ξ A EP [ Ψ((X ξ) β )(X ξ)]

(d 1)EP [Y X] + Σ(β )β .

The set A above is deﬁned as {ξ {0, 1}d : exactly one entry of ξ is zero}. The argument

is essentially the same as in every proof of asymptotic normality for extremum (or M-

estimators), with the only diﬀerence being that, because of the dropout noise, the score

term is asymptotically normal with a nonzero mean. In fact,

βLn(β, bφn, δn) βLn(β, bφn, 0)+ 1

i=1 Eδn[(Xi ξ) Ψ(β (Xi ξ))] Xi Ψ(X i β)

Dropout Training is Distributionally Robust Optimal

βLn(β, bφn, 0) 1

i=1 Xi(Yi Ψ(X i β)).

Recognizing the term βLn(β, bφn, 0) as the negative of the score function in the GLM model

and doing some algebra, it is possible to show that βLn(β, bφn, δn) is op(1). Therefore (38a)

Step 2: For the term in (38b), note ﬁrst that it is nonnegative. Also: Ln(β , bφn, δn)

Ln(β , bφn, 0) equals

i=1 Eδn[Ψ((Xi ξ) β )] Ψ(X i β )

The term in parenthesis has a ﬁnite mean equal to

n EP Eδn[Ψ((X ξ) β )] EP [Ψ(X β )]. (39)

It can be shown by verifying the conditions for the Law of Large Numbers for triangular

arrays (Theorem 2.2.11 in Durrett 2019) that

n(Ln(β , δn) Ln(β , 0) a(bφn) 1 n) p 0.

Moreover, Assumptions 1 and 2 imply n n p , where

ξ A EP [Ψ((X ξ) β )] d EP [Ψ(X β )] + EP [ Ψ(X β )X β ]

Step 3: Since, by Assumption, the term in (34c) is asymptotically normal, then we have

shown that n(Ln(bβ(δn), bφn, δn) L(β , φ )) d N( /a(φ ), σ2).

Step 4: Finally, we show that

n L((bβ(δn), bφn)) L(β , φ ) = OP (1).

We have deﬁned

Ln(β, φ, δ) 1

i=1 Eδ [ ln f(yi|xi ξi, β, φ)] ,

L(β, φ) EP [ ln f(Y |X, β, φ)] .

Blanchet, Kang, Olea, Nguyen and Zhang

Let (β , φ ) be a stationary point of L in the interior of the parameter space, so that

β L(β , φ ) = 0, φL(β , φ ) = 0.

Further, let bφn denote an arbitrary n-consistent, asymptotically normal estimator of φ .

Under our assumptions, L continuously diﬀerentiable. Then, by Taylor expansion,

L(bβ(δn), bφn) L(β , φ ) =

β L( β, φ)(bβ(δn) β ) +

φL( β, φ)(bφn φ ),

where ( β, φ) lies on the line between (bβ(δn), bφn) and (β , φ ). We have shown that

n(bβ(δn) β ) d Σ(β ) 1Nd( c µ, a(φ )Σ(β )).

Also we have that β L( β, φ) 0, φL( β, φ) 0

in probability, since ( β, φ) (β , φ ) in probability and (β , φ ) is a stationary point of L.

Thus n(L(bβ(δn), bφn) L(β , φ )) 0,

in probability by Slutsky s theorem. Moreover, because L is twice continuously diﬀeren-

tiable, then

n(L(bβ(δn), bφ) L(β , φ )) = Op(1).

A.4 Proof of Theorem 3

Proof By deﬁnition

Z(K l ) = K l r(1 r)m l + θl,m0,

where K l is a discrete random variable with probability mass function:

p(K l ) = r(1 r)K l m0,

and supported on the integers larger than m0.

Dropout Training is Distributionally Robust Optimal

We ﬁrst show that the estimator Z(K l ) is unbiased (as we average over both K l and

ξk i ). Algebra shows that

E[Z(K l )] =

K=m0 E[Z(K l )|K l = K]p(K)

" K l p(K l ) + θl,m0

p(K) + θl,m0

K=m0 E bθ n (2K+1) 1

2(bθO n (2K) + bθE n (2K))

E[bθO n (2m0)] + E[bθE n (2m0)] + E[θl,m0] + lim K E[bθ n (2K+1)].

The expectations in the last line are all ﬁnite because Θ is compact. In addition, since the

draws are i.i.d. and θl,m0 is the solution to the problem (32) when 2m0 draws are used we

E[bθO n (2m 0 )] + E[bθE n (2m 0 )] + E[θl,m0] = 0.

Moreover, the sequence of random variables {bθ n (2K+1)} is uniformly integrable, because Θ

is a compact subset of a ﬁnite-dimensional Euclidean space. Finally, we know that

bθ n (2K+1) p θ n

as K . The uniform integrability of the sequence of estimators then implies

lim K E[bθ n (2K+1)] = E lim K bθ n (2K+1) = θ n ,

see Theorem 6.2 in Das Gupta (2008). We conclude that

E[Z(K l )] = lim K E[bθ n (2K+1)] = θ n .

Now we show that the expected computational cost of Z(K l ) is ﬁnite. In order to com-

pute Z(K) for a given K we need n 2K+1 random draws. Thus, the expected computational

Blanchet, Kang, Olea, Nguyen and Zhang

cost of Z(K l ) is

K=m0 n2K+1r(1 r)K m0 = n (2m0+1) r

K=m0 2K m0(1 r)K m0

= n (2m0+1) r

K=m0 (2(1 r))K m0.

The term above converges to

n (2m0+1) r

1 2(1 r) = n (2m0+1) r

provided that 2(1 r) < 1, which holds because we have chosen r > 1/2.

For the proof of ﬁnite variance, we intend to show that

E h K K i = O(2 2K) (40)

as K . Equation (40) guarantees that every processor generates an estimator Z(K l )

with ﬁnite variance. Since K l is a discrete random variable with probability mass function

p(K l ) = r(1 r)K m0,

E[Z(K l ) Z(K l )] =

K=m0 E h Z(K l ) Z(K l )|K l = K i p(K)

p(K) + θl,m0

p(K) + θl,m0

K=m0 E K K p(K)2

K=m0 E h θ l,m0θl,m0 i p(K)

p(K) + sup θ Θ θ 2 2p(K)

1 22m022(K m0)p(K) + sup θ Θ θ 2 2p(K)

1 r4m0 1 (4(1 r))K m0

Dropout Training is Distributionally Robust Optimal

The geometric sum in the last expression is ﬁnite because we have assumed that r < 3

To show Equation (40), we do a Taylor expansion of the ﬁrst-order conditions of the

problem (32) around θ n . The Karush Kuhn Tucker optimality condition for the level-2K

solution bθ n (2K) of the problem in expression (32) implies

k=1 θℓ(xi ξk i , yi, bθ n (2K))

It follows by Taylor expansion and Assumption 4 that

k=1 θℓ(xi ξk i , yi, θ n )

k=1 θθℓ(xi ξk i , yi, θ n )

bθ n (2K) θ n

k=1 θℓ(xi ξk i , yi, θ n )

i=1 θθEQ h ℓ(X ξ, Y, θ n )|X = xi, Y = yi i ] bθ n (2K) θ n

+RK + RK,θ, (41)

k=1 θθℓ(xi ξk i , yi, θ n ) θθEQ h ℓ(X ξ, Y, θ n )|X = xi, Y = yi i

bθ n (2K) θ n ,

i=1 sup θ Θ,ξ θθθℓ(xi ξ, yi, θ) 2 bθ n (2K) θ n 2

2 C3 bθ n (2K) θ n 2

by Assumption 4. Thus, by Assumption 3, we have

E[R K,θRK,θ] = O(2 2K)

Blanchet, Kang, Olea, Nguyen and Zhang

as K . Moreover, by the multivariate version of Theorem 2 in Bahr (1965) which

follows from the Cram er Wold theorem, we have that

k=1 θθℓ(xi ξk i , yi, θ n ) θθEQ h ℓ(X ξ, Y, θ n )|X = xi, Y = yi i

is O(2 2K).

We can express R KRK as RK 2. The Cauchy Schwarz inequality implies

k=1 θθℓ(xi ξk i , yi, θ n ) θθEQ h ℓ(X ξ, Y, θ n )|X = xi, Y = yi i

bθ n (2K) θ n 2

By H older s inequality we have

k=1 θθℓ(xi ξk i , yi, θ n ) θθEQ h ℓ(X ξ, Y, θ n )|X = xi, Y = yi i

E bθ n (2K) θ n 4

Finally, consider the solutions bθ n (2K l +1), bθO n (2K l ), bθE n (2K l ) conditional on K l = K. De-

note the remainder terms in Equation (41) corresponding to the level 2K+1 solution bθ n (2K+1)

as R K+1, R K+1,θ. Similarly, denote the remainder terms in Equation (41) corresponding to

the level-2K solution bθO n (2K) (and, respectively, bθE n (2K)) as RO K, RO K,θ ( RE K, RE K,θ). By the

construction of bθO n (2K), bθE n (2K) using odd and even indices, we have, from Equation (41)

i=1 θθEQ [ℓ(X ξ, Y, θ n )|X = xi, Y = yi] bθ n (2K+1) 1

2(bθO n (2K) + bθE n (2K))

2(RO K + RE K) + R K+1,θ 1

2(RO K,θ + RE K,θ).

Dropout Training is Distributionally Robust Optimal

By Assumption 4,

i=1 θθEQ [ℓ(X ξ, Y, θ n )|X = xi, Y = yi] = n θθEQ [ℓ(X ξ, Y, θ n )]

is invertible. Thus, we have shown that

K bθ n (2K+1) 1

2(bθO n (2K) + bθE n (2K))

= n θθEQ [ℓ(X ξ, Y, θ n )] 1 R K+1 1

2(RO K + RE K) + R K+1,θ 1

2(RO K,θ + RE K,θ) .

Since each of the terms on the right-hand side has been shown to be O(2 2K), we conclude

that E[ K K] = O(2 2K).

A.5 Dropout Training in Linear Regression

Here, diag(M) denotes the diagonal matrix formed by the diagonal elements of M.

Corollary 2 (Linear regression with φ = 1) For linear regression with ℓ(x, y, β) = (β x

y)2, we have

min β Rd max Q U(b Pn,δ) EQ h β (X ξ) Y 2i = min β Rd EQ h β (X ξ) Y 2i ,

where Q = b Pn Q 1 . . . Q d and Q j = (1 δ) 1 Bernoulli(1 δ) for each j = 1, . . . , d.

min β Rd EQ h β (X ξ) Y 2i = min β Rd 1 n

(Y Xβ) (Y Xβ) + δ 1 δβ Λβ , (42)

where Λ = diag(X X) which implies that the dropout training estimator equals

bβ(δ) = X X + δ 1 δ diag(X X) 1 X Y.

Finally, if EP [XX ] is a diagonal matrix with strictly positive entries then

bβ(δ) p (1 δ)EP [XX ] 1EP [XY ].

Proof The ﬁrst part of the corollary follows directly from (12) and (13) in our main

theorem. The second part of the corollary follows from Proposition 1. According to this

Blanchet, Kang, Olea, Nguyen and Zhang

proposition, the limit of bβ(δ) is

β (δ) = EP [XX ] + (δ/1 δ) diag(EP [XX ]) 1 EP [Y X]. (43)

Thus, if EP [XX ] is a diagonal matrix, we obtained the desired limit.

One interesting implication of Equation (43) is that β (0) which gives the probability

limit of the maximum likelihood estimator generally diﬀers from β (δ) when δ = 0, which

is the probability limit of the dropout estimator.14 In particular, the estimator in Equation

(43) diﬀers from the best linear predictor of y using x as long as EP [Y X] = 0 and δ = 0.

This estimator diﬀers from Ridge regression in that diag(EP [XX ]) replaces the identity

matrix (and this simple adjustment makes the estimator scale equivariant).

A.6 Additional Theoretical Results

A.6.1 Convexity of f is necessary for the optimality of dropout noise

Theorem 4 Let f : R2 R be a function such that f(y, ) : R R admits bounded and

continuous second-order derivatives for any given y R. If, for some δ (0, 1], all d 1,

all β Rd and all reference distributions Q0 (where (X, Y ) Q0), it holds that

sup Q U(Q0,δ) EQ[f(Y, β X ξ)] = EQ [f(Y, β X ξ)], (44)

then f(y, ) is a convex function on the real line for any given y.

Proof Since (44) holds for any reference distribution Q0, we consider the case where Q0 is given by the Dirac measure on the point (x, y), where x Rd, y R. Since dropout is

the solution to problem (44), we have that the dropout noise on the last coordinate, viz.,

Q d = (1 δ) 1 Bernoulli(1 δ), solves

max Qd Qd(δ) EQd h EQ 1 Q d 1

h f(y, β x ξ) ii .

φ(ξd) = EQ 1 Q d 1

j=1 βjxjξj + βdxdξd)

14. Relatedly, Farrell et al. (2021) who study deep neural networks and their use in semiparametric inference report that their numerical exploration of dropout increased bias and interval length compared to nonregularized models.

Dropout Training is Distributionally Robust Optimal

where φ is implicitly indexed by β, x and y. By duality in semi-inﬁnite linear programming

(see Theorem 1 and Equation 4 in Isii 1962), there exists an aﬃne function L( ) such that

φ(ξd) L(ξd) ξd [0, 1 1 δ] with equality at ξd = 0 or 1 1 δ.

Let ξd = (1 α) 1 1 δ for α [0, 1], we have that

φ((1 α) 1 1 δ) L((1 α) 1 1 δ)

= L(α 0 + (1 α) 1 1 δ)

= αL(0) + (1 α)L( 1 1 δ)

= αφ(0) + (1 α)φ( 1 1 δ).

Since β, x, y are arbitrary, this is equivalent to saying that, for all (z1, . . . , zd 1), z, y and

(i1,...,id 1) {0,1}d 1 p(i1, . . . , id 1)f(y,

j=1 zjij + (1 α)z)

(i1,...,id 1) {0,1}d 1 p(i1, . . . , id 1)f(y,

(i1,...,id 1) {0,1}d 1 p(i1, . . . , id 1)f(y,

j=1 zjij + z),

where p(i1, . . . , id 1) is the probability mass function for independent Bernoulli trials i1, . . . , id 1.

Rearranging the inequality, we have

(i1,...,id 1) {0,1}d 1 p(i1, . . . , id 1)

j=1 zjij + (1 α)z) f(y,

(i1,...,id 1) {0,1}d 1 p(i1, . . . , id 1)

j=1 zjij + z) f(y,

By second-order Taylor expansion around z = 0, we have

j=1 zjij + (1 α)z) f(y,

j=1 zjij) = fx(y,

j=1 zjij)(1 α)z + O(1)(1 α)2z2,

Blanchet, Kang, Olea, Nguyen and Zhang

j=1 zjij + z) f(y,

j=1 zjij) = fx(y,

j=1 zjij)z + 1

j=1 zjij) + o(1)

where fx, fxx denote partial derivatives of f with respect to the second argument. Therefore,

we have the former inequality becomes (by the cancellation of terms)

(i1,...,id 1) {0,1}d 1 p(i1, . . . , id 1)O(1)(1 α)

(i1,...,id 1) {0,1}d 1 p(i1, . . . , id 1)

j=1 zjij) + o(1)

Thus letting α 1, and then letting z 0, we obtain

(i1,...,id 1) {0,1}d 1 p(i1, . . . , id 1)fxx(y,

j=1 zjij) 0,

for all zj R, j.

Now we consider z1 = = zd 1 = c R and denote Sd = Pd 1 j=1 ij. Thus Sd follows

the binomial distribution B(d 1, p) where p = (1 δ). We rewrite the previous inequality

ESd B(d 1,p)[fxx(y, c Sd)] 0.

Let us consider c = c (d 1)p, and rewrite

c Sd = c (d 1)p(Sd p(d 1)) + c.

By the standard Central Limit Theorem,

d 1 N(0, p(1 p)),

where denotes convergence in law. Therefore, we have that c Sd converges to the Dirac

measure at c as d . Since fxx(y, ) is bounded and continuous, we conclude that

fxx(y, c) 0 for arbitrary c R. The convexity of f(y, ) then follows.

Dropout Training is Distributionally Robust Optimal

Corollary 3 Consider the single index models with f(y, β x) = (y g(β x))2. Then

sup Q U(Q0,δ) EQ[(Y g(β X ξ))2] = EQ [(Y g(β X ξ))2]

for all δ (0, 1], d 1, all β Rd and all reference distribution Q0 (where (X, Y ) Q0) if

and only if g is a linear function.

Proof The if part is given by Theorem 1 of the paper, as f(y, u) = (y g(u))2 is convex

if g is linear. For the only if part, we know that

fxx(y, u) = 2yg (u) + 2g(u)g (u) + 2(g (u))2 0

Suppose that g (u0) = 0 for some u0 R, then, by choosing a suﬃciently large (in absolute

value) y, we have fxx(y, u0) < 0, a contradiction. Thus g (u) = 0 for all u, which implies

that g must be linear.

A.6.2 Convexity is not sufficient for the optimality of dropout noise

Consider a slight generalization of the set U used in Theorem 1 by deﬁning

Qj(δ) = {Qj : Qj([0, (1 δ) 1]) = 1, EQj[ξj] = 1, Qj([0, ϵ]) = δ}

U(Q0, δ) = {Q0 Q1 . . . Qd : Qj Qj(δ)}.

The only diﬀerence between U(Q0, δ) and our previous construction is that we force the

random variables Qj to have positive mass on the interval [0, ϵ]. Consider the problem

sup Q U(Q0,δ) EQ[f Y, β (X ξ) ]

where f(y, ) : R R is a convex function for every y. Just as we did in the proof of

Theorem 1, we ﬁx ξ2, . . . , ξd, and try to solve

sup Q1 Q1(δ) EQ0 Q1

Y, β1X1ξ1 +

j=2 βj Xjξj

We argue that the optimal solution Q1 in general depends on the values of ξ2, . . . , ξd and,

more importantly, Q1 may fail to include zero in its support. This last point implies that

Blanchet, Kang, Olea, Nguyen and Zhang

Q1 cannot be interpreted as dropout noise (as there is no sense in which the ﬁrst variable

is ever dropped out). We make this point by showing a simple example.

Consider Q0 is the Dirac measure on (x, y) with y > 0 and consider f(y, z) = (y

z+)2 (i.e., square loss with the so-called Re Lu activation function). Now, f(y, β1x1ξ1 + Pd j=2 βjxjξj) = ((β1x1ξ1 + Pd j=2 βjxjξj)+ y)2 = β2 1x2 1((ξ1 + z)+ y)2 where we assume

β1x1 > 0. Thus, the problem is equivalent to

sup Q1 Q1(δ) [((ξ1 + z)+ y)2].

If z ( 0.99, ϵ), then by Isii (1962, Theorem 1 ), the problem has the dual

inf a,b,c a + b + cδ s.t. ((ξ1 + z)+ y)2 a + bξ1 + c1{ξ1 [0,ϵ]} ξ1 0, 1 1 δ

and moreover the optimal solution Q 1 is supported in the set

ξ1 0, 1 1 δ

: ((ξ1 + z)+ y)2 = a + b ξ1 + c 1{ξ1 [0,ϵ]}

where a , b , c constitute the solution to the dual problem. We claim that 0 is not in the

support of Q 1. Otherwise, ξ1 = 0 need to satisfy

((ξ1 + z)+ y)2 = a + b ξ1 + c 1{ξ1 [0,ϵ]}.

Since the left-hand side is constant for ξ1 in a neighborhood of 0, we have that b

0. Since 0.99 < z < ϵ, we have that, for ξ1 > 0.99, ((ξ1 + z)+ y)2 < y2 while

a + b ξ1 + c 1{ξ1 [0,ϵ]} > y2. Hence, the support of Q 1 is a subset of [0, 0.99], contradicting

the requirement that Q 1 has mean 1.

We conclude that, while convexity is somewhat necessary for dropout to be optimal, it

need not be suﬃcient.

A.6.3 Block Dropout

Finally, we generalize the notion of dropout so that the noise distributions Q1, . . . , Qd are

arbitrarily correlated. Thus, we deﬁne

U(Q0, δ) = {Q0 Q1:d : Q1:d([0, (1 δ) 1]d) = 1, EQ1:d[ξ] = (1, . . . , 1) }

and consider the problem

sup Q U(Q0,δ) EQ[f(Y, β X ξ)].

Dropout Training is Distributionally Robust Optimal

We give an example where this is a meaningful generalization of the notion of dropout.

Consider f(y, z) = (y z)2 (i.e., square loss), also that Q0 is a Dirac measure on (x, y).

Then the problem

sup Q U(Q0,δ) EQ1:d

i=1 xiβiξi)

is equivalent to

sup Q U(Q0,δ) EQ1:d

By Theorem 1 in Isii (1962), the problem has dual

inf a0,...,ad

i=0 ai s.t.

i=1 aiξi ξ [0, (1 δ) 1]d

and, moreover, the optimal solution Q 1:d is supported in the set

ξ [0, (1 δ) 1]d :

where a 0, . . . , a d are the optimal dual variables. Suppose that xiβi = 0 for all i. Then,

since Pd i=1 xiβiξi 2 is convex quadratic, any ξ in the support points set (45) must be

in the boundary of the cube [0, (1 δ) 1]d. Moreover, since we have the mean constraint

EQ 1:d[ξ] = (1, . . . , 1) , there exists ξ in (45) with blocks of components comprising zeros.

The above arguments extend to any Q0 and f as long as the function EQ0[f(Y, β X ξ)]

is a strictly convex function of ξ. Solving precisely the support of the noise requires solving

for the dual variables a i , which is a semi-inﬁnite linear programming (ﬁnite-dimensional lin-

ear objective with inﬁnitely many constraints). The numerical methods in the literature ap-

proximate the semi-inﬁnite problem with a sequence of ﬁnite programming problems, which

are then solved by applying appropriate linear or nonlinear programming algorithms Het-

tich and Kortanek (1993). Since the computation of the noise relies on the value of β, we

suspect the overall computational burden will be large (as one ﬁrst needs to solve for the

worst-case block dropout noise, and then solve for the optimal β).

A.6.4 Additive Perturbations in the Linear Regression Model

Let ξ (ξ1, . . . , ξd) be deﬁned as a d-dimensional vector of random variables that are

independent of (X, Y ). We now perturb the distribution Q0 by considering the transforma-

Blanchet, Kang, Olea, Nguyen and Zhang

(X, Y ) 7 (X1 + ξ1, . . . , Xd + ξd, Y ) .

As a result, each covariate Xj is distorted in an additive fashion by ξj. We abbreviate

(X1 + ξ1, . . . , Xd + ξd) by X + ξ.

We restrict the distribution of ξ in the following way. First, for a parameter λ [0, ),

we deﬁne Qj(λ) to be the set of distributions for ξj that have mean 0 and variance EQ0[x2 j].

More speciﬁcally,

Qj(δ) Qj : Qj is a probability distribution on R, EQj[ξj] = 0, EQj[ξ2 j ] λEQ0[x2 j] .

Consider now the joint random vector (X, Y, ξ) Rd R Rd. For a constant λ [0, )

consider the joint distributions over (X, Y, ξ) deﬁned by

U(Q0, λ) = {Q0 Q1 . . . Qd : Qj Qj(λ) j = 1, . . . , d} ,

where denotes the product measure (meaning that the joint distribution is the product

of the independent marginals Qj, j = 0, . . . , d).

For the linear regression model, for any Q U(b Pn, λ) we have

EQ h (Y (X + ξ) β)2i = Eb Pn (Y Xβ)2 +

j=1 β2 j EQj[ξ2 j ],

Eb Pn (Y Xβ)2 +

j=1 β2 j λEb Pn[x2 j],

h (Y Xβ) (Y Xβ) + λβ Λβ i ,

where Λ = diag(X X). Therefore,

sup Q U(b Pn,λ) EQ h (Y (X + ξ) β)2i = 1

h (Y Xβ) (Y Xβ) + λβ Λβ i .

Corollary 2 implies that, for the linear regression model, the distributionally robust es-

timator to additive perturbations equals the dropout estimator with dropout probability

λ/(1 + λ). This means that the dropout estimator remains distributionally robust over the

set of additive and multiplicative perturbations of the empirical distribution:

U(b Pn, δ) U(b Pn, δ/(1 δ)).

Dropout Training is Distributionally Robust Optimal

A.7 Additional Numerical Results

Here, we outline an empirical rationale behind our parameter choices.

A.7.1 Learning Rate

We ﬁrst ﬁx an all-zeros initialization scheme, and vary the learning rate. We summarize

the average parameter divergence and 1-standard deviation error for 20 repetitions of the

SGD algorithm in Table 2. We can observe that the learning rate 0.0001 shows a clear

advantage. We plot the average parameter divergence and 1-standard deviation error for

the SGD trajectory up to the 180 s wall-clock time in Figure 4, which demonstrates that

the divergence saturates with the selected learning rate (also see Figure 5 for detail between

the 30 s and 180 s wall-clock times).

A.7.2 Initialization

Next, we ﬁx the learning rate to be 0.0001, and consider diﬀerent initialization schemes.

We note that the mean value (resp., absolute value) of elements in β n is 0.3947 (resp.,

0.6977). Table 3 shows the average parameter divergence and the 1-standard deviation

from 20 repetitions of the SGD algorithm. We see that the initialization at origin is a fair

A.7.3 Naive Monte Carlo with a fixed K

We compare to the na ıve Monte Carlo implementation with a ﬁxed K, where we do gra-

dient descent on objective (32). The gradient-descent learning rate is searched over the

grid {10 i, i = 0, 1, . . .}. We summarize the average parameter divergence and 1-standard

deviation error for 20 repetitions of the approach (with the best learning rate) in Table 4.

Note that, for small K, the objective (32) has a high bias, while for large K, there are fewer

gradient descent steps completed within 60 s due to heavy computational burdens.

Learning rate 0.1 0.01 0.001 bβSGD β n 1.1026 0.1705 0.2717 0.0403 0.0827 0.0133 Learning rate 0.0001 0.00001 0.000001 bβSGD β n 0.0301 0.0025 0.6702 0.1082 1.7202 0.0044

Table 2: Comparison for diﬀerent learning rates, with ﬁxed zero initializations.

A.7.4 Wall-Clock Time

We document the numerical results for 120 s/180 s wall-clock time; see Figures 6 8 for the

case of 120 s, and Figures 9 11 for the case of 180 s. We see that the proposed unbiased

Blanchet, Kang, Olea, Nguyen and Zhang

Figure 4: SGD trajectory up to 180 s wall-clock time

Figure 5: SGD trajectory between 30 s and 180 s wall-clock time

approach outperforms the standard SGD when the number of parallel iterations reaches

above some threshold.

Initializations all zeros all 0.2 s all 1 s bβSGD β n 0.0301 0.0025 0.0317 0.0047 0.0614 0.0196 Initializations i.i.d. N(0, 1) i.i.d. N(0, 10) i.i.d. N(0, 102) bβSGD β n 0.0376 0.0067 0.1006 0.0469 0.3208 0.1432

Table 3: Comparison for diﬀerent initialization schemes with ﬁxed learning rate 0.0001.

Dropout Training is Distributionally Robust Optimal

Figure 6: l2 diﬀerence for 120 s wall-clock time

Figure 7: l diﬀerence for 120 s wall-clock time

Figure 8: l1 diﬀerence for 120 s wall-clock time

Blanchet, Kang, Olea, Nguyen and Zhang

Figure 9: l2 diﬀerence for 180 s wall-clock time

Figure 10: l diﬀerence for 180 s wall-clock time

Figure 11: l1 diﬀerence for 180 s wall-clock time

Dropout Training is Distributionally Robust Optimal

K 25 210 215

bβnaive MC β n 0.7370 0.1181 0.1281 0.0231 0.7417 0.0046

Table 4: Comparison with na ıve Monte Carlo with a ﬁxed K.

S Alan, O Attanasio, and M Browning. Estimating Euler equations with noisy data: two

exact GMM estimators. Journal of Applied Econometrics, 24(2):309 324, 2009.

B Von Bahr. On the convergence of moments in the central limit theorem. Annals of

Mathematical Statistics, 36(3):808 818, 06 1965.

C M Bishop. Training with noise is equivalent to Tikhonov regularization. Neural Compu-

tation, 7(1):108 116, 1995.

J Blanchet, P Glynn, and Y Pei. Unbiased multilevel Monte Carlo: Stochastic opti-

mization, steady-state simulation, quantiles, and other applications. ar Xiv preprint

ar Xiv:1904.09929, 2019a.

J Blanchet, Y Kang, and K Murthy. Robust Wasserstein proﬁle inference and applications

to machine learning. Journal of Applied Probability, 56:830 857, 2019b.

S Boyd and L Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

T Christensen and B Connault. Counterfactual sensitivity and robustness. ar Xiv preprint

ar Xiv:1904.00989, 2019.

A Das Gupta. Asymptotic Theory of Statistics and Probability. Springer Verlag, 2008.

E Delage and Y Ye. Distributionally robust optimization under moment uncertainty with

application to data-driven problems. Operations Research, 58(3):595 612, 2010.

T Devries and G W Taylor. Improved regularization of convolutional neural networks with

cutout. Co RR, abs/1708.04552, 2017.

D Draper. Assessment and propagation of model uncertainty. Journal of the Royal Statistical

Society, Series B, 56, 1994.

J Duchi and H Namkoong. Variance-based regularization with convex objectives. Journal

of Machine Learning Research, 20(1):2450 2504, 2019.

R Durrett. Probability: Theory and Examples. Cambridge University Press, 2019.

Blanchet, Kang, Olea, Nguyen and Zhang

L Fahrmeir and H Kaufmann. Consistency and asymptotic normality of the maximum

likelihood estimator in generalized linear models. The Annals of Statistics, pages 342

M H Farrell, T Liang, and S Misra. Deep neural networks for estimation and inference.

Econometrica, 89(1):181 213, 2021.

T Ferguson. Mathematical Statistics: A Decision Theoretic Approach, volume 7. Academic

Press New York, 1967.

M B Giles. Multilevel Monte Carlo path simulation. Operations Research, 56(3):607 617,

M B Giles. Multilevel Monte Carlo methods. Acta Numerica, 24:259, 2015.

I Goodfellow, Y Bengio, and A Courville. Deep Learning. MIT Press, 2016.

L P Hansen and T J Sargent. Robustness. Princeton University Press, 2008.

D P Helmbold and P M Long. On the inductive bias of dropout. The Journal of Machine

Learning Research, 16(1):3403 3454, 2015.

R Hettich and K Kortanek. Semi-inﬁnite programming: Theory, methods, and applications.

SIAM Review, 35(3):380 429, 1993.

G E Hinton, N Srivastava, A Krizhevsky, I Sutskever, and R R Salakhutdinov. Improv-

ing neural networks by preventing co-adaptation of feature detectors. ar Xiv preprint

ar Xiv:1207.0580, 2012.

K Hornik, M Stinchcombe, and H White. Multilayer feedforward networks are universal

approximators. Neural Networks, 2(5):359 366, 1989.

J T Hwang. Multiplicative errors-in-variables models with applications to recent data re-

leased by the US Department of Energy. Journal of the American Statistical Association,

81(395):680 688, 1986.

K Isii. On sharpness of Tchebycheﬀ-type inequalities. Annals of the Institute of Statistical

Mathematics, 14(1):185 197, 1962.

J Kim and W Winkler. Multiplicative noise for masking continuous data. Statistics, 1:9,

R H Lyles and L L Kupper. A detailed evaluation of adjustment methods for multiplicative

measurement error in linear regression with applications in occupational epidemiology.

Biometrics, pages 1008 1025, 1997.

Dropout Training is Distributionally Robust Optimal

L Maaten, M Chen, S Tyree, and K Weinberger. Learning with marginalized corrupted

features. In International Conference on Machine Learning, pages 410 418, 2013.

P Mc Cullagh and J A Nelder. Generalized Linear Models. Chapman & Hall, 1989.

T K Nayak, B Sinha, and L Zayatz. Statistical properties of multiplicative noise masking

for conﬁdentiality protection. Journal of Oﬃcial Statistics, 27(3):527, 2011.

A Nemirovski, A Juditsky, G Lan, and A Shapiro. Robust stochastic approximation ap-

proach to stochastic programming. SIAM Journal on Optimization, 19(4):1574 1609,

W K Newey and D Mc Fadden. Large sample estimation and hypothesis testing. Handbook

of Econometrics, 4:2111 2245, 1994.

V A Nguyen, X Zhang, J Blanchet, and A Georghiou. Distributionally robust parametric

maximum likelihood estimation. In Advances in Neural Information Processing Systems

D S Park, W Chan, Y Zhang, C-C Chiu, B Zoph, E D Cubuk, and Q V Le. Spec Augment: A

simple data augmentation method for automatic speech recognition. In Proc. Interspeech

2019, pages 2613 2617, 2019.

D A Pierce, D O Stram, M Vaeth, and D W Schafer. The errors-in-variables problem:

considerations provided by radiation dose-response analyses of the a-bomb survivor data.

Journal of the American Statistical Association, 87(418):351 359, 1992.

A E Raftery, D Madigan, and J A Hoeting. Bayesian model averaging for linear regression

models. Journal of the American Statistical Association, 92(437):179 191, 1997.

H Rahimian and S Mehrotra. Distributionally robust optimization: A review. ar Xiv preprint

ar Xiv:1908.05659, 2019.

H Robbins and S Monro. A stochastic approximation method. The Annals of Mathematical

Statistics, pages 400 407, 1951.

H Scarf. A min max solution of an inventory problem. Studies in the Mathematical Theory

of Inventory and Production, 10:201 209, 1958.

J Schmidt-Hieber. Nonparametric regression using deep neural networks with Re LU acti-

vation function. The Annals of Statistics, 48(4):1875 1897, 2020.

A Shapiro. Distributionally robust stochastic programming. SIAM Journal on Optimization,

27(4):2258 2275, 2017.

Blanchet, Kang, Olea, Nguyen and Zhang

A Shapiro, D Dentcheva, and A Ruszczy nski. Lectures on Stochastic Programming: Model-

ing and Theory. SIAM, 2014.

N Srivastava, G Hinton, A Krizhevsky, I Sutskever, and R Salakhutdinov. Dropout: a

simple way to prevent neural networks from overﬁtting. The Journal of Machine Learning

Research, 15(1):1929 1958, 2014.

S Wager, S Wang, and P S Liang. Dropout training as adaptive regularization. In Advances

in Neural Information Processing Systems 26, pages 351 359. 2013.

M J Wainwright and M I Jordan. Graphical Models, Exponential Families, and Variational

Inference. Now Publishers Inc, 2008.

S Wang and C Manning. Fast dropout training. In International Conference on Machine

Learning, pages 118 126, 2013.

C Wei, S Kakade, and T Ma. The implicit and explicit regularization eﬀects of dropout. In

International Conference of Machine Learning, 2020.

W Wiesemann, D Kuhn, and M Sim. Distributionally robust convex optimization. Opera-

tions Research, 62(6):1358 1376, 2014.

M Zinkevich, M Weimer, L Li, and A J Smola. Parallelized stochastic gradient descent. In

Advances in Neural Information Processing Systems, pages 2595 2603, 2010.