Published as a conference paper at ICLR 2024

DROPOUT METHODS FOR BILEVEL TRAINING TASK

Peiran Yu
Department of Computer Science, University of Maryland, College Park, MD 20740, USA
pyu123@umd.edu

Junyi Li
Department of Computer Science, University of Maryland, College Park, MD 20740, USA
junyili.ai@gmail.com

Heng Huang
Department of Computer Science, University of Maryland, College Park, MD 20740, USA
henghuanghh@gmail.com

ABSTRACT

Bilevel optimization problems appear in many widely used machine learning tasks. Bilevel optimization models are sensitive to small changes, and bilevel training tasks typically involve limited datasets. Therefore, overfitting is a common challenge in bilevel training tasks. This paper considers the use of dropout to address this problem. We propose a bilevel optimization model that depends on the distribution of dropout masks, and we investigate how the dropout rate affects the hypergradient of this model. We propose a dropout bilevel method to solve the dropout bilevel optimization model and then analyze the resulting method from an optimization perspective. Analyzing the optimization properties of methods with dropout is essential because it provides convergence guarantees for methods using dropout; however, there has been limited investigation in this research direction. We provide the complexity of the resulting dropout bilevel method in terms of reaching an ϵ-stationary point of the proposed stochastic bilevel model. Empirically, we demonstrate that overfitting occurs in data cleaning and meta-learning, and that the method proposed in this work mitigates this issue.

1 INTRODUCTION

Bilevel optimization appears in many machine learning tasks such as data cleaning (Shaban et al., 2019), hyperparameter optimization, meta-learning (Franceschi et al., 2018) and reinforcement learning (Yang et al., 2019). A bilevel optimization problem involves two minimization problems that are stacked on top of each other, where the solution of one problem depends on the solution of the other. Take data cleaning as an example. Suppose we have a corrupted training data set D_tr with N_tr data points and a clean validation data set D_val with N_val data points. Let f(w; ξ) be a network parametrized by w. The aim of data cleaning is to train the model with corrupted data by solving

\min_{\lambda \in \mathbb{R}^{N_{tr}}} \frac{1}{N_{val}} \sum_{i=1}^{N_{val}} l(f(w(\lambda); \eta_i); \eta_i) \quad \text{s.t.} \quad w(\lambda) \in \operatorname*{Arg\,min}_{w} \frac{1}{N_{tr}} \sum_{i=1}^{N_{tr}} \sigma(\lambda_i)\, l(f(w; \xi_i); \xi_i), \qquad (1)

where η_i ∈ D_val, ξ_i ∈ D_tr, l is the loss function and σ(λ_i) is the weight of the loss associated with the data point ξ_i. If ξ_i is corrupted, we hope σ(λ_i) can neutralize the influence of ξ_i.

Figure 1: Overfitting in data cleaning tasks. (Panel (c): CIFAR10.)

Another important machine learning task using bilevel optimization is meta-learning (Franceschi et al., 2018; Lorraine et al., 2020b). Suppose we have N different training tasks. The j-th task has a pair of training and validation data sets {(D_{j,v}, D_{j,tr})}. The j-th task is to train a network f(w_j, λ; ξ), where ξ is an input, w_j denotes the parameters for task j, and λ denotes the parameters shared by all N tasks. The goal of meta-learning is to find good shared parameters λ. To this end, the following bilevel model is considered:

\min_{\lambda} \sum_{j=1}^{N} l_j(w_j(\lambda), \lambda; D_{j,v}) \quad \text{s.t.} \quad w(\lambda) \in \operatorname*{Arg\,min}_{w := (w_1, \ldots, w_N)} \sum_{j=1}^{N} l_j(w_j, \lambda; D_{j,tr}), \qquad (2)

where l_j(w_j, λ; D) is the loss of the network f for the j-th task on a data set D.
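To make the weighted lower-level loss in (1) concrete, the following is a minimal PyTorch sketch; the model `w_model`, the per-example loss `loss_fn` (assumed to return one loss per example, e.g. cross entropy with `reduction='none'`), and the batch format are illustrative assumptions, not the exact implementation used in our experiments.

```python
# Illustrative sketch of the two levels in (1); all names here are placeholders.
import torch

def lower_loss(w_model, lam, train_batch, loss_fn):
    # Weighted training loss: example i is weighted by sigma(lambda_i) = sigmoid(lambda_i).
    x, y, idx = train_batch                      # idx selects the per-example weights in lam
    per_example = loss_fn(w_model(x), y)         # shape: (batch,)
    weights = torch.sigmoid(lam[idx])            # sigma(lambda_i) in (1)
    return (weights * per_example).mean()

def upper_loss(w_model, val_batch, loss_fn):
    # Plain validation loss, evaluated at the lower-level solution w(lambda).
    x, y = val_batch
    return loss_fn(w_model(x), y).mean()
```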
As shown in (1) and (2), bilevel training tasks are more complex than single-level training tasks, so bilevel optimization models can be sensitive to changes in the input data and parameters. In addition, bilevel training tasks usually suffer from limited data sets. As evidenced by Figure 1, overfitting happens in the data cleaning problem: the increase in the training accuracy is accompanied by a decrease in the testing accuracy of the classifier. Regarding meta-learning, overfitting in the model learned through bilevel optimization can result in a diminished ability to learn new tasks. As shown in our experiments section, in classification problems the training accuracy is high while the testing accuracy is not. Therefore, overfitting is a common challenge in bilevel training tasks, and we need to address it.

In single-level machine learning optimization problems such as empirical risk minimization, there are many ways to remedy the overfitting problem, including dropout, early stopping, data augmentation, etc. Among these methods, dropout is highly effective, simple to apply and computationally efficient (Srivastava et al., 2014; Labach et al., 2019). Proposed by Srivastava et al. (2014), dropout randomly drops neurons of the neural network during training, so that only a portion of the parameters is updated in each epoch. Dropout has been successfully applied to fully connected networks (Ba & Frey, 2013; Srivastava et al., 2014), convolutional neural networks (CNNs) (Wu & Gu, 2015; Srivastava et al., 2014; Park & Kwak, 2017), and recurrent layers in recurrent neural networks (RNNs) (Pham et al., 2014; Zaremba et al., 2014).

In this work, we investigate how dropout behaves when adopted to solve bilevel optimization problems. Since dropout only randomly zeros out part of the neurons in a network, we propose a bilevel optimization model that characterizes this randomness. Existing bilevel methods are not directly applicable to this new model, so we use a representative bilevel method from Li et al. (2022) as an example of how existing bilevel methods can be adapted to it. We then investigate theoretical convergence guarantees of the resulting dropout bilevel method. Existing analysis of dropout methods from an optimization perspective is very limited; as far as we are aware, only (Li et al., 2016; Mianjy & Arora, 2020; Senen-Cerda & Sanders, 2020; Senen-Cerda & Sanders, 2022) investigated this direction previously, and these works only focus on single-level optimization problems. For bilevel problems, there are no theoretical convergence guarantees for dropout methods. In this work, we fill this gap and study the convergence properties of a dropout bilevel method. The main challenge in analyzing the dropout bilevel method, which the single-level model considered in (Senen-Cerda & Sanders, 2020) does not face, is how to analyze the hypergradient. The hypergradient is hard to analyze for two reasons: 1. it is related to the solution of the lower-level problem; 2. both the upper- and lower-level objectives are composed with random dropout masks.

Our contributions are summarized as follows:

- We formulate a statistical bilevel optimization problem that incorporates the distribution of the dropout masks.
- To solve this problem, we propose a dropout variant of an existing bilevel method from (Li et al., 2022).
- We investigate the inductive relations between the variables that the dropout bilevel method uses to approximate the hypergradient of the upper-level objective. These variables are affected by the dropout masks, and the variance term in the inductive relations is affected by the dropout rates.
- We show that the complexity of the dropout bilevel method depends on the dropout rate. Unlike the complexity of bilevel methods for the bilevel model without dropout, the complexity of the dropout bilevel method for the dropout bilevel model contains additional errors that are brought about only by the dropout. The challenge in this analysis is how to handle the distributions of dropout masks that appear when estimating the error of the hypergradient of the upper-level objective.
- We apply the proposed method to data cleaning problems. In our experiments, we observe the overfitting problem, and we observe that the proposed dropout bilevel method alleviates it. We also observe that the dropout bilevel method converges and that the accuracy changes across iterations are more stable when dropout is used.

1.1 RELATED WORK

Theoretical properties of dropout methods for single-level optimization problems have been investigated in (Baldi & Sadowski, 2013; Gal & Ghahramani, 2016; Zhai & Wang, 2018; Li et al., 2016). In particular, Baldi & Sadowski (2013) introduced a general formalism for studying dropout and used it to analyze the averaging and regularization properties of dropout. Gal & Ghahramani (2016) formulated dropout training as approximate Bayesian inference in deep Gaussian processes and studied the model uncertainty. (Zhai & Wang, 2018; Gao & Zhou, 2016; Wang et al., 2019) studied Rademacher complexity bounds of the dropout method. Li et al. (2016); Mianjy & Arora (2020); Senen-Cerda & Sanders (2020); Senen-Cerda & Sanders (2022) investigated the convergence of dropout methods for single-level risk minimization tasks. Li et al. (2016) view dropout as a data-dependent regularizer added to the training loss; in contrast, our approach forms a stochastic minimization model that treats dropout masks as a random linear transformation within the training loss. Mianjy & Arora (2020) considered a dropout method for a 2-layer network; when the loss function is the logistic loss (convex) and the activation function is ReLU, they provided the convergence rate of the testing error. In this work, we consider general multi-layer neural networks and investigate the training error. In addition, they assume that the data distribution is separable by a margin in a particular Reproducing Kernel Hilbert Space, while we make no assumption on the data distribution. (Senen-Cerda & Sanders, 2020) is closely related to our work; they study the convergence properties of dropout stochastic gradient methods for minimizing the empirical risks of fully connected networks. Different from (Li et al., 2016; Mianjy & Arora, 2020; Senen-Cerda & Sanders, 2020; Senen-Cerda & Sanders, 2022), we focus on bilevel training tasks.

Popular methods to solve bilevel optimization problems in machine learning have been proposed in (Franceschi et al., 2017; 2018; Finn et al., 2017a; Li et al., 2022; Gould et al., 2016; Lorraine et al., 2020a; Bae & Grosse, 2020).
Most of them are gradient-based methods, and they can be further divided into two types based on the way they approximate the hypergradient. The first type is iterative differentiation (ITD) (Franceschi et al., 2017; 2018; Finn et al., 2017a; Liu et al., 2020; Ghadimi & Wang, 2018; Ji et al., 2021; Rajeswaran et al., 2019), and the second type is approximate implicit differentiation (AID) (Chen et al., 2022; Ji et al., 2021; Li et al., 2022; Gould et al., 2016; Lorraine et al., 2020a). As far as we know, there is no existing work in bilevel optimization that considers the overfitting problem, let alone analyzing a dropout bilevel method for this problem. In this work, we select an ITD method with a relatively simple structure as an example to investigate how dropout affects bilevel training tasks.

2 PRELIMINARIES

In this paper, we denote by R^n the n-dimensional Euclidean space with inner product ⟨·,·⟩ and Euclidean norm ‖·‖. We denote the spectral norm of a matrix A ∈ R^{n×m} as ‖A‖ and the Frobenius norm of A as ‖A‖_F. For any matrices A and B, we write ⟨A, B⟩ := trace(AᵀB). For a random variable ξ defined on a probability space (Ξ, Σ, P), we denote its expectation as E_ξ. Given an event A and a function f of ξ, the conditional expectation is denoted as E_{ξ|A} f(ξ). For a function F defined on R^n × R^m, we denote the function F(x, y) with respect to y for fixed x as F(x, ·) and the function F(x, y) with respect to x for fixed y as F(·, y). For a differentiable function f, we say x is an ϵ-stationary point of f if ‖∇f(x)‖² ≤ ϵ. For a twice differentiable function F : R^m × R^n → R, we denote by ∇_x F(x, y) and ∇_y F(x, y) the partial gradients ∂F(x, y)/∂x and ∂F(x, y)/∂y, respectively. We denote by ∇²F(x, y) the Hessian matrix of F, and write ∇²_{xy} F(x, y) := ∂²F(x, y)/∂x∂y and ∇²_{yy} F(x, y) := ∂²F(x, y)/∂y∂y. For a vector-valued differentiable function f : R^n → R^m, we denote its Jacobian at x as J(f(x)). With a slight abuse of notation, given a distribution P, we let g(x; ξ) denote a function that depends on x ∈ R^n and a data point ξ ∼ P.

In general, a bilevel optimization training task is formulated as follows:

\min_{\lambda \in \mathbb{R}^{n_\lambda}} F(\lambda, w(\lambda)) := F(\lambda, w(\lambda); D_{val}), \quad \text{s.t.} \quad w(\lambda) \in \operatorname*{arg\,min}_{w \in \mathbb{R}^{n_w}} G(\lambda, w) := G(\lambda, w; D_{tr}), \qquad (3)

where D_val is a validation data set, D_tr is a training data set, and F and G are differentiable functions on R^{n_λ} × R^{n_w}. (With a slight abuse of notation, we denote F(λ, w(λ); D_val) as a function F(λ) and G(λ, w; D_tr) as a function G(λ, w).)

As mentioned in Section 1.1, there are various approaches to solving (3). Due to the nested structure of bilevel problems, the methods for (3) are inherently complex. To better understand how dropout is implemented in methods for (3), we use the fully single loop algorithm (FSLA) proposed in (Li et al., 2022) as an example. This method has relatively simple update formulas and a low computational cost per iteration. Step 6 of FSLA performs one SGD step for the lower-level problem, and Step 9 generates an approximation of the hypergradient of the upper-level objective.
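As a concrete illustration of the hypergradient that AID/FSLA-type methods approximate, the following NumPy sketch checks the implicit-differentiation formula ∇F(λ) = ∇_λF(λ, w(λ)) − ∇²_{λw}G(λ, w(λ)) [∇²_{ww}G(λ, w(λ))]⁻¹ ∇_wF(λ, w(λ)) against a finite-difference gradient of λ ↦ F(λ, w(λ)); the quadratic F and G below are assumptions made only for this toy example.

```python
# Minimal numerical check of the implicit-differentiation hypergradient (NumPy).
import numpy as np

rng = np.random.default_rng(0)
n_lam, n_w = 3, 4
A = rng.normal(size=(n_w, n_lam))
b = rng.normal(size=n_w)

def w_star(lam):                       # lower-level solution of G(lam, w) = 0.5*||w - A lam||^2
    return A @ lam

def upper(lam, w):                     # F(lam, w) = 0.5*||w - b||^2 + 0.5*||lam||^2
    return 0.5 * np.sum((w - b) ** 2) + 0.5 * np.sum(lam ** 2)

lam = rng.normal(size=n_lam)
w = w_star(lam)

grad_lam_F = lam                       # partial gradient of F in lam
grad_w_F = w - b                       # partial gradient of F in w
G_ww = np.eye(n_w)                     # Hessian of G in w
G_lw = -A.T                            # mixed second derivative of G (lam, w block)
hypergrad = grad_lam_F - G_lw @ np.linalg.solve(G_ww, grad_w_F)

# Finite-difference gradient of lam -> F(lam, w(lam)) for comparison.
eps, fd = 1e-6, np.zeros(n_lam)
for i in range(n_lam):
    e = np.zeros(n_lam); e[i] = eps
    fd[i] = (upper(lam + e, w_star(lam + e)) - upper(lam - e, w_star(lam - e))) / (2 * eps)

print(np.max(np.abs(hypergrad - fd)))  # agrees up to finite-difference error
```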
2.1 DROPOUT

Let f(w; ξ) be an l-layer network, where w is the weight and ξ is an input data point. In particular, we assume

f(w; \xi) = a_l\big(w_l\,(a_{l-1}(w_{l-1} \cdots a_1(w_1 \xi)))\big), \qquad (4)

where w_i ∈ R^{n_i × o_i} contains the parameters of the i-th layer (for simplicity, the bias parameters are included in w_i with a corresponding fixed input of 1) and a_i is the activation function of the i-th layer. We let w = ((w_1)_1, …, (w_1)_{n_1}, …, (w_l)_1, …, (w_l)_{n_l}) be the collection of all rows of all weight matrices, where (w_i)_j is the transpose of the j-th row of w_i. Thus w is an no-dimensional vector, where n = Σ_i n_i and o = Σ_i o_i. Let l be a loss function. When training f with the loss function l, the forward pass computes

F(w; \xi) = l(f(w; \xi); \xi). \qquad (5)

In the i-th layer, the forward pass of an input ξ_i ∈ R^{n_i} using dropout can be written as

r_i = m_i \odot a_i(w_i \xi_i), \qquad (6)

where ξ_0 = ξ, ⊙ denotes the Hadamard product of two vectors, and m_i ∈ R^{n_i} with (m_i)_j ∼ Bernoulli(p_i) for some p_i ∈ [0, 1]; m_i is called the dropout mask at the i-th layer. Note that this form of dropout mask is more general than the original one proposed in Srivastava et al. (2014), where the dropout rate is the same for all rows of weights in the same layer; here we allow the drop rate of each row of the weight matrix to be different. Note also that many well-known activation functions vanish at 0; for example, tanh, the centered sigmoid and ReLU have this property. Throughout this work, we make the following assumption.

Assumption 1. For any a_i used in (4), a_i(0) = 0.

Under Assumption 1, r_i in (6) can alternatively be written as

r_i = a_i\big(m_i \odot (w_i \xi_i)\big) = a_i\big(\operatorname{diag}((m_i)_1, \ldots, (m_i)_{n_i})\, w_i \xi_i\big). \qquad (7)

Recalling that for any layer i, (m_i)_j ∼ Bernoulli(p_i), the right-hand side of (7) means that the dropout method randomly keeps the j-th row of the weight matrix w_i with probability p_i and sets the remaining rows to 0. Thus, given a loss function l, under Assumption 1 the forward pass with dropout in fact computes l(f(mw; ξ); ξ) = F(mw; ξ), where F is defined in (5) and

m := \operatorname{diag}\big((m_1)_1, \ldots, (m_1)_{n_1}, \ldots, (m_l)_1, \ldots, (m_l)_{n_l}\big). \qquad (8)

The j-th diagonal element satisfies m_{jj} ∼ Bernoulli(p_i) when j ∈ {Σ_{k=1}^{i−1} n_k + 1, …, Σ_{k=1}^{i} n_k}. Note that in the above discussion, each row of the weight matrix in each layer is always viewed as one element. For notational simplicity, and without loss of generality, in the rest of this paper, when we mention the weight of a network, we view the weight matrix of the i-th layer as a vector in R^{n_i}, where n_i is the number of outputs of the i-th layer. Then we view w, the collection of all rows of all layers, as a vector in R^n.
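The identity in (7) that Assumption 1 enables can be checked directly; the following NumPy sketch (with ReLU as an illustrative activation satisfying a(0) = 0) confirms that masking the layer output equals masking the rows of the weight matrix.

```python
# Sketch of the identity behind (6)-(7): masking activations equals masking rows of W
# whenever the activation satisfies a(0) = 0 (here a = ReLU).
import numpy as np

rng = np.random.default_rng(1)
n_out, n_in, p = 5, 8, 0.7
W = rng.normal(size=(n_out, n_in))
x = rng.normal(size=n_in)
m = rng.binomial(1, p, size=n_out)          # one Bernoulli(p) mask entry per row of W

relu = lambda z: np.maximum(z, 0.0)

lhs = m * relu(W @ x)                       # m ⊙ a(Wx): dropout applied to the output
rhs = relu((np.diag(m) @ W) @ x)            # a(diag(m) W x): rows of W dropped instead
print(np.allclose(lhs, rhs))                # True, since relu(0) = 0
```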
3 DROPOUT IN BILEVEL PROBLEMS

As introduced in Section 2.1, with dropout the objectives are compositions of the original objectives with the dropout masks. Thus, we propose the following variant of (3) that accounts for the distribution of the dropout masks:

\min_{\lambda \in \mathbb{R}^{n_\lambda}} F(\lambda) := \mathbb{E}_{m_\lambda, m_w} F(m_\lambda \lambda, m_w w(\lambda)), \quad \text{s.t.} \quad w(\lambda) \in \operatorname*{arg\,min}_{w \in \mathbb{R}^{n_w}} G(\lambda, w) := \mathbb{E}_{m_\lambda, m_w} G(m_\lambda \lambda, m_w w), \qquad (9)

where F and G on the right-hand side are the same as in (3), and m_λ and m_w are random diagonal matrices with (m_λ)_{ii} ∼ Bernoulli(p_{i,λ}) and (m_w)_{ii} ∼ Bernoulli(p_{i,w}), respectively. This idea of modeling the dropout masks explicitly is also used in the analysis of dropout SGD for single-level training tasks in (Senen-Cerda & Sanders, 2020); for bilevel learning tasks, this model is new. We add dropout masks to both the upper- and lower-level objective functions in order to cover applications where λ and w are both weights of a network (Franceschi et al., 2018; Zügner & Günnemann, 2019; Finn et al., 2017a; Snell et al., 2017).

We now derive the dropout bilevel method based on FSLA. Before presenting the general form of FSLA with dropout, let us consider applying it to the example mentioned in (1). The data cleaning problem (1) is a special case of (3) with F(λ, w(λ); D_val) = (1/N_val) Σ_{i=1}^{N_val} l(f(w(λ); η_i); η_i) and G(λ, w; D_tr) = (1/N_tr) Σ_{i=1}^{N_tr} σ(λ_i) l(f(w; ξ_i); ξ_i). As illustrated in Section 2.1, suppose m is the dropout mask added to the forward pass of the network used in (1) at each iteration. Then the forward pass of the upper-level objective used in FSLA becomes F(λ, m w(λ); D_val), where m is defined in (8), and the forward pass of the lower-level objective becomes G(λ, m w; D_tr). By the chain rule, the backward pass needed in FSLA computes

\nabla_w\big(F(\lambda, m(\cdot); D_{val})\big)(w) = m\, \nabla_w F(\lambda, m w; D_{val}); \quad
\nabla_w\big(G(\lambda, m(\cdot); D_{tr})\big)(w) = m\, \nabla_w G(\lambda, m w; D_{tr}); \quad
\nabla^2_{ww}\big(G(\lambda, m(\cdot); D_{tr})\big)(w) = m\, \nabla^2_{ww} G(\lambda, m w; D_{tr})\, m. \qquad (10)

Based on this, we present Algorithm 1.

Algorithm 1 FSLA with dropout
1: Input: β, α, γ, η, δ > 0, λ^0, w^0, dropout rates {p_{i,λ}}_{i=1}^{n_λ} and {p_{i,w}}_{i=1}^{n_w}, r > 0, and an initial estimate ∇̄F(λ^0, w^0, v^{−1}).
2: for k = 0, …, T do
3:   α_k = δ/√(k+1); γ_k = γ α_k; β_k = β α_k; η_k = η α_k.
4:   Generate random diagonal matrices m^k_λ, m^k_w, m̄^k_λ and m̄^k_w with (m^k_λ)_{ii}, (m̄^k_λ)_{ii} ∼ Bernoulli(p_{i,λ}) and (m^k_w)_{ii}, (m̄^k_w)_{ii} ∼ Bernoulli(p_{i,w}).
5:   Sample ξ^k ∈ D_tr and η^k ∈ D_val.
6:   w^{k+1} = w^k − γ_k m^k_w ∇_w G(m^k_λ λ^k, m^k_w w^k; ξ^k).
7:   v^{k+1} = β_k m^k_w ∇_w F(m^k_λ λ^k, m^k_w w^{k+1}; η^k) + (I − β_k m^k_w ∇²_{ww} G(m^k_λ λ^k, m^k_w w^k; ξ^k) m^k_w) v^k.
8:   v^{k+1} = Proj_{B(0,r)}(v^{k+1}).
9:   ∇̄F(λ^k, w^{k+1}, v^{k+1}) = m^k_λ ∇_λ F(m^k_λ λ^k, m^k_w w^{k+1}; η^k) − m^k_λ ∇²_{λw} G(m^k_λ λ^k, m^k_w w^{k+1}; ξ^k) m^k_w v^{k+1}.
10:  d^{k+1} = ∇̄F(λ^k, w^{k+1}, v^{k+1}) + (1 − η_k)(d^k − ∇̄F(λ^{k−1}, w^k, v^k)).
11:  λ^{k+1} = λ^k − α_k d^{k+1}.
12: end for

We add the projection in Step 8 to prevent v^{k+1} from blowing up; this is also important in the theoretical analysis. In the next section, we analyze the convergence of Algorithm 1.
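For intuition, the following NumPy sketch mimics the structure of Algorithm 1 on a toy quadratic instance. It is a simplified illustration under several assumptions (full-batch gradients instead of samples from D_tr/D_val, fixed step sizes, and an exponential-moving-average version of Step 10), not a faithful reimplementation of FSLA.

```python
# Toy sketch of the update pattern of Algorithm 1 (NumPy); objectives are illustrative.
import numpy as np

rng = np.random.default_rng(2)
n_lam, n_w = 3, 4
A = rng.normal(size=(n_w, n_lam)); b = rng.normal(size=n_w)
p_lam, p_w = 0.9, 0.8                  # Bernoulli parameters of the dropout masks
r = 100.0                              # radius for the projection of v

# Toy objectives: G(lam, w) = 0.5*||w - A lam||^2 (strongly convex in w),
#                 F(lam, w) = 0.5*||w - b||^2.
grad_w_G   = lambda lam, w: w - A @ lam
hess_ww_G  = lambda lam, w: np.eye(n_w)
hess_lw_G  = lambda lam, w: -A.T       # mixed second derivative of G
grad_lam_F = lambda lam, w: np.zeros(n_lam)
grad_w_F   = lambda lam, w: w - b

lam, w = rng.normal(size=n_lam), rng.normal(size=n_w)
v, d = np.zeros(n_w), np.zeros(n_lam)
alpha, gamma, beta, eta = 0.05, 0.5, 0.5, 0.5

for k in range(2000):
    m_lam = np.diag(rng.binomial(1, p_lam, size=n_lam)).astype(float)
    m_w   = np.diag(rng.binomial(1, p_w,  size=n_w)).astype(float)
    L, W_ = m_lam @ lam, m_w @ w                       # masked variables
    # Step 6: masked SGD step on the lower-level problem.
    w = w - gamma * (m_w @ grad_w_G(L, W_))
    W_ = m_w @ w
    # Steps 7-8: update v and project it onto the ball B(0, r).
    v = beta * (m_w @ grad_w_F(L, W_)) \
        + (np.eye(n_w) - beta * m_w @ hess_ww_G(L, W_) @ m_w) @ v
    v = v * min(1.0, r / (np.linalg.norm(v) + 1e-12))
    # Step 9: hypergradient estimate.
    g = m_lam @ grad_lam_F(L, W_) - m_lam @ hess_lw_G(L, W_) @ m_w @ v
    # Steps 10-11, simplified: exponential moving average of the estimates, then the upper step.
    d = eta * g + (1 - eta) * d
    lam = lam - alpha * d

print(lam)  # with dropout, the limit point need not minimize the original bilevel objective
```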
4 ANALYSIS OF ALGORITHM 1

A challenge in analyzing (9), which the single-level model considered in (Senen-Cerda & Sanders, 2020) does not face, is how to analyze the hypergradient ∇F(λ). It is hard to analyze due to its relation with the solution of the lower-level problem and the fact that both the upper- and lower-level objectives are composed with random dropout masks. To analyze this and Algorithm 1 for (9), we first make the following assumptions, which are standard in the bilevel optimization literature.

Assumption 2. Let F(λ, w) be defined as in (3). Suppose the following hold:
(i) F is Lipschitz continuous with modulus L_F.
(ii) For any fixed λ, ∇_λ F(λ, ·) and ∇_w F(λ, ·) are Lipschitz continuous with moduli L^F_{12} and L^F_{22}, respectively.
(iii) There exists C_F > 0 such that max{‖∇_w F(λ, w)‖, ‖∇_λ F(λ, w)‖} ≤ C_F for any λ and w.
(iv) For any fixed w, ∇_w F(·, w) is Lipschitz continuous with modulus L^F_{wλ} > 0.

Assumption 3. Let G(λ, w) be defined as in (3). Suppose the following hold:
(i) G is twice continuously differentiable and ∇G(λ, w) is Lipschitz continuous with modulus L_G.
(ii) For any λ, G(λ, ·) is strongly convex with modulus µ.
(iii) For any λ and any i, j ∈ {λ, w}, ∇²_{ij} G(λ, ·) is Lipschitz continuous with modulus L_{Gw}.
(iv) For any w and any i, j ∈ {λ, w}, ∇²_{ij} G(·, w) is Lipschitz continuous with modulus L_{Gλ}.
(v) There exists C_G such that max_{i,j∈{λ,w}} ‖∇²_{ij} G(λ, w)‖ ≤ C_G for any λ and w.

Denote

M = \begin{pmatrix} m_\lambda & 0 \\ 0 & m_w \end{pmatrix} \quad \text{and} \quad F_M(\lambda) := F\big(m_\lambda(\cdot), m_w w(\cdot)\big)(\lambda).

Then

\nabla F(\lambda) = \mathbb{E}_{m_\lambda, m_w} \nabla F_M(\lambda). \qquad (11)

Under Assumptions 2 and 3, the chain rule yields

\nabla F_M(\lambda) = m_\lambda \nabla_\lambda F(m_\lambda \lambda, m_w w(\lambda)) + J(w(\lambda))^T m_w \nabla_w F(m_\lambda \lambda, m_w w(\lambda)), \qquad (12)

where J(w(λ)) is the Jacobian of w(λ). The next proposition gives a closed form of J(w(λ)) and its properties.

Proposition 1. Consider (9) and suppose Assumption 3 holds. Denote W = (λ, w) and M = diag(m_λ, m_w). Let \underline{p}_w := min_i p_{i,w}. Then G in (9) is strongly convex with modulus \underline{p}_w µ, the equalities

\nabla G(\lambda, w) = \mathbb{E}_M\, \nabla\big(G(M(\cdot))\big)(W), \qquad \nabla^2 G(\lambda, w) = \mathbb{E}_M\, \nabla^2\big(G(M(\cdot))\big)(W) = \mathbb{E}_M\, M\, \nabla^2 G(MW)\, M

hold, and

J(w(\lambda)) = -\nabla^2_{ww} G(\lambda, w(\lambda))^{-1}\, \nabla^2_{\lambda w} G(\lambda, w(\lambda)). \qquad (13)

In addition, w(λ) is Lipschitz continuous with some modulus L_w, and for any λ it holds that ‖J(w(λ))‖_F ≤ C_G/(\underline{p}_w µ).

Based on Proposition 1, we can estimate the variance of ∇²G(λ, w).

Lemma 1. Let G be defined as in (3). Let m_λ and m_w be diagonal matrices with (m_λ)_{ii} ∼ Bernoulli(p_{i,λ}) and (m_w)_{jj} ∼ Bernoulli(p_{j,w}). Let p_i = p_{i,λ} if i ∈ {1, …, n_λ} and p_i = p_{i−n_λ,w} if i ∈ {n_λ+1, …, n_λ+n_w}. Then

\mathbb{E}_M \big\| \nabla^2\big(G(M(\cdot))\big)(W) - \mathbb{E}_M \nabla^2\big(G(M(\cdot))\big)(W) \big\|^2 \le \sum_{i,j} p_i p_j (1 - p_i p_j)\, C_G^2. \qquad (14)

Next, we show how the update of the lower-level parameters behaves. Since Algorithm 1 uses stochastic gradients and Hessians, we add the following assumptions.

Assumption 4. Consider (9). Let ξ be randomly drawn from D_tr and η be randomly drawn from D_val. Suppose E_η ∇_λ F(λ, w; η) = ∇_λ F(λ, w) and E_ξ ∇_w G(λ, w; ξ) = ∇_w G(λ, w). Suppose further that E_η ‖∇_λ F(λ, w; η) − ∇_λ F(λ, w)‖² ≤ σ², E_ξ ‖∇_w G(λ, w; ξ) − ∇_w G(λ, w)‖² ≤ σ², and, for any i, j ∈ {λ, w}, E_ξ ‖∇²_{ij} G(λ, w; ξ) − ∇²_{ij} G(λ, w)‖² ≤ σ_h².

We now present the inductive relation satisfied by {w^k} generated by Algorithm 1.

Lemma 2. Consider (9). Suppose Assumptions 2, 3 and 4 hold. Let {(λ^k, w^k)} be generated by Algorithm 1, with γ small enough that γµ < 1. Let \bar{p}_w := max_i p_{i,w}. Then there exists ι ∈ (0, 1) such that

\mathbb{E}\|w^{k+1} - w(\lambda^k)\|^2 \le \iota\, \mathbb{E}\|w^k - w(\lambda^{k-1})\|^2 + O(\alpha_k^2)\, \mathbb{E}\|d^{k-1}\|^2 + O(\gamma_k^2)\, \bar{p}_w \sigma^2. \qquad (15)

Remark 1. Note that w(λ^k) is the exact solution of the lower-level problem given λ^k. The induction (15) has additional errors related to the update of the upper-level variable λ^k and to the variance. Note that the variance term in (15) scales linearly with the maximum dropout rate \bar{p}_w.

Next, we present the inductive relation satisfied by {v^k} generated by Algorithm 1. Combining (13) with (12), it is easy to see that

\nabla F_M(\lambda) = m_\lambda \nabla_\lambda F(m_\lambda \lambda, m_w w(\lambda)) - \nabla^2_{\lambda w} G(\lambda, w(\lambda))\, v_{m_w, \lambda}, \qquad (16)

where v_{m_w, λ} := ∇²_{ww} G(λ, w(λ))^{−1} m_w ∇_w F(m_λ λ, m_w w(λ)). This implies that

v_{m_w, \lambda} = \beta\, m_w \nabla_w F(m_\lambda \lambda, m_w w(\lambda)) + \big(I - \beta \nabla^2_{ww} G(\lambda, w(\lambda))\big)\, v_{m_w, \lambda}.

Comparing this fixed-point relation with the update of v^{k+1} in Algorithm 1, we see that v^{k+1} is an approximation of v_{m_w^k, λ^k}.
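This fixed-point relation is what makes the v-update in Algorithm 1 work: iterating v ← βg + (I − βH)v converges to H⁻¹g when H is positive definite and β is small enough. A minimal NumPy check is given below, with an arbitrary positive definite H standing in for ∇²_{ww}G and a vector g standing in for m_w ∇_w F (both are assumptions made only for this illustration).

```python
# Sketch of the fixed-point iteration behind the v-update (NumPy).
import numpy as np

rng = np.random.default_rng(3)
n = 6
B = rng.normal(size=(n, n))
H = B @ B.T + n * np.eye(n)        # symmetric positive definite stand-in for the Hessian
g = rng.normal(size=n)             # stand-in for the masked gradient m_w * grad_w F

beta = 1.0 / np.linalg.norm(H, 2)  # small enough that I - beta*H is a contraction on range(H)
v = np.zeros(n)
for _ in range(500):
    v = beta * g + (np.eye(n) - beta * H) @ v

print(np.max(np.abs(v - np.linalg.solve(H, g))))   # close to 0
```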
The next theorem shows how the difference between v^k and v_{m_w^k, λ^k} evolves with the iteration k and the dropout rate.

Theorem 1. Consider (9). Suppose Assumptions 2, 3 and 4 hold. Let {(λ^k, w^k)} be generated by Algorithm 1. Denote V^k := v^k − v_{m_w^{k−1}, λ^{k−1}}, \bar{p}_λ := max_i p_{i,λ}, \underline{p}_λ := min_i p_{i,λ}, \bar{p}_w := max_i p_{i,w} and \underline{p}_w := min_i p_{i,w}. Suppose r ≥ C_v := L^F_{22}/µ. Then there exists ϱ ∈ (0, 1) such that

\mathbb{E}\|V^{k+1}\|^2 \le \varrho\, \mathbb{E}\|V^k\|^2 + O(\bar{p}_w^2 \beta_k^2)\,\sigma^2 + O(\bar{p}_w \beta_k^2)\, \mathbb{E}\|w^{k+1} - w(\lambda^k)\|^2 + O(\alpha_k^2)\, \mathbb{E}\|d^{k-1}\|^2 + O(\bar{p}_w^2 \beta_k^2)\,\sigma_h^2 + O\big(\bar{p}_\lambda \bar{p}_w (1 - \underline{p}_\lambda \underline{p}_w)\, \beta_k^2\big).

Remark 2. The last term in the above bound is introduced by the dropout of the network and disappears when no dropout is applied to the network parametrized by w.

Now, based on Theorem 1 and Lemma 2, we analyze the convergence of Algorithm 1. Thanks to (11), to find an ϵ-stationary point of (9), i.e., a point with ‖∇F(λ)‖² ≤ ϵ, it suffices to find λ that satisfies E‖∇F_M(λ)‖² ≤ ϵ.

Figure 2: Results from data cleaning (panels include CIFAR10). The first row reports how training and testing accuracy change as the number of iterations increases. The second row details the training and testing accuracy in the final iteration when different dropout rates are applied.

Theorem 2. Consider (9). Suppose Assumptions 2, 3 and 4 hold. Let {(w^k, λ^k)} be generated by Algorithm 1. Suppose F is bounded below by \underline{F}. Then, for a sufficiently small δ in Algorithm 1,

\frac{1}{T} \sum_{k=1}^{T} \mathbb{E}\|\nabla F_{M^k}(\lambda^k)\|^2 \le \frac{1}{T}\Bigg[ \frac{1}{\underline{p}_w \underline{p}_\lambda}\, I_0 + \ln(T+1)\bigg( \frac{\bar{p}_\lambda^2}{\underline{p}_\lambda \underline{p}_w} + \frac{\bar{p}_w}{\underline{p}_w} \bigg)\big(\sigma^2 + \sigma_h^2\big) + \ln(T+1)\Big( \bar{p}_w^2 (1 - \underline{p}_w^2) + \bar{p}_w (1 - \underline{p}_w) \Big) \Bigg],

where I_0 is a constant depending on the initialization.

Remark 3. The first term shows that the convergence rate is slower when \underline{p}_λ \underline{p}_w is small, which is consistent with our experiments in Figure 2. The last term in the above result is introduced by dropout and disappears when p_λ = p_w = 1, i.e., when no dropout is applied. On the other hand, we have to point out that the convergence of E‖∇F_{M^k}(λ^k)‖ does not imply that of ‖∇F(λ^k)‖. Consider a simple 2-dimensional example. Let f(x_1, x_2) := ½ x_1² x_2 + 1000 x_1. Let x^k = (x_1^k, x_2^k) with x_1^k = −(1000p² + 2000(1−p)p)/k and x_2^k = k. Let m = diag(m_1, m_2) with m_1 ∼ Bernoulli(p) and m_2 ∼ Bernoulli(p). Then lim_{k→∞} E_m f(m x^k) = 0. However, lim_{k→∞} ∇f(x^k) = (−1000p² − 2000(1−p)p + 1000, 0) = (1000(1−p)², 0). Therefore, the convergence of E_m f(m x^k) does not necessarily imply the convergence of ∇f(x^k); the gap cannot be closed unless p approaches 1. Hence there are additional errors brought about only by the dropout. This example implies that the dropout method may not be optimizing the original bilevel problem. This is expected: when the original training loss is fully optimized, the model fits the training data too well, which increases the chance of overfitting.
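The example in Remark 3 can be checked numerically. The sketch below (NumPy, with p = 0.5 as an arbitrary choice) evaluates the closed-form expectation E_m f(m x^k), which uses E[m_i] = E[m_i²] = p for Bernoulli masks, together with ∇f(x^k) along the sequence x^k.

```python
# Numerical illustration of the example in Remark 3.
import numpy as np

p = 0.5
c = 1000 * p**2 + 2000 * (1 - p) * p          # chosen so that x1_k * x2_k = -c

def grad_f(x1, x2):
    # Gradient of f(x1, x2) = 0.5*x1^2*x2 + 1000*x1.
    return np.array([x1 * x2 + 1000, 0.5 * x1**2])

def expected_dropout_f(x1, x2):
    # E_m f(m1*x1, m2*x2) with independent m1, m2 ~ Bernoulli(p); E[m1] = E[m1^2] = p.
    return 0.5 * p * p * x1**2 * x2 + 1000 * p * x1

for k in [10, 100, 10000]:
    x1_k, x2_k = -c / k, float(k)
    print(k, expected_dropout_f(x1_k, x2_k), grad_f(x1_k, x2_k))
# E_m f(m x^k) tends to 0, while grad f(x^k) tends to (1000*(1-p)**2, 0), which is nonzero.
```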
5 EXPERIMENTS

In this section, we test Algorithm 1 on the data cleaning task (1) and the meta-learning problem (2). The experiments were conducted on a machine equipped with an Intel Xeon E5-2683 CPU and 4 Nvidia Tesla P40 GPUs.

Data cleaning. We followed the approach outlined in (Srivastava et al., 2014) and utilized a fully connected network. To introduce unclean data, we randomly mislabel 60% of the training data. When training on MNIST (LeCun et al., 2010) and FMNIST, we use a fully connected network with the architecture 784-1024-1024-2048-10 and employ ReLU activation functions for all hidden units. In all experiments conducted on MNIST and FMNIST, we set the dropout rates of all layers to the same value, denoted as p. Dropout is only applied when updating the lower-level parameters. We set γ = 0.01 and train 10000/10000 iterations for 5000/10000 MNIST/FMNIST data points. When training on CIFAR10, we use a 4-layer convolutional neural network plus one fully connected layer. In all CIFAR10 experiments, we apply dropout after each convolutional layer, using a consistent dropout rate of p. We set γ = 0.1 and train 40000 iterations for 40000 data points.

Meta-learning. We conduct experiments on the few-shot learning task. Following the experimental protocol of (Vinyals et al., 2016), we perform learning tasks over the Omniglot dataset, with a train/validation/test split of 102/172/423. We perform 5-way-1-shot classification; more specifically, we perform 5 training tasks (N = 5). For each task, we randomly sample 5 characters from the alphabet assigned to that task, and for each character we select 1 data point for training and 15 samples for validation. We use a 4-layer convolutional neural network plus one fully connected layer. Each convolutional layer has 64 filters of size 3×3 and is followed by a batch-normalization layer (Finn et al., 2017b). The parameters of the convolutional layers are shared between the different tasks (λ), and the last linear layer is the task-specific parameter w_j. In each experiment, we only add dropout to one CNN layer. In all experiments, we let β, α and γ in FSLA and Algorithm 1 be 0.05, 0.1 and 0.8, respectively.

Figure 3: Results from few-shot learning on the Omniglot dataset (panels: (a) Layer 1 with p = 0.8, (b) Layer 2 with p = 0.5, (c) Layer 3 with p = 0.5, (d) Layer 1, (e) Layer 2, (f) Layer 3). The first row reports how training and testing accuracy change as the number of iterations increases. The second row details the training and testing accuracy in the final iteration when different dropout rates are applied to only one layer of the network.

Results. We report the results in Figures 2 and 3. In the first row of both figures, we plot the accuracy against the iteration count during training. In the second row of both figures, we plot how different dropout rates affect the training and testing accuracy. Figure 2 shows that as the training accuracy increases, the testing accuracy decreases; this indicates the occurrence of overfitting in data cleaning tasks. A proper amount of dropout during training enhances the testing accuracy in response. Figure 2 also demonstrates that training the network with dropout leads to convergence at a higher testing accuracy with greater stability. Figure 3 shows the accuracy of the method when dropout with different rates is added to different layers; adding a proper amount of dropout to any layer improves the testing accuracy.

6 CONCLUSION

In this paper, we explore the application of dropout to bilevel training tasks. We propose a stochastic bilevel model that depends on the distribution of dropout masks, adapt an existing bilevel method to include dropout, and analyze the convergence properties of the resulting method for the proposed model. We investigate the inductive relations of the variables used to approximate the hypergradient, and we show how the dropout rates affect the complexity. We believe that other state-of-the-art bilevel methods can also be adapted to the stochastic bilevel model with random dropout masks, and our convergence analysis serves as a first example for such adaptations.

REFERENCES

Jimmy Ba and Brendan Frey. Adaptive dropout for training deep neural networks. In C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger (eds.), Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013. URL https://proceedings.neurips.cc/paper_files/paper/2013/file/7b5b23f4aadf9513306bcd59afb6e4c9-Paper.pdf.

Juhan Bae and Roger B. Grosse. Delta-STN: Efficient bilevel optimization for neural networks using structured response Jacobians. In Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/f754186469a933256d7d64095e963594-Abstract.html.

Pierre Baldi and Peter J. Sadowski. Understanding dropout. In Christopher J. C. Burges, Léon Bottou, Zoubin Ghahramani, and Kilian Q. Weinberger (eds.), Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, Lake Tahoe, Nevada, United States, pp. 2814–2822, 2013.
URL https://proceedings.neurips.cc/paper/2013/hash/71f6278d140af599e06ad9bf1ba03cb0-Abstract.html.

Ziyi Chen, Bhavya Kailkhura, and Yi Zhou. A fast and convergent proximal algorithm for regularized nonconvex and nonsmooth bi-level optimization. 2022. URL https://doi.org/10.48550/arXiv.2203.16615.

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August, 2017a.

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August, 2017b.

Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. Forward and reverse gradient-based hyperparameter optimization. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August, 2017.

Luca Franceschi, Paolo Frasconi, Saverio Salzo, Riccardo Grazzi, and Massimiliano Pontil. Bilevel programming for hyperparameter optimization and meta-learning. In Jennifer G. Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pp. 1563–1572. PMLR, 2018. URL http://proceedings.mlr.press/v80/franceschi18a.html.

Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pp. 1019–1027, 2016. URL https://proceedings.neurips.cc/paper/2016/hash/076a0c97d09cf1a0ec3e19c7f2529f2b-Abstract.html.

Wei Gao and Zhi-Hua Zhou. Dropout Rademacher complexity of deep neural networks. Sci. China Inf. Sci., 59(7):072104:1–072104:12, 2016. doi: 10.1007/s11432-015-5470-z. URL https://doi.org/10.1007/s11432-015-5470-z.

Saeed Ghadimi and Mengdi Wang. Approximation methods for bilevel programming. 2018. URL https://arxiv.org/abs/1802.02246.

Stephen Gould, Basura Fernando, Anoop Cherian, Peter Anderson, Rodrigo Santa Cruz, and Edison Guo. On differentiating parameterized argmin and argmax problems with application to bi-level optimization. 2016. URL http://arxiv.org/abs/1607.05447.

Kaiyi Ji, Junjie Yang, and Yingbin Liang. Bilevel optimization: Convergence analysis and enhanced design. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July, 2021.

Alex Labach, Hojjat Salehinejad, and Shahrokh Valaee. Survey of dropout methods for deep neural networks. CoRR, abs/1904.13310, 2019. URL http://arxiv.org/abs/1904.13310.

Y. LeCun, C. Cortes, and C. Burges. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.

Junyi Li, Bin Gu, and Heng Huang. A fully single loop algorithm for bilevel optimization without Hessian inverse. In Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelfth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022, Virtual Event, February 22 - March 1, 2022.
Zhe Li, Boqing Gong, and Tianbao Yang. Improved dropout for shallow and deep learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pp. 2531–2539, Red Hook, NY, USA, 2016. Curran Associates Inc. ISBN 9781510838819.

Risheng Liu, Pan Mu, Xiaoming Yuan, Shangzhi Zeng, and Jin Zhang. A generic first-order algorithmic framework for bi-level programming beyond lower-level singleton. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July, 2020.

Jonathan Lorraine, Paul Vicol, and David Duvenaud. Optimizing millions of hyperparameters by implicit differentiation. In The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS 2020, 26-28 August, Online [Palermo, Sicily, Italy], 2020a.

Jonathan Lorraine, Paul Vicol, and David Duvenaud. Optimizing millions of hyperparameters by implicit differentiation. In Silvia Chiappa and Roberto Calandra (eds.), The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS 2020, 26-28 August 2020, Online [Palermo, Sicily, Italy], volume 108 of Proceedings of Machine Learning Research, pp. 1540–1552. PMLR, 2020b. URL http://proceedings.mlr.press/v108/lorraine20a.html.

Poorya Mianjy and Raman Arora. On convergence and generalization of dropout training. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 21151–21161. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/f1de5100906f31712aaa5166689bfdf4-Paper.pdf.

Sungheon Park and Nojun Kwak. Analysis on the dropout effect in convolutional neural networks. In Shang-Hong Lai, Vincent Lepetit, Ko Nishino, and Yoichi Sato (eds.), Computer Vision - ACCV 2016, pp. 189–204, Cham, 2017. Springer International Publishing. ISBN 978-3-319-54184-6.

Vu Pham, Théodore Bluche, Christopher Kermorvant, and Jérôme Louradour. Dropout improves recurrent neural networks for handwriting recognition. In 2014 14th International Conference on Frontiers in Handwriting Recognition, pp. 285–290, 2014. doi: 10.1109/ICFHR.2014.55.

Aravind Rajeswaran, Chelsea Finn, Sham M. Kakade, and Sergey Levine. Meta-learning with implicit gradients. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, Vancouver, BC, Canada, 2019.

Albert Senen-Cerda and Jaron Sanders. Almost sure convergence of dropout algorithms for neural networks. CoRR, abs/2002.02247, 2020. URL https://arxiv.org/abs/2002.02247.

Albert Senen-Cerda and Jaron Sanders. Asymptotic convergence rate of dropout on shallow linear neural networks. SIGMETRICS Perform. Eval. Rev., 50(1):105–106, July 2022. ISSN 0163-5999. doi: 10.1145/3547353.3530965. URL https://doi.org/10.1145/3547353.3530965.

Amirreza Shaban, Ching-An Cheng, Nathan Hatch, and Byron Boots. Truncated back-propagation for bilevel optimization. In The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16-18 April, Naha, Okinawa, Japan, 2019.

Jake Snell, Kevin Swersky, and Richard S. Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, Long Beach, CA, USA, 2017.
Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958, 2014.

Oriol Vinyals, Charles Blundell, Tim Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching networks for one shot learning. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, Barcelona, Spain, 2016.

Haotian Wang, Wenjing Yang, Zhenyu Zhao, Tingjin Luo, Ji Wang, and Yuhua Tang. Rademacher dropout: An adaptive dropout for deep neural network via optimizing generalization gap. Neurocomputing, 357:177–187, 2019. doi: 10.1016/j.neucom.2019.05.008. URL https://doi.org/10.1016/j.neucom.2019.05.008.

Haibing Wu and Xiaodong Gu. Towards dropout training for convolutional neural networks. Neural Networks, 71:1–10, 2015. ISSN 0893-6080. doi: 10.1016/j.neunet.2015.07.007. URL https://www.sciencedirect.com/science/article/pii/S0893608015001446.

Zhuoran Yang, Yongxin Chen, Mingyi Hong, and Zhaoran Wang. Provably global convergence of actor-critic: A case for linear quadratic regulator with ergodic cost. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/9713faa264b94e2bf346a1bb52587fd8-Paper.pdf.

Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. CoRR, abs/1409.2329, 2014. URL http://arxiv.org/abs/1409.2329.

Ke Zhai and Huan Wang. Adaptive dropout with Rademacher complexity regularization. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=S1uxsye0Z.

Daniel Zügner and Stephan Günnemann. Adversarial attacks on graph neural networks via meta learning. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.