# convergence_of_bayesian_bilevel_optimization__32139d7e.pdf

Published as a conference paper at ICLR 2024

CONVERGENCE OF BAYESIAN BILEVEL OPTIMIZATION

Shi Fu1 Fengxiang He2 Xinmei Tian1 Dacheng Tao3

1University of Science and Technology of China, 2University of Edinburgh, 3Nanyang Technological University fs311@mail.ustc.edu.cn, F.He@ed.ac.uk, xinmei@ustc.edu.cn, dacheng.tao@gmail.com

This paper presents the first theoretical guarantee for Bayesian bilevel optimization (BBO) that we term for the prevalent bilevel framework combining Bayesian optimization at the outer level to tune hyperparameters, and the inner-level stochastic gradient descent (SGD) for training the model. We prove sublinear regret bounds suggesting simultaneous convergence of the inner-level model parameters and outer-level hyperparameters to optimal configurations for generalization capability. A pivotal, technical novelty in the proofs is modeling the excess risk of the SGDtrained parameters as evaluation noise during Bayesian optimization. Our theory implies the inner unit horizon, defined as the number of SGD iterations, shapes the convergence behavior of BBO. This suggests practical guidance on configuring the inner unit horizon to enhance training efficiency and model performance.

1 INTRODUCTION

Hyperparameter optimization is a crucial step in the practical implementation of deep learning. Inappropriate hyperparameter configurations can lead to poor model performance and ineffective utilization in real-world systems. Bayesian optimization, relying on probabilistic models, is an effective approach for hyperparameter tuning (Snoek et al. (2012)). It can identify near-optimal configurations within a few iterations, even in non-convex and derivative-free scenarios (Frazier (2018)). However, it often focuses solely on hyperparameters while neglecting model parameters. In contrast, bilevel optimization furnishes an alternative framework that concurrently optimizes hyperparameters and model parameters in a unified architecture (Colson et al. (2007); Franceschi et al. (2018)). Specifically, bilevel optimization nests an inner-level problem of model parameter optimization within an outer-level problem of tuning hyperparameters (Bao et al. (2021).

Bayesian bilevel optimization, merging outer-level hyperparameter tuning with inner-level model parameter optimization via SGD, shows significant promise in engineering applications. Notably, the most common application is training neural network parameters with inner SGD, while tuning critical hyperparameters like learning rates and layer widths with outer Bayesian optimization (Nguyen et al. (2020);Dewancker et al. (2016);Snoek et al. (2015)). However, the underlying working mechanisms and theoretical convergence guarantees of this approach remain unclear. Additionally, properly configuring the inner unit horizon presents challenges. Limited inner SGD iterations can impede model convergence, reducing the accuracy of outer Bayesian optimization evaluations. Thus, more Bayesian iterations may be needed to ensure hyperparameter convergence. However, excessive SGD iterations waste resources, revealing a trade-off between model training and hyperparameter tuning iterations (Hasanpour et al. (2016);Li et al. (2017)). While multi-fidelity Bayesian optimization also examines this trade-off ( Nguyen et al. (2020)), it focuses more on utilizing fidelities to reduce optimization costs. Conversely, our objective is to deduce the optimal inner unit horizon that enables simultaneous convergence of model parameters and hyperparameters to generalization optima.

At the outer layer, Bayesian optimization is adopted for hyperparameter tuning. We utilize two acquisition functions: Expected Improvement (EI) and Upper Confidence Bound (UCB). Regarding EI, Bull (2011) studied its convergence in noise-free settings, while Gupta et al. (2022) recently provided convergence analysis in noisy settings with the standard predictive mean incumbent. For UCB, Srinivas et al. (2010) first introduced the method and derived regret bounds scaling as O( TγT ),

Published as a conference paper at ICLR 2024

where γT denotes the maximum information gain between T observations and the Gaussian process model. Chowdhury & Gopalan (2017) further improved it through an enhanced algorithm.

Previous theoretical studies have relied on certain noise assumptions, including bounded (Srinivas et al. (2010); Nguyen & Gupta (2017)) or sub-Gaussian noise (Chowdhury & Gopalan (2017); Gupta et al. (2022)). However, these assumptions do not adequately capture the intricacies of integrating SGD and Bayesian optimization in a bilevel framework. A key limitation in most prior works is their modeling of the noise {εt}T t=1 as a martingale difference sequence. This approach does not align with hyperparameter optimization, where the objective function often represents generalization performance (Snoek et al. (2012)). Another drawback of prior works (Srinivas et al. (2010); Chowdhury & Gopalan (2017)) is the use of UCB balancing coefficients that increase rapidly, which has been shown to be over-exploratory for many practical problems (De Ath et al. (2021)).

To tackle these challenges, a pivotal innovation of our work involves modeling the excess risk of innerlevel SGD-trained parameters as the primary noise source during outer-level Bayesian optimization. This perspective facilitates more adaptable convergence guarantees for the bilevel setting. Modeling the SGD excess risk is critical since it builds a connection between the analytical frameworks of Bayesian optimization and bilevel optimization. The distinct objectives and assumptions of these two frameworks pose difficulties in ensuring theoretical convergence. Our approach bridges this gap, enabling a comprehensive convergence analysis for BBO.

By modeling the excess risk of SGD as noise, we derive sublinear regret bounds for BBO, guaranteeing simultaneous convergence of model parameters and hyperparameters to optimal configurations for generalization capability. This is accomplished through carefully configuring the inner unit horizon. By determining the SGD-to-Bayesian iteration trade-off, we optimize this ratio to improve computational efficiency. Furthermore, we optimize the UCB balancing coefficients based on the iteration ratio, enhancing flexibility and mitigating excessive exploration resulting from rapidly escalating coefficients in previous works. Our analyses provide valuable insights into balancing computational resources in engineering applications. The main contributions of this work include:

1. We provide a novel theoretical analysis of convergence guarantees for generalization performance within a BBO framework. A key technical innovation is modeling the excess risk of SGD-trained parameters as evaluation noise in outer-level Bayesian optimization. This tackles theoretical analysis challenges by bridging the analytical frameworks of Bayesian optimization and bilevel optimization.

2. We establish a regret bound of O( TγT ) for BBO using the EI function, with a noise assumption more aligned with practical scenarios. Our regret bound achieves savings of γT compared to Gupta et al. (2022). This achievement is realized by optimizing the ratio of SGD iterations N to Bayesian optimization iterations T, also offering implementation guidance.

3. We introduce adaptable balancing coefficients βt for the UCB acquisition function. Through this adaptation, we establish a sublinear regret bound for BBO with the UCB that holds even with fewer SGD steps, enhancing the flexibility of the inner unit horizon. Additionally, we overcome the limitations of rapidly increasing coefficients from previous analyses.

2 RELATED WORK

Hyperparameter optimization is crucial for leveraging deep learning s capabilities (Yang & Shami (2020); Elsken et al. (2019)). Techniques span Bayesian optimization (Wu et al. (2019); Victoria & Maragatham (2021)), decision theory (Bergstra & Bengio (2012)), multi-fidelity methods (Li et al. (2017)), and gradient-based approaches (Maclaurin et al. (2015)). We focus on BBO, exploring Bayesian optimization and bilevel frameworks relevant aspects for hyperparameter tuning.

Bayesian optimization. Bayesian optimization (BO) (Osborne & Osborne (2010); Kandasamy et al. (2020)) is a prevalent approach for hyperparameter tuning by efficiently exploring and exploiting hyperparameter spaces (Nguyen et al. (2017)). Gaussian processes (Bogunovic et al. (2018)) are commonly used as priors in BO to model uncertainty and estimate objective function distributions (Bro (2010); Wilson et al. (2014)). Among acquisition functions guiding queries in BO, the EI acquisition function (Jones & Welch (1998); Malkomes & Garnett (2018); Scarlett et al. (2017); Qin et al. (2017)) is one of the most widely utilized for balancing exploration-exploitation (Nguyen & Osborne (2020); Zhan & Xing (2020)). Other acquisitions like UCB (Valko et al. (2013)), knowledge

Published as a conference paper at ICLR 2024

gradient (Frazier et al. (2009); Scott et al. (2011)), Thompson sampling (Chowdhury & Gopalan (2017)), and predictive entropy search (Hernández-Lobato et al. (2014)) cater to various scenarios.

Theoretical analyses of Bayesian optimization have been conducted to understand their convergence properties (Ryzhov (2016);Gupta et al. (2020)). The EI has been studied under specific conditions (Vazquez & Bect (2010); Bull (2011); Gupta et al. (2021)). Adapting EI for noisy evaluations poses challenges. Wang & de Freitas (2014) addressed this issue by introducing an alternative incumbent selection criterion that necessitates an additional global optimization step. Nguyen & Gupta (2017) used the best-observed value as the incumbent, showing sublinear convergence with a minimum improvement stopping condition. Gupta et al. (2022) analyzed EI convergence with the standard predictive mean incumbent. For UCB, Srinivas et al. (2010) established regret bound as O( TγT ). This approach was subsequently enhanced by ( Janz et al. (2020)) via a modified UCB algorithm.

Bilevel optimization. Bilevel optimization offers an alternative framework for hyperparameter tuning by concurrently optimizing hyperparameters and model parameters in a nested architecture (Colson et al. (2007);Franceschi et al. (2018);Bao et al. (2021)). These methods encompass implicit gradient (Bengio (2000);Luketina et al. (2016);Pedregosa (2016);Lorraine et al. (2020)), hypernetworks (Lorraine & Duvenaud (2018);Mac Kay et al. (2019)), unrolled differentiation (Franceschi et al. (2017);Maclaurin et al. (2015);Bao et al. (2021)), and cross-validation (Bergstra & Bengio (2012)). Implicit gradient-based methods estimate the model s performance gradient with respect to hyperparameters without explicit computation (Lorraine et al. (2020)). Hypernetworks generate model weights conditioned on hyperparameters. Cross-validation approximates optimal hyperparameters using grid search (Abas et al. (2020)) or random search (Bergstra & Bengio (2012)). Unrolled differentiation involves optimizing the inner problem over multiple iterations (Fu et al. (2016)).

Remark. While Bayesian and bilevel optimization have advanced theoretically, extending convergence guarantees to BBO remains challenging. Bayesian optimization analyses often disregard the impact of model parameter convergence on noise during evaluations, instead relying on unrealistic noise assumptions. Theoretical bilevel optimization analyses struggle to extend to the non-convex, nonlinear, and derivative-free scenarios inherent to Bayesian optimization.

3 PRELIMINARIES

We consider bilevel optimization involving model parameters and hyperparameters. Let Z, Θ, and Λ denote the data space, the parameter space, and the hyperparameter space, respectively. Given hyperparameters λ Λ, model parameters θ Θ and distribution D on the data space Z, the expected error L(λ, θ) is defined as Ez D[ℓ(λ, θ, z)], where ℓ: Λ Θ Z R is the bounded loss function. Our objective is to find the hyperparameter-model parameter pair that minimizes the expected error over an unknown distribution. This objective entails nesting two search problems: λ = arg min λ Λ L (λ, θ λ) , where θ λ = arg min θ Θ L (λ, θ) . (3.1)

The inner-level objective is to determine the optimal model parameters, denoted as θ λ, for a given hyperparameter λ. It s worth noting that the value of θ λ depends on the choice of the hyperparameter λ. At the outer level, the goal is to identify the optimal hyperparameter λ , which determines its associated model parameters θ λ , collectively minimizing the expected error. This bilevel structure is very classic and has been applied in many practical scenarios, such as in Darts (Liu et al. (2018)).

Regret. Similar to Gupta et al. (2022), we employ regret as a convergence measure. We define λ and θ λ as the global minimum points, as shown in Equation 3.1. During the t-th iteration at the outer level, the output hyperparameters are denoted as λ+ t . Simultaneously, at the inner level, using these hyperparameters λ+ t , we optimize the model parameters through N steps of SGD, denoted as θN λ+ t . The cumulative regret RT at iteration T is calculated as the sum of instantaneous regrets:

RT = PT t=1 rt, where rt = L(λ , θ λ ) L(λ+ t , θN λ+ t ). Our objective is to demonstrate that RT exhibits almost sublinear growth, specifically lim T RT /T = 0.

3.1 INNER LEVEL OF BBO: MODEL TRAINING BY SGD

At the inner level, our goal is to optimize the model parameters θ by minimizing the expected error. Given the limited knowledge of data distribution, direct access to this expected error is challenging.

Published as a conference paper at ICLR 2024

As a solution, we utilize SGD to optimize the empirical error, serving as a proxy for improving model parameters. Previous studies, such as Hardt et al. (2016), have shown SGD s effectiveness in achieving strong generalization performance. Let Str = {ztr i }n i=1 denote the training dataset, with n representing the number of training samples. Formally, the inner-level problem is defined as: minθ Θ Ltr(λ, θ, Str) = 1

n Pn i=1 ℓ(λ, θ, ztr i ) . Below, we present some essential definitions. Definition 1. (Lipschitz, smoothness, and convexity). Let constants K, γ > 0. Consider the function ℓ: Λ Θ Z R. We define the following properties:

Lipschitz Continuity: The loss ℓ(λ, θ, z) is said to be K-Lipschitz continuous with respect to θ if ℓ(λ, θ1, z) ℓ(λ, θ2, z) K θ1 θ2 for any θ1, θ2, z, λ.

Smoothness: The loss ℓ(λ, θ, z) is said to be γ-Smooth with respect to θ if θℓ(λ, θ1, z) θℓ(λ, θ2, z) γ θ1 θ2 for any θ1, θ2, z, λ.

Convexity: The loss ℓ(λ, θ, z) is said to be convex with respect to θ if ℓ(λ, θ1, z) ℓ(λ, θ2, z) + θℓ(λ, θ2, z), θ1 θ2 for any θ1, θ2, z, λ.

The above definitions are standard concepts that have been widely adopted in related works such as Ghadimi & Wang (2018), Ji & Liang (2023), Grazzi et al. (2020), and Ji et al. (2021). Remark 1. Our BBO theoretical framework can be extended to non-smooth and non-convex situations. Emphasizing BBO convergence, we ve mentioned these concepts for clarity. The proposed structure is highly adaptable for various scenarios. Drawing on insights from Charles & Papailiopoulos (2018), the framework extends to non-convex situations under a µ-gradientdominance condition. For non-convex conditions, Theorem 1 indicates an excess risk of O (N 1/2µ 1/2 + N 1/4µ 3/4) log N . In addition, we eliminate the smoothness assumption, adding a term η

N to the excess risk bound, where η is the step size. Alternatively, relaxing the assumption to α-Hölder continuity introduces an additional term O N 1 1 2(1 α) log N to the excess risk bound. The detailed content is in Appendix F.

3.2 OUTER LEVEL OF BBO: HYPERPARAMETER TUNING BY BO

At the outer level, the model parameters θ λ remain fixed, and our objective is to identify the hyperparameters λ Λ that minimize the expected error L (λ, θ λ). Critically, the function L(λ, θ λ) has a uniquely determined value for each λ. This uniqueness arises because, although θ λ = arg min θ Θ L(λ, θ)

might represent a set when λ is specified, for any θ λ in this set, the loss function L(λ, θ λ) consistently yields the unique value min θ Θ L(λ, θ). Therefore, we may treat L(λ, θ λ) as a function solely dependent

on λ, allowing us to conceptualize it as a mapping function from Λ to R.

3.2.1 REGULARITY ASSUMPTIONS

As noted by Scarlett et al. (2017), Bayesian optimization can be viewed from two distinct perspectives: Bayesian and non-Bayesian. Departing from the standard Bayesian optimization framework, this work adopts a more agnostic, non-Bayesian setting (Srinivas et al. (2010)). Specifically, we assume that hyperparameter space Λ is a compact set. We also posit the reward function L(λ, θ λ) lies within a reproducing kernel Hilbert space (RKHS) Hk(Λ) of functions mapping from Λ to R, characterized by a positive semi-definite kernel function k : Λ Λ R. This RKHS Hk(Λ) is determined by its kernel function k( , ) and features an inner product , k satisfying the reproducing property: L(λ, θ λ) = L( , θ ), k(λ, ) k for all L( , θ ) Hk(Λ). Hence, the kernel k effectively represents the evaluation map at each λ Λ through the RKHS inner product. The RKHS norm, defined as L( , θ ) k = p

L( , θ ), L( , θ ) k. We assume the RKHS norm of the unknown L( , θ ) is bounded by B. This assumption suggests the smoothness of the function. This relationship is showcased in the following example that the inequality |L(λ1, θ λ1) L(λ2, θ λ2)| = | L( , θ ), k(λ1, ) k(λ2, ) | L( , θ ) k k(λ1, ) k(λ2, ) k by the Cauchy-Schwarz inequality.

3.2.2 GAUSSIAN PROCESS (GP) REGRESSION

Bayesian optimization employs GPs as flexible surrogate models for the objective function, leveraging their versatile priors and tractable posteriors (Rasmussen & Williams (2006)). Initially, we assume

Published as a conference paper at ICLR 2024

a GP prior, denoted as GPΛ (0, k( , )), for the unknown reward function L( , θ ) over Λ, with k( , ) representing the kernel function associated with the RKHS Hk(Λ). The algorithm collects observations yt = [y1, . . . , yt]T at points At = {λ1, . . . , λt}, where yt = L λt, θ λt + εt, and εt N 0, σ2 denotes independent and identically distributed Gaussian noise for all t. Consequently, given the history H1:t = {λi, yi}t i=1, let σ2 t (λ) = kt(λ, λ), the posterior distribution over L( , θ ) remains a GP, denoted as P (Lt( , θ ) | H1:t 1, λ) = N µt(λ), σ2 t (λ) , where

µt(λ) = kt(λ)T Kt + σ2I 1 yt, kt (λ, λ ) = k (λ, λ ) kt(λ)T Kt + σ2I 1 kt (λ ) ,

where we define kt(λ) = [k (λ1, λ) . . . k (λt, λ)]T , Kt = [k (λ, λ )]λ,λ At. We assume the kernel function k is fixed and well-defined with k(x, x) 1.

3.2.3 ACQUISITION FUNCTION

Acquisition functions are vital in Bayesian optimization as they guide the choice of the next evaluation point. They quantify the potential improvement of each candidate point. This section explores two prevalent acquisition functions: UCB (Srinivas et al. (2010)) and EI (Jones & Welch (1998)).

Upper Confidence Bound. The UCB acquisition balances exploration and exploitation by selecting points with potential improvement while considering evaluation uncertainty. It is defined as:

UCBt(λ) = µt 1(λ) + βtσt 1(λ), (3.2)

where βt denotes coefficients regulating the trade-off. The µt 1(λ) term promotes exploitation by favoring points with lower estimated mean values, thereby increasing the likelihood of improved outcomes. The σt 1(λ) term drives exploration by prioritizing highly uncertain points.

Expected Improvement. The EI quantifies the expected enhancement in the incumbent relative to the objective function value. Typically, the incumbent is chosen as the minimum GP mean observed so far, denoted as µ+ t = minλi At µt 1(λi), consistent with Gupta et al. (2022). The EI is defined as EIt(λ) = E max{µ+ t L(λ, θ λ), 0} | H1:t . The closed-form expression for EI can be derived as:

EIt(λ) = (µ+ t µt 1(λ))Φ µ+ t µt 1(λ)

+ σt 1(λ)ϕ µ+ t µt 1(λ)

, if σt 1(λ) > 0,

where ϕ is the standard normal p.d.f. and Φ is the c.d.f. When σt 1(λ) = 0, we set EIt(λ) = 0.

4 MAIN RESULTS

This section presents theoretical analyses of the convergence guarantees for BBO. Specifically, We derive an excess risk bound for the inner-level SGD-trained model parameters. We also establish regret bounds for the outer-level Bayesian optimization that ensures simultaneous convergence of model parameters and hyperparameters. The BBO algorithms are detailed in the Appendix. Based on the convergence analysis, we characterize the trade-off between model training and tuning.

4.1 INNER LEVEL OF BBO: EXCESS RISK BOUND

First, we present the excess risk bound for inner-level optimization. Theorem 1. Suppose that the function ℓ(λ, θ, z) is K-Lipschitz continuous, γ-smooth and convex with respect to θ, uniformly bounded by M. We perform SGD with step sizes ηj = η 1

N 2/γ on a sample Str drawn from the distribution D at the inner level. This involves performing N steps and obtaining the output result θN λ . Let Sval = zval i m i=1 represent the validation set drawn from the distribution D. The validation error is defined as Lval(λ, θN λ , Sval) = 1 m Pm i=1 ℓ(λ, θN λ , zval i ). Choose N n m. Then, with a probability of at least 1 δ, we have:

Lval(λ, θN λ , Sval) L(λ, θ λ) = O N 1

The complete expression is given by Lval(λ, θN λ , Sval) L(λ, θ λ) φ(N)N 1

2 , where φ(N) =

O K2 log N log(1/δ) + M p

log(1/δ) + M p

log(2/δ) + log3/2(N/δ) .

Published as a conference paper at ICLR 2024

Remark 2. In outer-level Bayesian optimization, the objective is to minimize the expected error L(λ, θ λ). However, as the data distribution is unknown, L(λ, θ λ) cannot be evaluated directly. Instead, the validation loss Lval(λ, θN λ , Sval) serves as a proxy to observe L(λ, θ λ). Specifically, given hyperparameters λ and model parameters θN λ from inner-level SGD, the difference between Lval(λ, θN λ , Sval) and L(λ, θ λ) represents noise when evaluating L(λ, θ λ) in outer optimization. Analyzing the convergence of this excess risk bound Lval(λ, θN λ , Sval) L(λ, θ λ) is of interest.

4.2 OUTER LEVEL OF BBO: REGRET BOUND

This section presents regret bounds for the outer-level Bayesian optimization. The subsequent lemma delineates the concentration of the posterior mean around the true objective function, thereby constituting the theoretical basis for the ensuing regret analysis. Notably, as explicated in Section 3.2, L(λ, θ λ) may be considered as functions mapping from Λ to R.

Lemma 2. let L( , θ ) : Λ R be a member of the RKHS of real-valued functions on Λ with kernel k, with RKHS norm bounded by B. Consider the noise term εt = Lval λt, θN λt, Sval L λt, θ λt , then, with probability at least 1 δ, the following holds for all λ Λ and t 1:

|µt(λ) L(λ, θ λ)| p

B2 + σ 2tφ2(N)N 1σt(λ).

Subsequently, we introduce the definition of maximum information gain, which will prove useful in bounding the cumulative regret. Definition 2. Given the set AT = {λ1, , λT } Λ, let LAT ( , θ ) = [L (λ, θ λ)]λ AT and y AT = LAT ( , θ )+εAT , where εAT N 0, σ2I . With I ( ; ) denoting the mutual information, the maximum information gain after T iterations is defined as γT = max AT Λ I (y AT ; LAT ( , θ )).

Notably, the maximum information gain γT differs under various kernel selections. According to the results from Srinivas et al. (2010), for the Squared Exponential kernel, γT = O(logd+1(T)), whereas based on the findings in Vakili et al. (2021), for the Matérn-ν kernel, γT = e O(T d 2ν+d ).

4.2.1 REGRET BOUND FOR BBO WITH EI FUNCTION

In the ensuing analysis, we establish the primary theoretical contribution of this work, specifically, the regret bound for Bayesian bilevel optimization with the EI acquisition function. For the sake of clarity in notation, we define the auxiliary function τ(z) = zΦ(z) + ϕ(z). Theorem 3. Let δ (0, 1). Assume that the true underlying function L(λ, θ λ) in terms of λ lies in the RKHS Hk(Λ) associated with the kernel k (λ, λ ). Furthermore, consider the noise term εt = Lval λt, θN λt, Sval L λt, θ λt . Specifically, assume that L ( , θ ) k B and define βt = p

B2 + σ 2tφ2(N)N 1. By executing algorithm 1 with EI acquisition and the prior GPΛ (0, k( , )), with probability at least 1 δ, the cumulative regret is bounded as:

RT = O β2 T TγT τ (βT ) βT + TN 1

If we select N T, we attain: RT = O p

Proof Sketch of Theorem 3. Adapting Bayesian regret analysis to the bilevel optimization context with noise induced by SGD poses a central challenge. To bound the cumulative regret RT , the initial step involves constraining the instantaneous regrets rt, defined as rt = L λ+ t , θN λ+ t

L (λ , θ λ ).

Each instantaneous regret rt can be decomposed into three terms: rt = L λ+ t , θN λ+ t

L λ+ t , θ λ+ t

L λ+ t , θ λ+ t

µ+ t + µ+ t L (λ , θ λ ), where µ+ t denotes the incumbent.

The first term, L λ+ t , θN λ+ t

L λ+ t , θ λ+ t

, can be bounded by applying concentration inequalities

and the excess risk result from Theorem 1, yielding an upper bound of O(N 1/2 log N). For the

Published as a conference paper at ICLR 2024

second term, L λ+ t , θ λ+ t

µ+ t , the bound is given by βt

2π (βt + 1) σt 1 (λt) +

2πIt (λt) ,

where It(λt) = max µ+ t L λt, θ λt , 0 . To bound the last term µ+ t L (λ , θ λ ), our approach involves initially establishing lower and upper bounds on the EI acquisition value by leveraging the properties of τ(z). A pivotal technical challenge arises in bounding the discrepancy between the GP posterior mean µt( ) and the true function value L( , θ ) at λt and λ , which requires considering the SGD-induced noise as detailed in Lemma 2. The optimality of λt from maximizing the EI acquisition leads to an upper bound given by: ( 1+βt τ(βt) βt )(It(λt) + (βt + 1)σt 1(λt)), as shown in Lemma 15.

Bounding and summing the three constituent terms allows us to derive the cumulative regret RT . Employing auxiliary results related to the cumulative sums of improvement terms It(λt) and predictive variances σt(λt), as documented in Gupta et al. (2022) and Chowdhury & Gopalan (2017), and setting N T, we attain the bound RT = O( TγT ).

Comparision with previous works. Through appropriate selection of N, our regret bound is denoted as RT = O( TγT ). In contrast, Nguyen & Gupta (2017) established a regret bound of O(γT T log T) for EI with the best-observed value y+ as the incumbent. They further introduced an additional assumption constraining the variance function values to exceed a lower bound κ. Thus, their bound escalates markedly as κ . The work most analogous to ours is Gupta et al. (2022), which also utilizes EI with the standard incumbent µ+ t = minλi At µt 1(λi), and possesses regret bounds of RT = O(γT

T). Our regret bound thus achieves theoretical savings of γT .

Furthermore, It s worth noting that Gupta et al. (2022) assumed conditional R-sub-Gaussian noise, while Nguyen & Gupta (2017) assumed bounded noise. Additionally, both studies assumed that the noise sequence {εt}T t=1 constitutes a martingale difference sequence, signifying that E[εt|ε<t] = 0 for all t T. In comparison, our assumption regarding the noise εt, denoted as the excess risk of the inner-level SGD-optimized model parameters, is εt = Lval(λt, θN λt, Sval) L(λt, θ λt). The conditional expectation E [εt | ε<t] = 0 does not hold here. This is because θN λt depends on the training set Str while θ λt does not. This modeling more aligns closely with practical scenarios when leveraging Bayesian optimization for hyperparameter tuning.

Moreover, prior research has primarily restricted its regret definition to hyperparameters. In contrast, as outlined in Section 3, our approach uniquely defines regret to include both model parameters and hyperparameters. This broader view adds complexity to our theoretical analysis by necessitating the establishment of a sublinear regret bound. Our derived bound offers theoretical validation for the convergence of both model parameters and hyperparameters towards their optimal values.

Practical insights: Inner Unit Horizon as the Same Order of Outer-Level Iterations.. By judiciously selecting the number of inner-level SGD optimization iterations as N T, we attained the regret bound RT = O( TγT ). This further provides practical guidance into better balancing the iterations between inner-level SGD and outer-level Bayesian optimization.

Specifically, if the number of SGD iterations N is substantially less than T, such as O(

T), the regret becomes RT = O(T γT /(τ(T 1/4) T 1/4)). Since γT monotonically increases and τ(z) z monotonically decreases to zero, when T , we have γT and τ(T 1/4) T 1/4 0. Consequently, RT /T = O( γT /(τ(T 1/4) T 1/4)) will exhibit rapid growth. Indeed, this suggests that having too few inner SGD iterations results in a faster increase in the regret bound. In contrast, if N significantly exceeds T, then βT B2, maintaining a constant order instead of diminishing. Hence, the regret RT = O( TγT ) remains unchanged, indicating unnecessary SGD iterations being performed and wasted resources.

In summary, the optimal inner unit horizon N for convex functions scales as N T. Moreover, we demonstrate in Section F.1 that for non-convex functions, the optimal N T 2. This choice of N strikes an effective balance between SGD and Bayesian optimization iterations.

4.2.2 REGRET BOUND FOR BBO WITH THE UCB FUNCTION

Next, we establish the regret bound for the Bayesian bilevel optimization with the UCB acquisition. Theorem 4. Let δ (0, 1). Assume that the true underlying function L(λ, θ λ) in terms of λ lies in the RKHS Hk(Λ) associated with the kernel k (λ, λ ). Additionally, consider the noise term εt = Lval λt, θN λt, Sval L λt, θ λt . Specifically, assume that L ( , θ ) k B and let

Published as a conference paper at ICLR 2024

B2 + σ 2tφ2(N)N 1. By running algorithm 2 with βt and the prior GPΛ (0, k( , )), with probability at least 1 δ, the cumulative regret is bounded as:

(B2 + TN 1) TγT + TN 1

If we select N T, we can obtain that:

Proof Sketch of Theorem 4. This proof approach is similar to Theorem 3, but employs a different acquisition function. The instantaneous regret rt can be decomposed into two terms: rt = L λ+ t , θN λ+ t

L λ+ t , θ λ+ t

+ L λ+ t , θ λ+ t

L (λ , θ λ ). The first term, identical to the first term in Theorem 3, can be established using the same methodology, yielding the bound of O(N 1/2 log N). Through the utilization of Lemma 2 and the selection of λ+ t as specified in Algorithm 2, we can demonstrate that the second term can be upper-bounded as follows: 2βtσt 1 λ+ t . By summing the upper bounds of the two terms mentioned above and taking into account that PT t=1 σt 1 λ+ t = O TγT as demonstrated in Chowdhury & Gopalan (2017), the final result

can be deduced: RT = O p

(B2 + TN 1) TγT + TN 1/2 with probability at least 1 δ. By

selecting N T, we can further establish that: RT = O TγT .

Comparision with previous works. For the UCB, βt plays a crucial role in balancing exploration and exploitation. In previous work, Srinivas et al. (2010) introduced the GP-UCB algorithm with βt = (2B2 + 300γt log3(t/δ))1/2, while a slightly improved variant, the IGP-UCB algorithm, presented in Chowdhury & Gopalan (2017), used βt = B + R p

2 (γt 1 + 1 + ln(1/δ)). In both cases, the balancing coefficient βt tends to increase rapidly with t, especially when the Matérn-ν kernel is used. In practice, this leads to numerous unnecessary explorations in Bayesian optimization (De Ath et al. (2021)). However, in our work, we have chosen the balancing coefficients as βt = p

B2 + σ 2tφ2(N)N 1. As a result, we can appropriately adjust the ratio of the number of iterations N in the inner level to the number of iterations T in the outer level to control βt.

Furthermore, by choosing N T we can obtain a regret bound O B TγT , which is tighter

than the regret bound O B TγT + γT

T associated with Chowdhury & Gopalan (2017). Our assumptions regarding noise are also more aligned with reality. This approach enables us to achieve an improved balance between exploration and exploitation in Bayesian optimization. Moreover, our regret bound provides theoretical assurance for the convergence of both model parameters and hyperparameters to their respective optima, whereas Chowdhury & Gopalan (2017) only guarantees the convergence of hyperparameters.

Practical insights: Flexibility of the Inner Unit Horizon. A choice of N T is also reasonable for the UCB function. If N is significantly larger than T, the coefficient B2 + TN 1 tends to approach B2, remaining constant rather than approaching zero, and the regret bound remains RT = O( TγT ), showing no change. This implies that inner-level SGD involves many futile iterations.

However, it s important to note that, unlike the situation for the EI acquisition function mentioned above, if N is significantly smaller than T, for instance, N = O(

T), then the regret bound becomes RT = O(T 3/4 γT ). For the SE kernel, where γT = O(logd+1(T)), the regret bound RT still exhibits sublinear growth. In the case of the Matérn-ν kernel, with γT = e O(T d 2ν+d ), when 2ν > d, the regret bound RT can also exhibit sublinear growth.

In summary, the UCB acquisition offers greater flexibility in choosing the number of iterations for inner-level SGD iterations compared to EI. With fewer SGD iterations, the noise εt in function evaluations increases. Therefore, our theoretical analysis shows that UCB is more resilient to the influence of noise compared to EI, which aligns with the experimental findings in Hu et al. (2021).

5 EXPERIMENTS

We conducted numerical results in this section. In the inner level, we employ SGD to train a CNN with two convolutional layers and one fully connected layer on the MNIST dataset. In the outer level,

Published as a conference paper at ICLR 2024

Bayesian Optimization uses the EI and UCB functions to adjust hyperparameters like the learning rate. We fix the number of iterations for the outer-level BO and compare the number of iterations for the inner-level SGD under different scenarios, along with their respective convergence outcomes, as detailed in the table below. We initially present the results when utilizing the EI acquisition function.

(1) Setting the number of outer Bayesian optimization steps as 20.

SGD Iteration 100 500 1,000 2,000 3,000 Performance 3.15 0.75 2.73 0.03 2.70 0.04 2.43 0.12 2.46 0.13

(2) Setting the number of outer Bayesian optimization steps as 40.

SGD Iteration 500 1,000 2,000 3,000 4,000 Performance 3.45 0.85 2.42 0.23 2.44 0.55 2.39 0.09 2.51 0.09

(3) Setting the number of outer Bayesian optimization steps as 60.

SGD Iteration 1,000 2,000 3,000 4,000 6,000 Performance 2.58 0.58 2.40 0.35 2.34 0.28 2.10 0.08 2.25 0.17

The experiments are aligned with our theoretical analysis. Fixing the Bayesian optimization s iteration number, the loss function decreases as the SGD steps rise initially, while suboptimal hyperparameters cause high loss. Specifically, with only 20 outer-level iterations, excessively low tuning led to high loss even at high SGD steps. For 40 outer-level iterations and over 1,000 SGD steps, we see signs of overtraining and inadequate tuning. For 60-step outer-level iterations, the loss at over 4,000 SGD steps is less than in the 40-step setting, due to more adequate tuning. Yet, insufficient tuning still occurs at 6,000 SGD steps under the 60-step setting.

Then we present the results when utilizing the UCB acquisition function.

(1) Setting the number of outer Bayesian optimization steps as 20.

SGD Iteration 100 500 1,000 2,000 3,000 Performance 2.93 0.60 2.59 0.47 2.49 0.34 2.39 0.07 2.31 0.13

(2) Setting the number of outer Bayesian optimization steps as 40.

SGD Iteration 500 1,000 2,000 3,000 4,000 Performance 2.56 0.53 2.34 0.45 2.29 0.29 2.27 0.30 2.22 0.16

(3) Setting the number of outer Bayesian optimization steps as 60.

SGD Iteration 1,000 2,000 3,000 4,000 6,000 Performance 2.57 0.27 2.26 0.28 2.23 0.17 2.19 0.19 2.20 0.10

When utilizing the UCB acquisition function for the outer Bayesian optimization, while keeping the number of steps for the outer Bayesian optimization fixed, we gradually increase the number of iterations for the inner SGD optimization. We can observe that initially, as the number of SGD iterations increases, the loss decreases gradually. However, when the number of SGD steps becomes excessively large, the decrease in loss tends to plateau.

6 CONLUSION

Through solid analysis, we derive sublinear regret bounds guaranteeing simultaneous convergence of model parameters and hyperparameters to optimal configurations for generalization capability and providing insights into optimally balancing computations between model training and hyperparameter tuning. This theoretical foundation demonstrates the potential of Bayesian bilevel optimization for non-convex, derivative-free hyperparameter tuning. Future work may enrich this area through empirical studies, alternate acquisitions, and extensions to neural networks.

Published as a conference paper at ICLR 2024

ACKNOWLEDGEMENT

We appreciate Hang Yu from the University of Sydney for his assistance in experiments. This work is supported by NSFC project No. 62222117 and the Fundamental Research Funds for the Central Universities under contracts WK3490000005 and KY2100000117.

A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. Co RR, abs/1012.2599, 2010.

Mohamad Aqib Haqmi Abas, Nurlaila Ismail, NA Ali, S Tajuddin, and Nooritawati Md Tahir. Agarwood oil quality classification using support vector classifier and grid search cross validation hyperparameter tuning. Int. J, 8, 2020.

Fan Bao, Guoqiang Wu, Chongxuan Li, Jun Zhu, and Bo Zhang. Stability and generalization of bilevel programming in hyperparameter optimization. Advances in neural information processing systems, 34:4529 4541, 2021.

Raef Bassily, Vitaly Feldman, Cristóbal Guzmán, and Kunal Talwar. Stability of stochastic gradient descent on nonsmooth convex losses. Advances in Neural Information Processing Systems, 33, 2020.

Yoshua Bengio. Gradient-based optimization of hyperparameters. Neural computation, 12(8): 1889 1900, 2000.

James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of machine learning research, 13(2), 2012.

Ilija Bogunovic, Jonathan Scarlett, Stefanie Jegelka, and Volkan Cevher. Adversarially robust optimization with gaussian processes. Advances in neural information processing systems, 31, 2018.

Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499 526, 2002.

Olivier Bousquet, Yegor Klochkov, and Nikita Zhivotovskiy. Sharper bounds for uniformly stable algorithms. In Conference on Learning Theory, pp. 610 626, 2020.

Adam D Bull. Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research, 12(10), 2011.

Zachary Charles and Dimitris Papailiopoulos. Stability and generalization of learning algorithms that converge to global optima. In International Conference on Machine Learning, pp. 744 753, 2018.

Sayak Ray Chowdhury and Aditya Gopalan. On kernelized multi-armed bandits. In International Conference on Machine Learning, pp. 844 853. PMLR, 2017.

Benoît Colson, Patrice Marcotte, and Gilles Savard. An overview of bilevel optimization. Annals of operations research, 153:235 256, 2007.

George De Ath, Richard M Everson, Alma AM Rahat, and Jonathan E Fieldsend. Greed is good: Exploration and exploitation trade-offs in bayesian optimisation. ACM Transactions on Evolutionary Learning and Optimization, 1(1):1 22, 2021.

Ian Dewancker, Michael Mc Court, and Scott Clark. Bayesian optimization for machine learning: A practical guidebook. ar Xiv preprint ar Xiv:1612.04858, 2016.

Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search: A survey. The Journal of Machine Learning Research, 20(1):1997 2017, 2019.

Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. Forward and reverse gradient-based hyperparameter optimization. In International Conference on Machine Learning, pp. 1165 1173. PMLR, 2017.

Published as a conference paper at ICLR 2024

Luca Franceschi, Paolo Frasconi, Saverio Salzo, Riccardo Grazzi, and Massimiliano Pontil. Bilevel programming for hyperparameter optimization and meta-learning. In International conference on machine learning, pp. 1568 1577. PMLR, 2018.

P. I. Frazier. A tutorial on bayesian optimization. In ar Xiv preprint ar Xiv:1807.02811, 2018.

Peter Frazier, Warren Powell, and Savas Dayanik. The knowledge-gradient policy for correlated normal beliefs. INFORMS Journal on Computing, 21(4):599 613, 2009.

Jie Fu, Hongyin Luo, Jiashi Feng, Kian Hsiang Low, and Tat-Seng Chua. Drmad: distilling reversemode automatic differentiation for optimizing hyperparameters of deep neural networks. ar Xiv preprint ar Xiv:1601.00917, 2016.

Saeed Ghadimi and Mengdi Wang. Approximation methods for bilevel programming. ar Xiv preprint ar Xiv:1802.02246, 2018.

Riccardo Grazzi, Luca Franceschi, Massimiliano Pontil, and Saverio Salzo. On the iteration complexity of hypergradient computation. In International Conference on Machine Learning, pp. 3748 3758. PMLR, 2020.

Sunil Gupta, Santu Rana, Huong Ha, Svetha Venkatesh, et al. Sub-linear regret bounds for bayesian optimisation in unknown search spaces. Advances in Neural Information Processing Systems, 33: 16271 16281, 2020.

Sunil Gupta, Santu Rana, Svetha Venkatesh, et al. Bayesian optimistic optimisation with exponentially decaying regret. In International Conference on Machine Learning, pp. 10390 10400. PMLR, 2021.

Sunil Gupta, Santu Rana, Svetha Venkatesh, et al. Regret bounds for expected improvement algorithms in gaussian process bandit optimization. In International Conference on Artificial Intelligence and Statistics, pp. 8715 8737. PMLR, 2022.

Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, pp. 1225 1234, 2016.

Seyyed Hossein Hasanpour, Mohammad Rouhani, Mohsen Fayyaz, and Mohammad Sabokrou. Lets keep it simple, using simple architectures to outperform deeper and more complex architectures. ar Xiv preprint ar Xiv:1608.06037, 2016.

José Miguel Hernández-Lobato, Matthew W Hoffman, and Zoubin Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. Advances in neural information processing systems, 27, 2014.

Jiabin Hu, Yuze Jiang, Jiayu Li, and Tianyue Yuan. Alternative acquisition functions of bayesian optimization in terms of noisy observation. In Proceedings of the 2021 European Symposium on Software Engineering, pp. 112 119, 2021.

David Janz, David Burt, and Javier González. Bandit optimisation of functions in the matérn kernel rkhs. In International Conference on Artificial Intelligence and Statistics, pp. 2486 2495. PMLR, 2020.

Kaiyi Ji and Yingbin Liang. Lower bounds and accelerated algorithms for bilevel optimization. J. Mach. Learn. Res., 24:22 1, 2023.

Kaiyi Ji, Junjie Yang, and Yingbin Liang. Bilevel optimization: Convergence analysis and enhanced design. In International conference on machine learning, pp. 4882 4892. PMLR, 2021.

Schonlau M. Jones, D. R. and W. J. Welch. Efficient global optimization of expensive black-box functions. In Journal of Global Optimization , 13(4):455 492, 1998.

Kirthevasan Kandasamy, Karun Raju Vysyaraju, Willie Neiswanger, Biswajit Paria, Christopher R Collins, Jeff Schneider, Barnabas Poczos, and Eric P Xing. Tuning hyperparameters without grad students: Scalable and robust bayesian optimisation with dragonfly. The Journal of Machine Learning Research, 21(1):3098 3124, 2020.

Published as a conference paper at ICLR 2024

Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximalgradient methods under the polyak-łojasiewicz condition. In European Conference on Machine Learning, pp. 795 811, 2016.

Yunwen Lei and Ke Tang. Stochastic composite mirror descent: Optimal bounds with high probabilities. In Advance in Neural Information Processing Systems, pp. 1524 1534, 2018.

Yunwen Lei and Yiming Ying. Fine-grained analysis of stability and generalization for stochastic gradient descent. In International Conference on Machine Learning, pp. 5809 5819, 2020.

Yunwen Lei and Yiming Ying. Sharper generalization bounds for learning with gradient-dominated objective functions. In International Conference on Learning Representations, 2021.

Lisha Li, Kevin Jamieson, Giulia De Salvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. The journal of machine learning research, 18(1):6765 6816, 2017.

Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search. ar Xiv preprint ar Xiv:1806.09055, 2018.

Jonathan Lorraine and David Duvenaud. Stochastic hyperparameter optimization through hypernetworks. ar Xiv preprint ar Xiv:1802.09419, 2018.

Jonathan Lorraine, Paul Vicol, and David Duvenaud. Optimizing millions of hyperparameters by implicit differentiation. In International conference on artificial intelligence and statistics, pp. 1540 1552. PMLR, 2020.

Jelena Luketina, Mathias Berglund, Klaus Greff, and Tapani Raiko. Scalable gradient-based tuning of continuous regularization hyperparameters. In International conference on machine learning, pp. 2952 2960. PMLR, 2016.

Matthew Mac Kay, Paul Vicol, Jon Lorraine, David Duvenaud, and Roger Grosse. Self-tuning networks: Bilevel optimization of hyperparameters using structured best-response functions. ar Xiv preprint ar Xiv:1903.03088, 2019.

Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based hyperparameter optimization through reversible learning. In International conference on machine learning, pp. 2113 2122. PMLR, 2015.

Gustavo Malkomes and Roman Garnett. Automating bayesian optimization with bayesian optimization. Advances in Neural Information Processing Systems, 31, 2018.

Vu Nguyen and Gupta. Regret for expected improvement over the best-observed value and stopping condition. In Proceedings of the Ninth Asian Conference on Machine Learning, volume 77 of Proceedings of Machine Learning Research, pp. 279 294. PMLR, 15 17 Nov 2017.

Vu Nguyen and Michael A Osborne. Knowing the what but not the where in bayesian optimization. In International Conference on Machine Learning, pp. 7317 7326. PMLR, 2020.

Vu Nguyen, Sunil Gupta, Santu Rane, Cheng Li, and Svetha Venkatesh. Bayesian optimization in weakly specified search space. In 2017 IEEE International Conference on Data Mining (ICDM), 2017.

Vu Nguyen, Sebastian Schulze, and Michael Osborne. Bayesian optimization for iterative learning. Advances in Neural Information Processing Systems, 33:9361 9371, 2020.

Michael Osborne and Michael Alan Osborne. Bayesian Gaussian processes for sequential prediction, optimisation and quadrature. Ph D thesis, Oxford University, UK, 2010.

Fabian Pedregosa. Hyperparameter optimization with approximate gradient. In International conference on machine learning, pp. 737 746. PMLR, 2016.

Chao Qin, Diego Klabjan, and Daniel Russo. Improving the expected improvement algorithm. Advances in Neural Information Processing Systems, 30, 2017.

Published as a conference paper at ICLR 2024

Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian processes for machine learning. Adaptive computation and machine learning. 2006.

Ilya O Ryzhov. On the convergence rates of expected improvement methods. Operations Research, 64(6):1515 1528, 2016.

Jonathan Scarlett, Ilija Bogunovic, and Volkan Cevher. Lower bounds on regret for noisy gaussian process bandit optimization. In Conference on Learning Theory, pp. 1723 1742. PMLR, 2017.

Warren Scott, Peter Frazier, and Warren Powell. The correlated knowledge gradient for simulation optimization of continuous parameters using gaussian process regression. SIAM Journal on Optimization, 21(3):996 1026, 2011.

Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine learning algorithms. Advances in neural information processing systems, 25, 2012.

Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Md. Mostofa Ali Patwary, Prabhat, and Ryan P. Adams. Scalable bayesian optimization using deep neural networks. In International Conference on Machine Learning, 2015.

Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML 10, pp. 1015 1022, 2010.

Sattar Vakili, Kia Khezeli, and Victor Picheny. On information gain and regret bounds in gaussian process bandits. In International Conference on Artificial Intelligence and Statistics, pp. 82 90. PMLR, 2021.

Michal Valko, Nathaniel Korda, Rémi Munos, Ilias Flaounas, and Nelo Cristianini. Finite-time analysis of kernelised contextual bandits. ar Xiv preprint ar Xiv:1309.6869, 2013.

Emmanuel Vazquez and Julien Bect. Convergence properties of the expected improvement algorithm with fixed mean and covariance functions. Journal of Statistical Planning and inference, 140(11): 3088 3095, 2010.

A Helen Victoria and Ganesh Maragatham. Automatic tuning of hyperparameters using bayesian optimization. Evolving Systems, 12:217 223, 2021.

Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019.

Ziyu Wang and Nando de Freitas. Theoretical analysis of bayesian optimisation with unknown gaussian process hyper-parameters. ar Xiv preprint ar Xiv:1406.7758, 2014.

Aaron Wilson, Alan Fern, and Prasad Tadepalli. Using trajectory data to improve bayesian optimization for reinforcement learning. The Journal of Machine Learning Research, 15(1):253 282, 2014.

Jia Wu, Xiu-Yun Chen, Hao Zhang, Li-Dong Xiong, Hang Lei, and Si-Hao Deng. Hyperparameter optimization for machine learning models based on bayesian optimization. Journal of Electronic Science and Technology, 17(1):26 40, 2019.

Li Yang and Abdallah Shami. On hyperparameter optimization of machine learning algorithms: Theory and practice. Neurocomputing, 415:295 316, 2020.

Dawei Zhan and Huanlai Xing. Expected improvement for expensive optimization: a review. Journal of Global Optimization, 78(3):507 544, 2020.

Published as a conference paper at ICLR 2024

A BAYESIAN BILEVEL OPTIMIZATION ALGORITHMS

In this section, we will present the algorithms employed in this study. To commence, we will introduce Bayesian bilevel optimization utilizing the expected improvement acquisition function as follows.

Algorithm 1: Bayesian bilevel Optimization with EI acquisition Input: Place a Gaussian process prior on L(λ, θ λ)

1 for t = 0, 1, ..., T 1 do

2 for j = 0, 1, ..., N 1 do

3 Update model parameters using SGD: θj+1 λt = θj λt ηj θℓ(λt, θj λt, ztr jt )

5 Observe yt = Lval(λt, θN λt, Sval) and Update posterior distribution on L using all data.

6 Select the next hyperparameter λt+1 by maximizing the acquisition function EIt(λ) with the incumbent µ+ t = minλi At µt 1 (λi).

7 λt+1 = argmaxλ E max µ+ t L (λ, θ λ) , 0 | H1:t

9 Return points λ+ t = argmin1 i t µt 1 (λi) and θN λ+ t for all 1 t T.

Next, we will introduce Bayesian bilevel optimization using the upper confidence bound acquisition function.

Algorithm 2: Bayesian bilevel optimization with UCB acquisition Input: Place a Gaussian process prior on L(λ, θ λ)

1 for t = 0, 1, ..., T 1 do

2 for j = 0, 1, ..., N 1 do

θj+1 λt = θj λt ηj θℓ λt, θj λt, ztr jt .

5 yt represents the observation of L(λt, θ λt), where yt = Lval(λt, θN λt, Sval).

6 Update the posterior probability distribution on L using all available data.

7 Select the next hyperparameter λ+ t+1 by maximizing the acquisition function UCBt(λ).

λt+1 = argmaxλ ( µt(λ) + βt+1σt(λ))

9 Return points λ+ t = λt and θN λ+ t for all 1 t T.

B PROOF OF THEOREM 1

At the inner level, we keep the hyperparameters λ fixed and treat them as constants, while we employ SGD to train and optimize the model parameters θ. To establish an upper bound on the excess risk Lval(λ, θN λ , Sval) L(λ, θ λ), we introduce an effective decomposition method:

Lval λ, θN λ , Sval L (λ, θ λ) = Lval λ, θN λ , Sval Ltr λ, θN λ , Str

| {z } Term 1 + Ltr λ, θN λ , Str Ltr λ, θ λ, Str

| {z } Term 2

+ Ltr λ, θ λ, Str L (λ, θ λ) | {z } Term 3

Next, we proceed to upper bound the excess risk by individually upper bounding Term 1, Term 2, and Term 3.

Upper Bounding Term 1

Directly upper bounding Term 1 is challenging because there is no direct connection between the validation error Lval λ, θN λ , Sval and the training error Ltr λ, θN λ , Str . However, we can

Published as a conference paper at ICLR 2024

observe that both the validation set Sval and the training set Str are drawn from the distribution D. Therefore, by using the expected error L(λ, θN λ ) = Ez D[ℓ(λ, θN λ , z)] as a bridge, we can establish a connection between the validation error and training error. This allows us to decompose Lval λ, θN λ , Sval Ltr λ, θN λ , Str into:

Lval λ, θN λ , Sval Ltr λ, θN λ , Str

= Lval λ, θN λ , Sval L(λ, θN λ ) | {z } Term 4

+ L(λ, θN λ ) Ltr λ, θN λ , Str

| {z } Term 5

Lemma 5. Let the validation error be Lval(λ, θN λ , Sval) = 1

m Pm i=1 ℓ(λ, θN λ , zval i ). Similarly, let the training error be Ltr(λ, θN λ , Str) = 1

n Pn i=1 ℓ(λ, θN λ , ztr i ). By choosing N n m and η N 1

2 , we can establish that, with a probability of at least 1 δ,

Lval λ, θN λ , Sval Ltr λ, θN λ , Str = O N 1

To prove Lemma 5, we provide the definition of stability below, along with some useful concentration inequalities.

We use θ(S) to represent the model obtained when applying an algorithm A to the dataset S. Note that we omit the dependency of the notation on A, which should be clear from the context. Additionally, we omit the hyperparameters λ since we keep them fixed and treat them as constants at the inner level. Two datasets are considered neighboring datasets if they differ by only one example.

Definition 3. (Uniform Stability Bousquet & Elisseeff (2002)) A randomized algorithm A is ϵ uniformly stable if for all neighboring datasets S, S D we have supz [ℓ(θ(S), z) ℓ(θ(S ), z)] ϵ.

Lemma 6 (Mc Diarmid s Inequality). Consider independent random variables Z1, , Zn Z and a mapping ϕ : Zn R. If, for all i {1, , n}, and for all z1, , zn, z i Z, the function ϕ satisfies

|ϕ (z1, , zi 1, zi, zi+1, , zn) ϕ (z1, , zi 1, z i, zi+1, , zn)| c,

P (|ϕ (Z1, , Zn) Eϕ (Z1, . . . , Zn) t|) 2 exp 2t2

Furthermore, for any δ (0, 1) the following inequality holds with probability at least 1 δ

|ϕ (Z1, . . . , Zn) E [ϕ (Z1, . . . , Zn)]| c p

Lemma 7 (Chernoff s Bound (Wainwright (2019)). Let X1, . . . , Xt be independent random variables taking values in {0, 1}. Let X = Pt j=1 Xj and µ = E[X]. Then for any δ > 0 with probability at least 1 exp µ δ2/(2 + δ) we have X (1 + δ)µ. Furthermore, for any δ (0, 1) with probability at least 1 δ we have

X µ + log(1/δ) + p

2µ log(1/δ).

Lemma 8 (Generalization Error (Bousquet et al. (2020))). We employ θ(S) to represent the model obtained when applying algorithm A to the dataset S. Under the uniform stability condition (3) with parameter ϵ and the uniform boundedness of the loss function ℓ( , ) M, we have that for any δ (0, 1), with a probability of at least 1 δ:

L (θ(S)) Ltr (θ(S)) = O

ϵ log n log 1

Proof of lemma 5. For the sake of brevity in the proof, we omit the hyperparameters λ. Consider two samples, S and S , each consisting of n data points, differing by only a single example. As we run SGD on these samples, we generate gradient updates θ1, . . . , θN and θ 1, . . . , θ N for samples S and

Published as a conference paper at ICLR 2024

S , respectively. Now, let s assume, without loss of generality, that the difference between S and S is solely attributed to the first example. Applying the Lipschitz condition to ℓ( , z), we obtain:

|ℓ(θN, z) ℓ(θ N, z)| KδN, (B.3)

where δN = θN θ N . It s worth noting that during the j-th iteration, there is a probability of (n 1)/n that the sample selected by SGD is the same in both sets S and S . The properties of convexity and γ-smoothness, as outlined in (Hardt et al. (2016)), establish that for any pair of θ and θ , the following holds:

ℓ(θ, z) ℓ(θ , z), θ θ 1

γ ℓ(θ, z) ℓ(θ , z) 2.

In the event that jt is not equal to 1, based on the inequality ηj 2/γ we can deduce that: θj+1 θ j+1 2

= θj θ j 2 2ηj ℓ(θj, zjt) ℓ θ j, z jt , θj θ j + η2 j ℓ(θj, zjt) ℓ θ j, z jt 2

θj θ j 2 2ηj

ℓ(θj, zjt) ℓ θ j, z jt 2

θj θ j 2 . (B.4)

In the scenario where the probability of selecting a different example is 1/n, i.e., jt = 1, we can utilize the triangle inequality and the previously mentioned inequality to establish the following: θj+1 θ j+1

= θj ηj ℓ(θj, zjt) θ j ηj ℓ θ j, zjt ) + ηj ℓ θ j, z jt ℓ θ j, zjt

θj θ j + 2ηj K.

By merging the two cases mentioned above, we can conclude that for every j: θj+1 θ j+1 θj θ j + 2ηj KI[jt=1],

where I represents the indicator function. By solving the recursive inequality, we obtain:

θj+1 θ j+1 2K

j=1 ηj I[jt=1] 2Kη

j=1 I[jt=1]. (B.5)

Invoking Lemma 7 whereby Xj = I[jt=1] and µ = N/n, we obtain the subsequent inequality that holds with a probability of at least 1 δ:

j=1 I[jt=1] N/n + log(1/δ) + p

2Nn 1 log(1/δ).

Consequently, with a probability at least 1 δ, the following inequality is satisfied: θj+1 θ j+1 2Kη(N/n + log(1/δ) + p

2Nn 1 log(1/δ)).

Then we derive the ensuing inequality with a probability of at least 1 δ:

θ(S) θ(S ) 2Kη(N/n + log(1/δ) + p

2Nn 1 log(1/δ)).

Substituting the inequality back into Eq. B.3, we arrive at the following result holding with a probability of at least 1 δ:

|ℓ(θ(S); z) ℓ(θ(S ); z)| 2K2η(N/n + log(1/δ) + p

2Nn 1 log(1/δ)). (B.6)

Subsequently, we can bound Term 5 by applying lemma 8 with a stability parameter ϵ = 2K2η(N/n+ log(1/δ) + p

2Nn 1 log(1/δ)). With a probability of at least 1 δ, we obtain:

L λ, θN λ Ltr λ, θN λ , Str = O ηNn 1 log n log (1/δ) + M p

n 1 log (1/δ) . (B.7)

Published as a conference paper at ICLR 2024

Next, we begin by addressing Term 4, denoted as Lval(λ, θN λ , Sval) L(λ, θN λ ), where Sval = {zval i }m i=1. Notably, θN λ is obtained by training on Str, which is independent of Sval. Then, we have:

L λ, θN λ = E[Lval λ, θN λ , Sval ].

Furthermore, since ℓ(λ, θ, z) is uniformly bounded by M, we can observe that for Sval and Sval

differing by only one example, we have

Lval λ, θN λ , Sval Lval λ, θN λ , Sval = 1

i=1 ℓ λ, θN λ , zval i

i=1 ℓ λ, θN λ , zval i !

Then we can apply lemma 6 with c = M

m to obtain, with probability at least 1 δ,

|Lval λ, θN λ , Sval L λ, θN λ | M

1 2m log(2/δ). (B.8)

By substituting inequalities B.8 and B.7 into equation B.2, we can establish that, with a probability of at least 1 δ,

Lval λ, θN λ , Sval Ltr λ, θN λ , Str

= O ηNn 1 log n log(1/δ) + M p

n 1 log(1/δ) + M p

m 1 log(2/δ) .

By the choice of N n m and η N 1

2 , we can deduce that, with a probability of at least 1 δ,

Lval λ, θN λ , Sval Ltr λ, θN λ , Str = O N 1

2 log N . (B.9)

Upper Bounding Term 2

We introduce the following lemma to bound Term 2, which quantifies the optimization error. Lemma 9 (Optimization Error (Lei & Tang, 2018)). Assuming that the function ℓ(θ) is convex and L-Lipschitz continuous with respect to θ, we run SGD with step sizes ηt = η 1

N . Then, with a probability of at least 1 δ, we can establish that:

Ltr(θ) inf θ Ltr(θ) = O(N 1

2 log 3 2 (N/δ)).

Lemma 10. Select δ (0, 1). Set the step size of SGD as ηt = η 1

N . Then, with a probability of 1 δ, we achieve:

Ltr λ, θN λ , Str Ltr λ, θ λ, Str = O N 1

2 log 3 2 (N/δ) .

Proof of Lemma 10. We should take note that, based on equality 3.1, with θ λ = arg min θ Θ L(λ, θ), we

can derive Ltr λ, θ λ, Str inf θ Ltr(λ, θ, Str).

From lemma 9, we can obtain

Ltr λ, θN λ , Str inf θ Ltr(λ, θ, Str) = O N 1

2 log 3 2 (N/δ) .

Combining the above two inequalities, we derive

Ltr λ, θN λ , Str Ltr λ, θ λ, Str

= Ltr λ, θN λ , Str inf θ Ltr λ, θ, Str + inf θ Ltr λ, θ, Str Ltr λ, θ λ, Str

2 log 3 2 (N/δ) .

Published as a conference paper at ICLR 2024

Upper Bounding Term 3 Lemma 11. Choose a value for δ from the interval (0, 1). Then, with a probability of 1 δ, we obtain: Ltr λ, θ λ, Str L (λ, θ λ) = O N 1

Proof of lemma 11. It is worthy of note that θ λ is independent of Str. Moreover, as ℓ(λ, θ, z) is uniformly bounded by M, we can discern that for Str and Str differing by only one example, there exists:

Ltr λ, θ λ, Str Ltr λ, θ λ, Str = 1

i=1 ℓ λ, θ λ, ztr i

i=1 ℓ λ, θ λ, ztr i !

Consequently, we can invoke Lemma 6 with c = M

n to attain, with a probability of at least 1 δ:

Ltr λ, θ λ, Str L (λ, θ λ) M

1 2n log(2/δ).

By the choice of N n, we can deduce that, with a probability of at least 1 δ,

Ltr λ, θ λ, Str L (λ, θ λ) = O N 1

Through the aforementioned handling of Terms 1, 2, and 3, we can ultimately derive the excess risk bound that we are interested in.

Proof of Theorem 1. Combining Term 1 in lemma 5, Term 2 in lemma 10 and Term 3 in lemma 11 with the decomposition B.1, by the choice of N n m and η N 1

2 , we can deduce that, with a probability of at least 1 δ,

Lval λ, θN λ , Sval L (λ, θ λ) =

O K2 log N log(1/δ) + M p

log(1/δ) + M p

log(2/δ) + log3/2(N/δ) N 1/2

When we neglect constants like K, M, and δ, we can derive the result:

Lval λ, θN λ , Sval L (λ, θ λ) = O N 1

2 log 3 2 N .

The proof is complete.

C PROOF OF LEMMA 2

In previous works, the assumptions regarding the noise εt differ from those in our work. Srinivas et al. (2010) and Nguyen & Gupta (2017) assume that the noise is uniformly bounded by σ, whereas Chowdhury & Gopalan (2017) and Gupta et al. (2022) assume that the noise is conditionally R-sub Gaussian. Furthermore, most prior research assumes that the noise sequence {εt}T t=1 is a martingale difference sequence. In contrast, our noise follows εt = Lval λt, θN λt, Sval L λt, θ λt , which clearly does not align with the assumptions made in previous works. Consequently, we cannot directly apply theorems similar to those in previous research that express the concentration of the posterior mean around the true objective function.

We follow the framework in Srinivas et al. (2010) to prove lemma 2.

Proof of lemma 2. Let s revisit the posterior mean function denoted as µt( ) and the posterior covariance function denoted as kt( , ) as discussed in Section 3.2.2, taking into account the data points (λi, yi) for i = 1, ..., t. It becomes apparent that the RKHS norm associated with the kernel kt can be expressed as:

L( , θ ) 2 kt = L( , θ ) 2 k + σ 2 t X

i=1 L λi, θ λi 2 . (C.1)

Published as a conference paper at ICLR 2024

This suggests that Hk(Λ) and Hkt(Λ) are equivalent for any t. Given that the reproducing property holds, i.e., L( , θ ), kt( , λ) kt = L(λ, θ λ) for any L( , θ ) Hkt(Λ), then,

|µt(λ) L(λ, θ λ)| kt(λ, λ)1/2 µt L( , θ ) kt = σt(λ) µt L( , θ ) kt . (C.2)

Recall that the posterior mean function is given by:

µt(λ) = kt(λ)T Kt + σ2I 1 yt.

Then, we also have:

µt, L( , θ ) k = Lt(λ, θ λ)T Kt + σ2I 1 yt, (C.3)

where Lt(λ, θ λ) = [L(λ1, θ λ1), . . . , L(λt, θ λt)]T . Additionally, the RKHS norm of µt is given by:

µt 2 k = y T t Kt + σ2I 1 yt σ2 Kt + σ2I 1 yt 2. (C.4)

Moreover, for i t, we obtain:

µt(λi) = δT i,t Kt Kt + σ2I 1 yt

= δT i,t(Kt + σ2I) Kt + σ2I 1 yt σ2δT i,t Kt + σ2I 1 yt

= yi σ2δT i,t Kt + σ2I 1 yt. (C.5)

where δi,t represents a t-dimensional vector where only the i-th element is equal to 1, while all other elements are set to 0. Then, referring to Equation C.1, we can derive:

µt L( , θ ) 2 kt = µt L( , θ ) 2 k + σ 2 t X

i=1 (µt(λi) L λi, θ λi )2.

= µt 2 k + L( , θ ) 2 k 2 µt, L( , θ ) k + σ 2 t X

i=1 (µt(λi) L λi, θ λi )2.

The noise sequence is εt = {εi}t i=1 with εi = yi L λi, θ λi . Taking into account equation C.5, we can express this as follows:

i=1 (µt(λi) L λi, θ λi )2 = σ 2 t X

i=1 (εi σ2δT i,t Kt + σ2I 1 yt)2

σ2 Kt + σ2I 1 yt 2 2εT t Kt + σ2I 1 yt + σ 2 εt 2

Considering L( , θ ) k is bounded by B, the upper bound of the noise from Theorem 1 and the positive definiteness of Kt + σ2I, we can substitute equations C.3, C.4, and C.7 into equation C.6 to derive, with a probability at least 1 δ:

µt L( , θ ) 2 kt = L( , θ ) 2 k y T t Kt + σ2I 1 yt + σ 2 εt 2

B2 + σ 2tφ2(N)N 1. (C.8)

Combining inequalities C.2 and C.8, we obtain that

|µt(λ) L (λ, θ λ) | p

B2 + σ 2tφ2(N)N 1σt(λ).

The proof is completed.

Published as a conference paper at ICLR 2024

D PROOF OF THEOREM 3

First, we will bound the instantaneous regret, given by rt = L λ+ t , θN λ+ t

L (λ , θ λ ). Next, we

will aim to find an upper bound for the sum RT = PT t=1 rt.

To derive an upper bound for rt, we decompose it into three constituent terms as follows:

rt = L λ+ t , θN λ+ t

L (λ , θ λ )

= L λ+ t , θN λ+ t

L λ+ t , θ λ+ t

| {z } Term 6

+ L λ+ t , θ λ+ t

µ+ t | {z } Term 7

+ µ+ t L (λ , θ λ ) | {z } Term 8

We introduce the function τ(z) = zΦ(z) + ϕ(z), where Φ and ϕ represent the cumulative distribution function (c.d.f.) and probability density function (p.d.f.) of the standard normal distribution, respectively. Now, we define the function It(λ) as follows: It(λ) = max µ+ t L (λ, θ λ) , 0 . In this expression, It(λ) assumes positive values when the prediction is lower than the minimum GP mean observed so far up to that point. For all other cases, It(λ) is set to zero. The process of finding the new query point involves maximizing the expected improvement:

λt = argmaxλ E (It(λ)) .

Upper Bounding Term 6. Lemma 12. Choose δ (0, 1) and N m. Then with probability at least 1 δ, we have

L λ+ t , θN λ+ t

L λ+ t , θ λ+ t

where α(N) = O log 3 2 N .

Proof of lemma 12. We decompose Term 6 into two constituent terms.

L λ+ t , θN λ+ t

L λ+ t , θ λ+ t

= L λ+ t , θN λ+ t

Lval(λ+ t , θN λ+ t , Sval) + Lval(λ+ t , θN λ+ t , Sval) L λ+ t , θ λ+ t

In regard to L λ+ t , θN λ+ t

Lval λ+ t , θN λ+ t , Sval , it is noteworthy that θN λ+ t is derived through

training on the dataset Str, which remains independent of the dataset Sval. Consequently, we can establish the following relationship:

L λ+ t , θN λ+ t

= E h Lval λ+ t , θN λ+ t , Sval i .

Moreover, given that ℓ(λ, θ, z) is uniformly bounded by M, when considering datasets Sval and Sval , which differ by only a single example, the following assertion can be made:

Lval λ+ t , θN λ+ t , Sval Lval λ+ t , θN λ+ t , Sval

i=1 ℓ λ+ t , θN λ+ t , zval i

i=1 ℓ λ+ t , θN λ+ t , zval i !

Then we can apply lemma 6 with c = M

m to obtain, with probability at least 1 δ,

Lval λ+ t , θN λ+ t , Sval L λ+ t , θN λ+ t

1 2m log(2/δ).

As for the second term Lval λ+ t , θN λ+ t , Sval L λ+ t , θ λ+ t

, we can apply the result from Theorem 1 to obtain: Lval λ+ t , θN λ+ t , Sval L λ+ t , θ λ+ t

Published as a conference paper at ICLR 2024

where φ(N) = O log 3 2 N . Then, combining the above two inequalities and choosing N m, we can deduce that, with a probability of at least 1 δ,

L λ+ t , θN λ+ t

L λ+ t , θ λ+ t

where α(N) = O log 3 2 N . The proof is complete.

Upper Bounding Term 7.

When addressing Terms 7 and 8, we adopt the proof framework outlined in Gupta et al. (2022), with the ability to adapt and apply certain lemmas from Gupta et al. (2022) to our proof following appropriate adjustments.

Lemma 13. (Gupta et al. (2022)) Choose a value for δ from the interval (0, 1). Then, with a probability of 1 δ, we obtain

L λ+ t , θ λ+ t

2π (βt + 1) σt 1 (λt) +

Upper Bounding Term 8.

Initially, we establish both the lower and upper bounds for the acquisition function. The following lemma draws inspiration from Lemma 9 in Wang & de Freitas (2014).

Lemma 14. Choose δ (0, 1). For λ Λ, t T, define It(λ) = max 0, µ+ t L (λ, θ λ) . Then with probability at least 1 δ we have

It(λ) βtσt 1(λ) EIt(λ) It(λ) + (βt + 1) σt 1(λ).

Proof of lemma 14. If σt 1(λ) = 0, then EIt(λ) = It(λ) = 0, which leads to a trivial result. Now,

let s assume that σt 1(λ) > 0. Define q = µ+ t L(λ,θ λ) σt 1(λ) and u = µ+ t µt 1(λ)

σt 1(λ) . We can then establish that: EIt(λ) = σt 1(λ)τ (u) .

According to lemma 2, we can conclude that |u q| βt with a probability of at least 1 δ. Since τ (z) > 0, τ is a non-decreasing function, and for z > 0, τ(z) 1 + z. Therefore, we have:

EIt(λ) σt 1(λ)τ (max{0, q} + βt) σt 1(λ) (max{0, q} + βt + 1) = It(λ) + (βt + 1) σt 1(λ).

If It(λ) = 0, the lower bound is straightforward as EIt(λ) is non-negative. Let s consider the case where It(λ) > 0. Since EIt(λ) 0 and τ(z) 0 for all z, and τ(z) = z + τ( z) z. Therefore, we can write:

EIt(λ) σt 1(λ)τ (q βt) σt 1(λ) (q βt) = It(λ) βtσt 1(λ).

The proof is now complete.

Lemma 15. Choose δ (0, 0.5). Then, with a probability of at least 1 2δ, we obtain:

µ+ t L (λ , θ λ ) 1 + βt τ(βt) βt

(It (λt) + (βt + 1) σt 1 (λt)) .

Proof of lemma 15. If σt 1(λ ) = 0, as per the definition of EIt(λ), we find that EIt(λ ) = It(λ ) = 0, which leads to a trivial result. Now, let s consider the case where σt 1(λ ) > 0. If L (λ , θ λ ) > µ+ t , the lemma remains trivial. We will now explore the case where L (λ , θ λ ) µ+ t .

Published as a conference paper at ICLR 2024

Based on lemma 2, we know that L(λ , θ λ ) µt 1(λ ) βtσt 1(λ ) with a probability of 1 δ. Combining this with the fact that L(λ , θ λ ) µ+ t , we can conclude that with a probability of 1 δ, the following holds: µ+ t µt 1 (λ )

σt 1 (λ ) βt.

Recalling the definition of the EI acquisition function, with a probability of at least 1 δ, we can conclude that:

EIt (λ ) = σt 1 (λ ) τ µ+ t µt 1 (λ )

σt 1 (λ ) τ ( βt) .

Combining the above inequality with the previously established result EIt (λ ) It (λ ) βtσt 1 (λ ) as proven in Lemma 14, we obtain:

( βt τ( βt) + 1)EIt (λ ) It (λ ) .

Using the above inequality in conjunction with the fact that τ(z) = z + τ( z) for z = βt, we obtain

It (λ ) τ(βt) τ(βt) βt

EIt (λ ) 1 + βt τ(βt) βt

EIt (λ ) . (D.3)

In accordance with Algorithm 1, λt = argmax EIt(λ). When combined with the upper bound from Lemma 14, we can deduce:

EIt (λ ) EIt (λt) It (λt) + (βt + 1) σt 1 (λt) (D.4)

Utilizing the inequalities D.3 and D.4, we obtain

µ+ t L (λ , θ λ ) It (λ )

1 + βt τ(βt) βt

1 + βt τ(βt) βt

(It (λt) + (βt + 1) σt 1 (λt)) . (D.5)

The proof of lemma 15 is complete.

Upper bounding the immediate regret. rt = L λ+ t , θN λ+ t

L (λ , θ λ ). Combining lemma 12 for Term 6, lemma 13 for Term 7 and lemma 15 for Term 8, we can conclude with high probability that:

rt 1 + βt τ(βt) βt +

[It(λt) + (βt + 1) σt 1 (λt)] + α(N)N 1

Upper bounding the cumulative regret. RT = PT t=1 rt. Based on the above inequality, we can derive an upper bound for the cumulative regret as follows:

t=1 rt 1 + βT τ(βT ) βT +

| {z } Term 9

t=1 σt 1 (λt)

| {z } Term 10

We now introduce lemma 16 and lemma 17 from Gupta et al. (2022) and Chowdhury & Gopalan (2017) to assist in bounding Term 9 and Term 10. Lemma 16. Gupta et al. (2022) Select δ from the interval (0, 1). Then, with a probability of at least 1 δ, we can establish that: T X

t=1 It(λt) = O βT p

Published as a conference paper at ICLR 2024

Lemma 17. Chowdhury & Gopalan (2017) Let λ1, . . . , λt represent the points selected by algorithm 1. The sum of the predictive standard deviations at these points can be expressed in terms of the maximum information gain. To be more precise:

t=1 σt 1 (λt) = O p

Ultimately, based on the above analysis, we can demonstrate Theorem 3 as follows.

Proof of Theorem 3. By combining the results in Lemma 16 and Lemma 17 with Inequality D.7, we derive an upper bound for the cumulative regret PT t=1 rt as follows:

t=1 rt = O β2 T TγT τ(βT ) βT + TN 1

where βT = p

B2 + σ 2Tφ2(N)N 1. If we select N T, we can deduce that:

t=1 rt = O p

TγT . (D.9)

E PROOF OF THEOREM 4

In this section, we present the proof for Theorem 4.

Proof of Theorem 4. To derive an upper bound for rt, we decompose it into two constituent terms as follows:

rt = L λ+ t , θN λ+ t

L (λ , θ λ )

= L λ+ t , θN λ+ t

L λ+ t , θ λ+ t

| {z } Term 11

+ L λ+ t , θ λ+ t

L (λ , θ λ ) | {z } Term 12

From inequality D.2, we can derive an upper bound for Term 11:

L λ+ t , θN λ+ t

L λ+ t , θ λ+ t

Where α(N) = O(log3/2 N). Notice that at each round t 1, based on the selection of λ+ t in Algorithm 2, i.e., λ+ t = argmaxλ ( µt 1(λ) + βtσt 1(λ)). we have

µt 1(λ ) + βtσt 1(λ ) µt 1(λ+ t ) + βtσt 1(λ+ t ).

From lemma 2, we have

L (λ , θ λ ) µt 1(λ ) + βtσt 1(λ ),

L λ+ t , θ λ+ t

µt 1(λ+ t ) βtσt 1(λ+ t ).

Utilizing the above three inequalities, we can derive an upper bound for Term 12:

L λ+ t , θ λ+ t

L (λ , θ λ ) L λ+ t , θ λ+ t

µt 1(λ ) + βtσt 1(λ )

L λ+ t , θ λ+ t

µt 1(λ+ t ) + βtσt 1(λ+ t )

2βtσt 1(λ+ t ). (E.3)

Published as a conference paper at ICLR 2024

By combining the upper bound E.2 for Term 11 and the upper bound E.3 for Term 12 with Equation E.1, we obtain

t=1 σt 1 λ+ t + Tα(N)N 1

From lemma 17, we have PT t=1 σt 1 λ+ t = O TγT . Additionally, according to the definition, we have βT = p

B2 + σ 2Tφ2(N)N 1. Hence with probability at least 1 δ,

(B2 + TN 1)TγT + TN 1

If we select N T, we can obtain that:

TγT . (E.5)

F EXTENSIONS

In this section, we present some extensions of our analyses. We consider two extensions: extension to non-convex scenarios and extension to non-smooth scenarios.

F.1 EXTENSION TO NON-CONVEX SCENARIOS

In this section, we extend Theorem 1 to non-convex loss functions as follows. Theorem 18. Suppose that the function ℓ(λ, θ, z) is K-Lipschitz and γ-smooth with respect to θ, uniformly bounded by M, and also satisfies 1

2 θℓ(λ, θ, z) 2 µ(ℓ(λ, θ, z) ℓ(λ, θ λ, z)), µ > 0. We perform SGD with step sizes ηj = 2j+1 2µ(j+1)2 on a sample Str drawn from the distribution D at

the inner level. Let Sval = zval i m i=1 represent the validation set drawn from the distribution D. Suppose γ Nµ/4. Choose N n m. Then, with a probability of at least 1 δ, we have:

Lval(λ, θN λ , Sval) L(λ, θ λ) = O N 1

As a comparison, if ℓ(λ, θ, z) is convex and ηj N 1

2 , then we have:

Lval(λ, θN λ , Sval) L(λ, θ λ) = O N 1

Proof of Theorem 18. To establish an upper bound of the excess risk of the inner-level SGD, denoted as Lval λ, θN λ , Sval L (λ, θ λ), we introduce a highly effective decomposition method:

Lval λ, θN λ , Sval L (λ, θ λ) = Lval λ, θN λ , Sval L(λ, θN λ ) + L(λ, θN λ ) L (λ, θ λ) . (F.1) As elucidated in Remark 5 of Lei & Ying (2021), based on the pointwise hypothesis stability analysis and the optimization error bound in Karimi et al. (2016), it was shown with probability at least 1 δ (Charles & Papailiopoulos (2018)) that

L(λ, θN λ ) L (λ, θ λ) = O 1 nµδ + 1

N 1 4 µ 3 4 δ 1 2

From the equation B.7 in the proof of Theorem 1 in Section B, we obtain

L λ, θN λ Ltr λ, θN λ , Str = O ηNn 1 log n log (1/δ) + M p

n 1 log (1/δ) . (F.3)

Selecting N m n, we obtain

Lval(λ, θN λ , Sval) L(λ, θ λ) = O N 1

The proof is complete.

Practical insights: Inner Unit Horizon as the Square of Outer-Level Iterations for Non-convex functions. Unlike the case with convex, for the non-convex situation, the optimal number of innerlevel SGD optimization iterations is N T 2. Hence, we attained the regret bound RT = O( TγT ). .

Published as a conference paper at ICLR 2024

F.2 EXTENSION TO NON-SMOOTH SCENARIOS

In this section, we extend Theorem 1 to non-smooth loss functions as follows. We remove the smoothness assumption by introducing an additional term η

N in the excess risk bound. Moreover, we adopt alternative criteria, including weaker conditions, such as defining α-Hölder continuity as: Definition 4. Let γ > 0, α (0, 1]. We say θℓis α-Hölder continuous if for any θ1, θ2, z, λ,

θℓ(λ, θ1, z) θℓ(λ, θ2, z) γ θ1 θ2 α 2 .

The theorem is updated as follows

Theorem 19. Suppose that the function ℓ(λ, θ, z) is K-Lipschitz continuous, and convex with respect to θ, uniformly bounded by M. Let p > 1

2. We perform SGD with step sizes ηj = η N p on a sample Str drawn from the distribution D at the inner level. Let Sval = zval i m i=1 represent the validation set drawn from the distribution D. Choose N n m. Then, with a probability of at least 1 δ, we have:

Lval(λ, θN λ , Sval) L(λ, θ λ) = O N 1 2 p log(N) + N p log(N) + N 1

Furthermore, if ℓ(λ, θ, z) is α-Hölder continuous, then we have:

Lval(λ, θN λ , Sval) L(λ, θ λ) = O N 1 p 1 α log(N) + N p log(N) + N 1

As a comparison, if ℓ(λ, θ, z) is smooth, then we have:

Lval(λ, θN λ , Sval) L(λ, θ λ) = O N p log(N) + N 1

Proof of Theorem 19. Similar to the proof of Theorem 1 in Section B, for establishing an upper bound on the excess risk Lval(λ, θN λ , Sval) L(λ, θ λ), we employ an effective decomposition method:

Lval λ, θN λ , Sval L (λ, θ λ) = Lval λ, θN λ , Sval Ltr λ, θN λ , Str

| {z } Term 9 + Ltr λ, θN λ , Str Ltr λ, θ λ, Str

| {z } Term 10

+ Ltr λ, θ λ, Str L (λ, θ λ) | {z } Term 11

Regarding Term 10 and Term 11, based on Lemma 10 and Lemma 11, we derive:

Ltr λ, θN λ , Str Ltr λ, θ λ, Str + Ltr λ, θ λ, Str L (λ, θ λ)

2 + N p 1) log(N)

2 log(N) . (F.4)

For Term 9, we decompose it as follows:

Lval λ, θN λ , Sval Ltr λ, θN λ , Str = Lval λ, θN λ , Sval L(λ, θN λ )+L(λ, θN λ ) Ltr λ, θN λ , Str .

As demonstrated in B.8, we utilize Lemma 6 with c = M

m to acquire, with a probability of at least 1 δ,

|Lval λ, θN λ , Sval L λ, θN λ | M

1 2m log(2/δ). (F.5)

Choose N n m, we obtain:

Lval λ, θN λ , Sval L λ, θN λ = O N 1

Concerning the generalization error L(λ, θN λ ) Ltr λ, θN λ , Str , by combining Lemma 8 with Theorem 3.3 in Bassily et al. (2020), we obtain, with a probability of at least 1 δ,

L(λ, θN λ ) Ltr λ, θN λ , Str = O N 1 2 p log N + N p log N + N 1

Published as a conference paper at ICLR 2024

Then, we obtain:

Lval(λ, θN λ , Sval) L(λ, θ λ) = O N 1 2 p log(N) + N p log(N) + N 1

Furthermore, if ℓ(λ, θ, z) is α-Hölder continuous, then by combining Lemma 8 with Proposition G.1. in Lei & Ying (2020), we obtain, with a probability of at least 1 δ,

L(λ, θN λ ) Ltr λ, θN λ , Str = O N 1 p 1 α log N + N p log N + N 1

Then, we obtain:

Lval(λ, θN λ , Sval) L(λ, θ λ) = O N 1 p 1 α log(N) + N p log(N) + N 1

The proof is complete.