# Functional Bilevel Optimization for Machine Learning

Ieva Petrulionyte, Julien Mairal, Michael Arbel
Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France
firstname.lastname@inria.fr

**Abstract.** In this paper, we introduce a new functional point of view on bilevel optimization problems for machine learning, where the inner objective is minimized over a function space. These types of problems are most often solved by using methods developed in the parametric setting, where the inner objective is strongly convex with respect to the parameters of the prediction function. The functional point of view does not rely on this assumption and notably allows using over-parameterized neural networks as the inner prediction function. We propose scalable and efficient algorithms for the functional bilevel optimization problem and illustrate the benefits of our approach on instrumental regression and reinforcement learning tasks.

## 1 Introduction

Bilevel optimization methods solve problems with hierarchical structures, optimizing two interdependent objectives: an inner-level objective and an outer-level one. Initially used in machine learning for model selection [Bennett et al., 2006] and sparse feature learning [Mairal et al., 2012], these methods gained popularity as efficient alternatives to grid search for hyper-parameter tuning [Feurer and Hutter, 2019, Lorraine et al., 2019, Franceschi et al., 2017]. Applications of bilevel optimization include meta-learning [Bertinetto et al., 2019], auxiliary task learning [Navon et al., 2021], reinforcement learning [Hong et al., 2023, Liu et al., 2021a, Nikishin et al., 2022], inverse problems [Holler et al., 2018] and invariant risk minimization [Arjovsky et al., 2019, Ahuja et al., 2020].

Bilevel problems are challenging to solve, even in the well-defined bilevel setting with a unique inner-level solution. This difficulty stems from approximating both the inner-level solution and its sensitivity to the outer-level variable during gradient-based optimization. Methods like Iterative Differentiation (ITD, Baydin et al., 2017) and Approximate Implicit Differentiation (AID, Ghadimi and Wang, 2018) were designed to address these challenges in the well-defined setting, resulting in scalable algorithms with strong convergence guarantees [Domke, 2012, Gould et al., 2016, Ablin et al., 2020, Arbel and Mairal, 2022a, Blondel et al., 2022, Liao et al., 2018, Liu et al., 2022b, Shaban et al., 2019]. These guarantees usually require the inner-level objective to be strongly convex. However, when the inner-level variables are neural network parameters, the lower-level problem becomes non-convex and may have multiple solutions due to over-parameterization. While non-convexity is considered "benign" in this setting [Allen-Zhu et al., 2019, Liu et al., 2022a], the multiplicity of inner-level solutions makes their dependence on the outer-level variable ambiguous [Liu et al., 2021b], posing a major challenge in bilevel optimization for modern machine learning applications.

We identify a common functional structure in bilevel machine learning problems to address the ambiguity challenge that arises with flexible models like neural networks. Specifically, we consider a prediction function $h$ optimized by the inner-level problem over a Hilbert space $\mathcal{H}$. This space consists of functions defined over an input space $\mathcal{X}$ and taking values in a finite-dimensional vector space $\mathcal{V}$.
The optimal prediction function is then evaluated at the outer level to optimize an outer-level parameter $\omega$ in a finite-dimensional space $\Omega = \mathbb{R}^d$, resulting in a functional bilevel problem:

$$\min_{\omega \in \Omega} F(\omega) := L_{\text{out}}(\omega, h^*_\omega) \quad \text{s.t.} \quad h^*_\omega = \arg\min_{h \in \mathcal{H}} L_{\text{in}}(\omega, h). \tag{FBO}$$

Figure 1: Parametric vs. functional approaches for solving (FBO) by implicit differentiation.

In contrast to classical bilevel formulations involving neural networks, where the inner objective is non-convex with respect to the network parameters, the inner objective in (FBO) defines an optimization problem over a prediction function $h$ in a functional vector space $\mathcal{H}$. A crucial consequence of adopting this new viewpoint is that it renders the strong convexity of the inner objective with respect to $h$ a mild assumption, which ensures the uniqueness of the solution $h^*_\omega$ for any outer parameter value $\omega$. Strong convexity with respect to the prediction function is indeed much weaker than strong convexity with respect to model parameters and often holds in practice. For instance, a supervised prediction task with pairs of features/labels $(x, y)$ drawn from some training data distribution is formulated as a regularized risk minimization problem:

$$\min_{h \in \mathcal{H}} L_{\text{in}}(\omega, h) := \mathbb{E}_{x,y}\left[\|y - h(x)\|_2^2\right] + \frac{\omega}{2}\,\|h\|_{\mathcal{H}}^2, \tag{1}$$

where $\mathcal{H}$ is the $L^2$ space of square-integrable functions w.r.t. the distribution of $x$, and $\omega > 0$ controls the amount of regularization. The inner objective is strongly convex in $h$, even though the optimal prediction function $h^*_\omega$ can be highly nonlinear in $x$. The function $h^*_\omega$ may then be approximated, e.g., by an over-parameterized deep neural network, used here as a function approximation tool.
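To make this concrete, the following minimal PyTorch sketch (function and variable names are ours) estimates the inner objective of Equation (1) on a mini-batch. Since $\mathcal{H}$ is the $L^2$ space w.r.t. the distribution of $x$, the functional ridge penalty $\|h\|_{\mathcal{H}}^2 = \mathbb{E}_x[\|h(x)\|^2]$ is itself an expectation that can be sampled:

```python
import torch

def inner_loss(h, x, y, omega):
    """Empirical estimate of Eq. (1): E[||y - h(x)||^2] + (omega/2) * ||h||_H^2.
    `h` can be any model, e.g. an over-parameterized neural network; the
    objective remains strongly convex in the *function* h regardless."""
    pred = h(x)
    mse = ((y - pred) ** 2).sum(dim=-1).mean()   # E[||y - h(x)||_2^2]
    ridge = (pred ** 2).sum(dim=-1).mean()       # ||h||_H^2 = E_x[||h(x)||^2]
    return mse + 0.5 * omega * ridge
```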
Although appealing, the (FBO) formulation necessitates the development of corresponding theory and algorithms, which is the aim of our paper. To the best of our knowledge, this is the first work to propose a functional bilevel point of view that can leverage deep networks for function approximation. The closest works are either restricted to kernel methods [Rosset, 2008, Kunapuli et al., 2008] and thus cannot be used for deep learning models, or propose abstract algorithms that can only be implemented for finite Hilbert spaces [Suonperä and Valkonen, 2024].

We introduce in Section 2 a theoretical framework for functional implicit differentiation in an abstract Hilbert space $\mathcal{H}$ that allows computing the total gradient $\nabla F(\omega)$ using a functional version of the implicit function theorem [Ioffe and Tihomirov, 1979] and the adjoint sensitivity method [Pontryagin, 2018]. This involves solving a well-conditioned functional linear system, equivalently formulated as a regression problem in $\mathcal{H}$, to find an adjoint function $a^*_\omega$ used for computing the total gradient $\nabla F(\omega)$. We then specialize this framework to the common scenario where $\mathcal{H}$ is an $L^2$ space and the objectives are expectations of point-wise losses. This setting covers many machine learning problems (see Sections 4.1, 4.2, and Appendix A). In Section 3, we propose an efficient algorithm where the prediction and adjoint functions can be approximated using parametric models, like neural networks, learned with standard stochastic optimization tools. We further study its convergence using an analysis for biased stochastic gradient descent [Demidovich et al., 2024].

The proposed method, called functional implicit differentiation (FuncID), adopts a novel "differentiate implicitly, then parameterize" approach (left of Figure 1): functional strong convexity is first exploited to derive an unambiguous implicit gradient in function space using a well-defined adjoint function. Then, both the lower-level solution and the adjoint function are approximated using neural networks. This contrasts with traditional AID/ITD approaches (right of Figure 1), which parameterize the inner-level solution as a neural network, leading to a non-convex parametric bilevel problem in the network's parameters. An ambiguous implicit gradient is then computed by approximately solving an unstable or ill-posed linear system [Arbel and Mairal, 2022b]. Consequently, FuncID addresses the ambiguity challenge by exploiting the functional perspective and results in a stable algorithm with reduced time and memory costs. In Section 4, we demonstrate the benefits of our approach on instrumental regression and reinforcement learning tasks, which admit a natural functional bilevel structure.

**Related work on bilevel optimization with non-convex inner objectives.** In principle, considering amended versions of the bilevel problem can resolve the ambiguity arising from non-convex inner objectives. This is the case for optimistic/pessimistic versions of the problem, often considered in the literature on mathematical optimization, where the outer-level objective is optimized over both outer and inner variables, under the optimality constraint on the inner-level variable [Dempe et al., 2007, Ye and Ye, 1997, Ye and Zhu, 1995, Ye et al., 1997]. While tractable methods were recently proposed to solve them [Liu et al., 2021a,b, 2023, Kwon et al., 2024, Shen and Chen, 2023], it is unclear how well the resulting solutions behave on unseen data in the context of machine learning. For instance, when using over-parameterized models for the inner-level problem, their parameters must be further optimized for the outer-level objective, possibly resulting in over-fitting [Vicol et al., 2022]. More recently, Arbel and Mairal [2022b] proposed a game formulation involving a selection map to deal with multiple inner-level solutions. Such a formulation justifies the use of ITD/AID outside the well-defined bilevel setting, by viewing those methods as approximating the Jacobian of the selection map. However, the justification only holds under rather strong geometric assumptions. Additional related work is discussed in Appendix B on bilevel optimization with strongly convex inner objectives, the adjoint sensitivity method that is often used in the context of ordinary differential equations, and amortization techniques [Amos et al., 2023] that have also been exploited for approximately solving bilevel optimization problems [MacKay et al., 2019, Bae and Grosse, 2020].

## 2 A Theoretical Framework for Functional Bilevel Optimization

The functional bilevel problem (FBO) involves an optimal prediction function $h^*_\omega$ for each value of the outer-level parameter $\omega$. Solving (FBO) by using a first-order method then requires characterizing the implicit dependence of $h^*_\omega$ on the outer-level parameter $\omega$ to evaluate the total gradient $\nabla F(\omega)$ in $\mathbb{R}^d$. Indeed, assuming that $h^*_\omega$ and $L_{\text{out}}$ are Fréchet differentiable (this assumption will be discussed later), the gradient $\nabla F(\omega)$ may be obtained by an application of the chain rule:

$$\nabla F(\omega) = g_\omega + \nabla_\omega h^*_\omega\, d_\omega, \quad \text{with} \quad g_\omega := \nabla_\omega L_{\text{out}}(\omega, h^*_\omega) \quad \text{and} \quad d_\omega := \nabla_h L_{\text{out}}(\omega, h^*_\omega). \tag{2}$$
The Fréchet derivative $\nabla_\omega h^*_\omega : \mathcal{H} \to \mathbb{R}^d$ is a linear operator acting on functions in $\mathcal{H}$ and measures the sensitivity of the optimal solution to the outer variable. We will refer to this quantity as the Jacobian in the rest of the paper. While the expression of the gradient in Equation (2) might seem intractable in general, we will see in Section 3 a class of practical algorithms to estimate it.

### 2.1 Functional implicit differentiation

Our starting point is to characterize the dependence of $h^*_\omega$ on the outer variable. To this end, we rely on the following implicit differentiation theorem (proven in Appendix C), which can be seen as a functional version of the one used in AID [Domke, 2012, Pedregosa, 2016], albeit under a much weaker strong convexity assumption that holds in most practical cases of interest.

**Theorem 2.1** (Functional implicit differentiation). Consider problem (FBO) and assume that:

- For any $\omega \in \Omega$, there exists $\mu > 0$ such that $h \mapsto L_{\text{in}}(\omega', h)$ is $\mu$-strongly convex for any $\omega'$ near $\omega$.
- $h \mapsto L_{\text{in}}(\omega, h)$ has finite values and is Fréchet differentiable on $\mathcal{H}$ for all $\omega \in \Omega$.
- $\nabla_h L_{\text{in}}$ is Hadamard differentiable on $\Omega \times \mathcal{H}$ (in the sense of Definition C.1 in Appendix C).

Then, $\omega \mapsto h^*_\omega$ is uniquely defined and is Fréchet differentiable with a Jacobian $\nabla_\omega h^*_\omega$ given by:

$$B_\omega + \nabla_\omega h^*_\omega\, C_\omega = 0, \quad \text{with} \quad B_\omega := \nabla_{\omega,h} L_{\text{in}}(\omega, h^*_\omega), \quad \text{and} \quad C_\omega := \nabla^2_h L_{\text{in}}(\omega, h^*_\omega). \tag{3}$$

The strong convexity assumption on the inner-level objective ensures the existence and uniqueness of the solution $h^*_\omega$, while the differentiability assumptions on $L_{\text{in}}$ and $\nabla_h L_{\text{in}}$ ensure Fréchet differentiability of the map $\omega \mapsto h^*_\omega$. Though the implicit function theorem for Banach spaces [Ioffe and Tihomirov, 1979] could yield similar conclusions, it demands the stronger assumption that $\nabla_h L_{\text{in}}$ is continuously Fréchet differentiable, which is quite restrictive in our setting of interest: for instance, when $\mathcal{H}$ is an $L^2$-space and $L_{\text{in}}$ is an integral functional of the form $L_{\text{in}}(\omega, h) = \int \ell_{\text{in}}(\omega, h(x))\, dx$, with $\ell_{\text{in}}$ defined on $\Omega \times \mathcal{V}$ and satisfying mild smoothness and growth assumptions, then $h \mapsto \nabla_h L_{\text{in}}(\omega, h)$ cannot be Fréchet differentiable with uniformly continuous differential on bounded sets, unless $v \mapsto \ell_{\text{in}}(\omega, v)$ is a polynomial of degree at most 2 (see [Nemirovski and Semenov, 1973, Corollary 2, p. 276] and discussions in [Noll, 1993, Goodman, 1971]). Instead, Theorem 2.1 employs the weaker notion of Hadamard differentiability for $\nabla_h L_{\text{in}}$, widely used in statistics, particularly for deriving the delta-method [van der Vaart and Wellner, 1996, Chapter 3.9]. Consequently, Theorem 2.1 allows us to cover a broader class of functional bilevel problems, as we will see in Section 2.2.

Similarly to AID, only a Jacobian-vector product is needed when computing the total gradient $\nabla F(\omega)$. The result in Proposition 2.2 below relies on the adjoint sensitivity method [Pontryagin, 2018] to provide a more convenient expression for $\nabla F(\omega)$ and is proven in Appendix C.2.

**Proposition 2.2** (Functional adjoint sensitivity). Under the same assumptions on $L_{\text{in}}$ as in Theorem 2.1 and further assuming that $L_{\text{out}}$ is jointly differentiable in $\omega$ and $h$, the total objective $F$ is differentiable with $\nabla F(\omega)$ given by:

$$\nabla F(\omega) = g_\omega + B_\omega a^*_\omega, \tag{4}$$

where the adjoint function $a^*_\omega := -C_\omega^{-1} d_\omega$ is an element of $\mathcal{H}$ that minimizes the quadratic objective:

$$a^*_\omega = \arg\min_{a \in \mathcal{H}} L_{\text{adj}}(\omega, a) := \frac{1}{2}\langle a, C_\omega a\rangle_{\mathcal{H}} + \langle a, d_\omega\rangle_{\mathcal{H}}. \tag{5}$$

Equation (4) indicates that computing the total gradient requires optimizing the quadratic objective (5) to find the adjoint function $a^*_\omega$.
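For intuition, the adjoint expression can be checked in one line by combining Equations (2), (3) and (5); this short derivation is ours, but it uses only quantities defined above:

```latex
% From (3): \nabla_\omega h^*_\omega = -B_\omega C_\omega^{-1}.
% Substituting into the chain rule (2):
\nabla F(\omega)
  = g_\omega + \nabla_\omega h^*_\omega \, d_\omega
  = g_\omega - B_\omega C_\omega^{-1} d_\omega
  = g_\omega + B_\omega a^*_\omega,
\qquad \text{with } a^*_\omega := -C_\omega^{-1} d_\omega,
% which is exactly the unique minimizer of the strongly convex quadratic (5).
```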
The strong convexity of the adjoint objective $L_{\text{adj}}$ ensures the existence of a unique minimizer, and stems from the positive definiteness of its Hessian operator $C_\omega$ due to the inner objective's strong convexity. Both adjoint and inner-level problems live in the same function space $\mathcal{H}$ and are equivalent in terms of conditioning, as the adjoint Hessian operator equals the inner-level Hessian at the optimum. The strong convexity in $\mathcal{H}$ of the adjoint objective guarantees well-defined solutions and holds in many practical cases, as opposed to classical parametric bilevel formulations, which require the more restrictive strong convexity condition in the model's parameters, and without which instabilities may arise due to ill-conditioned linear systems (see Appendix F.1).

### 2.2 Functional bilevel optimization in $L^2$ spaces

Specializing the abstract results from Section 2.1 to a common scenario in machine learning, we consider both inner- and outer-level objectives of (FBO) as expectations of point-wise functions over observed data. Specifically, we have two data distributions $P$ and $Q$ defined over a product space $\mathcal{X} \times \mathcal{Y} \subset \mathbb{R}^{d_x} \times \mathbb{R}^{d_y}$, and denote by $\mathcal{H}$ the Hilbert space of functions $h : \mathcal{X} \to \mathcal{V}$, where $\mathcal{V} = \mathbb{R}^{d_v}$. Given an outer parameter space $\Omega$, we address the following functional bilevel problem:

$$\min_{\omega \in \Omega} L_{\text{out}}(\omega, h^*_\omega) := \mathbb{E}_Q\left[\ell_{\text{out}}(\omega, h^*_\omega(x), x, y)\right] \quad \text{s.t.} \quad h^*_\omega = \arg\min_{h \in \mathcal{H}} L_{\text{in}}(\omega, h) := \mathbb{E}_P\left[\ell_{\text{in}}(\omega, h(x), x, y)\right], \tag{6}$$

where $\ell_{\text{out}}, \ell_{\text{in}}$ are point-wise loss functions defined on $\Omega \times \mathcal{V} \times \mathcal{X} \times \mathcal{Y}$. This setting encompasses various deep learning problems discussed in Sections 4.1 and 4.2, and in Appendix A, representing a specific instance of (FBO). The Hilbert space $\mathcal{H}$ of square-integrable functions not only models a broad range of prediction functions but also facilitates obtaining concrete expressions for the total gradient $\nabla F(\omega)$, enabling the derivation of practical algorithms in Section 3. The following proposition, proved in Appendix D, makes mild technical assumptions on the probability distributions $P, Q$ and the point-wise losses $\ell_{\text{in}}, \ell_{\text{out}}$. It gives an expression for the total gradient in the form of expectations under $P$ and $Q$.

**Proposition 2.3** (Functional adjoint sensitivity in $L^2$ spaces). Under the technical Assumptions (A) to (L) stated in Appendix D.1, the conditions on $L_{\text{in}}$ and $L_{\text{out}}$ in Proposition 2.2 hold and the total gradient $\nabla F(\omega)$ of $F$ is expressed as $\nabla F(\omega) = g_\omega + B_\omega a^*_\omega$, with $a^*_\omega \in \mathcal{H}$ being the minimizer of the objective $L_{\text{adj}}$ in Equation (5). Moreover, $L_{\text{adj}}$, $g_\omega$ and $B_\omega a^*_\omega$ are given by:

$$L_{\text{adj}}(\omega, a) = \frac{1}{2}\,\mathbb{E}_P\!\left[a(x)^\top \partial^2_v \ell_{\text{in}}(\omega, h^*_\omega(x), x, y)\, a(x)\right] + \mathbb{E}_Q\!\left[a(x)^\top \partial_v \ell_{\text{out}}(\omega, h^*_\omega(x), x, y)\right], \tag{7}$$

$$g_\omega = \mathbb{E}_Q\left[\partial_\omega \ell_{\text{out}}(\omega, h^*_\omega(x), x, y)\right], \qquad B_\omega a^*_\omega = \mathbb{E}_P\left[\partial^2_{\omega,v} \ell_{\text{in}}(\omega, h^*_\omega(x), x, y)\, a^*_\omega(x)\right], \tag{8}$$

where $\partial_\omega \ell_{\text{out}}$, $\partial_v \ell_{\text{out}}$, $\partial^2_{\omega,v} \ell_{\text{in}}$, and $\partial^2_v \ell_{\text{in}}$ are partial first- and second-order derivatives of $\ell_{\text{out}}$ and $\ell_{\text{in}}$ with respect to their first and second arguments $\omega$ and $v$.

Assumptions (A) and (B) on $P$ and $Q$ ensure finite second moments and bounded Radon-Nikodym derivatives, maintaining square integrability under both distributions in Equation (6). Assumptions (C) to (L) on $\ell_{\text{in}}$ and $\ell_{\text{out}}$ primarily involve integrability, differentiability, Lipschitz continuity, and strong convexity of $\ell_{\text{in}}$ in its second argument, typically satisfied by objectives like mean squared error or cross-entropy (see Proposition D.1 in Appendix D.1).
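As a worked instance (ours, assuming $Q = P$ and the ridge inner loss of Equation (1), whose point-wise form is $\ell_{\text{in}}(\omega, v, x, y) = \|y - v\|_2^2 + \frac{\omega}{2}\|v\|^2$), the Hessian $\partial^2_v \ell_{\text{in}} = (2 + \omega)\,\mathrm{Id}$ is constant, so the adjoint problem (7) decouples across inputs and admits a closed form:

```latex
% First-order condition of (7) at each x, using \partial_v^2 \ell_{in} = (2+\omega)\,Id:
(2+\omega)\, a^*_\omega(x)
  + \mathbb{E}\!\left[\partial_v \ell_{out}\big(\omega, h^*_\omega(x), x, y\big) \,\middle|\, x\right] = 0
\;\Longrightarrow\;
a^*_\omega(x) = -\frac{\mathbb{E}\!\left[\partial_v \ell_{out}\big(\omega, h^*_\omega(x), x, y\big) \,\middle|\, x\right]}{2+\omega}.
```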
Next, by leveraging Proposition 2.3, we derive practical algorithms for solving (FBO) using function approximation tools like neural networks.

## 3 Methods for Functional Bilevel Optimization in $L^2$ Spaces

We propose Functional Implicit Differentiation (FuncID), a flexible class of algorithms for solving the functional bilevel problem in $L^2$ spaces described in Section 2.2 when samples from the distributions $P$ and $Q$ are available. FuncID relies on three main components detailed in the next subsections:

1. **Empirical objectives.** These approximate the objectives $L_{\text{out}}$, $L_{\text{in}}$ and $L_{\text{adj}}$ as empirical expectations over samples from inner and outer datasets $D_{\text{in}}$ and $D_{\text{out}}$.
2. **Function approximation.** The search space for both the prediction and adjoint functions is restricted to parametric spaces with finite-dimensional parameters $\theta$ and $\xi$. Approximate solutions $\hat h_\omega$ and $\hat a_\omega$ to the optimal functions $h^*_\omega$ and $a^*_\omega$ are obtained by minimizing the empirical objectives.
3. **Total gradient approximation.** FuncID estimates the total gradient $\nabla F(\omega)$ using the empirical objectives and the approximations $\hat h_\omega$ and $\hat a_\omega$ of the prediction and adjoint functions.

Algorithm 1 provides an outline of FuncID, which has a nested structure similar to AID: (1) inner-level optimizations (InnerOpt and AdjointOpt) to update the prediction and adjoint models using scalable algorithms such as stochastic gradient descent [Robbins and Monro, 1951], and (2) an outer-level optimization to update the parameter $\omega$ using a total gradient approximation TotalGrad. An optional warm-start allows initializing the parameters of both the prediction and adjoint models for the current outer-level iteration with those obtained from the previous one.

```
Algorithm 1: FuncID
Input: initial outer, inner, and adjoint parameters ω₀, θ₀, ξ₀; warm-start option WS.
for n = 0, ..., N−1 do
    # Optional warm-start
    if WS = True then (θ₀, ξ₀) ← (θₙ, ξₙ) end if
    # Inner-level optimization
    ĥ_ωₙ, θₙ₊₁ ← InnerOpt(ωₙ, θ₀, D_in)
    # Adjoint optimization
    â_ωₙ, ξₙ₊₁ ← AdjointOpt(ωₙ, ξ₀, ĥ_ωₙ, D)
    # Outer gradient estimation
    Sample a mini-batch B = (B_out, B_in) from D = (D_out, D_in)
    g_out ← TotalGrad(ωₙ, ĥ_ωₙ, â_ωₙ, B)
    ωₙ₊₁ ← update ωₙ using g_out
end for
```

### 3.1 From population losses to empirical objectives

We assume access to two datasets $D_{\text{in}}$ and $D_{\text{out}}$, comprising i.i.d. samples from $P$ and $Q$, respectively. This assumption can be relaxed, such as when using samples from a Markov chain or a Markov decision process to approximate the population objectives. For scalability, we operate in a mini-batch setting, where batches $B = (B_{\text{out}}, B_{\text{in}})$ are sub-sampled from the datasets $D := (D_{\text{out}}, D_{\text{in}})$. Approximating both inner- and outer-level objectives in (6) can be achieved using their empirical versions:

$$\hat L_{\text{out}}(\omega, h, B_{\text{out}}) := \frac{1}{|B_{\text{out}}|} \sum_{(\tilde x, \tilde y) \in B_{\text{out}}} \ell_{\text{out}}(\omega, h(\tilde x), \tilde x, \tilde y), \qquad \hat L_{\text{in}}(\omega, h, B_{\text{in}}) := \frac{1}{|B_{\text{in}}|} \sum_{(x, y) \in B_{\text{in}}} \ell_{\text{in}}(\omega, h(x), x, y).$$

**Adjoint objective.** Using the expression of $L_{\text{adj}}$ from Proposition 2.3, we derive a finite-sample approximation of the adjoint loss by replacing the population expectations with their empirical counterparts. More precisely, assuming we have access to an approximation $\hat h_\omega$ of the inner-level prediction function, we consider the following empirical version of the adjoint objective:

$$\hat L_{\text{adj}}(\omega, a, \hat h_\omega, B) := \frac{1}{2}\,\frac{1}{|B_{\text{in}}|} \sum_{(x,y) \in B_{\text{in}}} a(x)^\top \partial^2_v \ell_{\text{in}}(\omega, \hat h_\omega(x), x, y)\, a(x) + \frac{1}{|B_{\text{out}}|} \sum_{(x,y) \in B_{\text{out}}} a(x)^\top \partial_v \ell_{\text{out}}(\omega, \hat h_\omega(x), x, y). \tag{9}$$

The adjoint objective in Equation (9) requires computing a Hessian-vector product with respect to the output $v \in \mathbb{R}^{d_v}$ of the prediction function $\hat h_\omega$, which is typically of reasonably small dimension, unlike traditional AID methods that necessitate a Hessian-vector product with respect to some model parameters.
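The following PyTorch sketch (ours; `ell_in` and `ell_out` are assumed to return per-sample losses with the signature of Section 2.2) shows one way to evaluate Equation (9), taking the Hessian-vector product w.r.t. the network output $v$ rather than its parameters:

```python
import torch

def adjoint_loss(a, h_hat, omega, batch_in, batch_out, ell_in, ell_out):
    """Empirical adjoint objective of Eq. (9); all names are ours."""
    x_in, y_in = batch_in
    x_out, y_out = batch_out

    # Quadratic term: (1/2) mean_i a(x_i)^T [d^2 ell_in / dv^2] a(x_i).
    v = h_hat(x_in).detach().requires_grad_(True)   # cut gradients into theta
    (grad_v,) = torch.autograd.grad(
        ell_in(omega, v, x_in, y_in).sum(), v, create_graph=True
    )                                               # grad_v[i] = d ell_i / d v_i
    a_in = a(x_in)
    # create_graph=True keeps a_in in the graph so d(quad)/da = C a, as required.
    (hvp,) = torch.autograd.grad(grad_v, v, grad_outputs=a_in, create_graph=True)
    quad = 0.5 * (a_in * hvp).sum(dim=-1).mean()

    # Linear term: mean_j a(x_j)^T [d ell_out / dv].
    v_out = h_hat(x_out).detach().requires_grad_(True)
    (grad_out,) = torch.autograd.grad(ell_out(omega, v_out, x_out, y_out).sum(), v_out)
    lin = (a(x_out) * grad_out.detach()).sum(dim=-1).mean()
    return quad + lin
```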
Importantly, compared to AID, FuncID does not require differentiating twice w.r.t. the parameters $\theta$ of the model $\tau(\theta)$, which results in memory and time savings, as discussed in Appendix F.2.

### 3.2 Approximate prediction and adjoint functions

To find approximate solutions to the prediction and adjoint functions, we rely on three steps: 1) specifying parametric search spaces for both functions, 2) introducing optional regularization to prevent overfitting, and 3) defining a gradient-based optimization procedure on the empirical objectives.

**Parametric search spaces.** We approximate both prediction and adjoint functions using parametric search spaces. A parametric family of functions defined by a map $\tau : \Theta \to \mathcal{H}$ over parameters $\Theta \subset \mathbb{R}^{p_{\text{in}}}$ constrains the prediction function $h$ as $h(x) = \tau(\theta)(x)$. We only require $\tau$ to be continuous and differentiable almost everywhere so that back-propagation can be applied [Bolte et al., 2021]. Notably, unlike AID, we do not need $\tau$ to be twice differentiable, as functional implicit differentiation computes the Hessian w.r.t. the output of $\tau$, not w.r.t. its parameters $\theta$. For flexibility, we can consider a different parameterized model $\nu : \Xi \to \mathcal{H}$ for approximating the adjoint function, defined over parameters $\Xi \subset \mathbb{R}^{p_{\text{adj}}}$, constraining the adjoint similarly to $\tau$. In practice, we often use the same parameterization, typically a neural network, for both the inner-level and adjoint models.

**Regularization.** With empirical objectives and parametric search spaces, we can directly optimize the parameters of both the inner-level model $\tau$ and the adjoint model $\nu$. However, to address finite-sample issues, regularization may be added to these empirical objectives for better generalization. The method allows flexibility in the choice of regularization, accommodating functions $\theta \mapsto R_{\text{in}}(\theta)$ and $\xi \mapsto R_{\text{adj}}(\xi)$, such as a ridge penalty or other commonly used regularization techniques.

**Optimization.** The function InnerOpt (defined in Algorithm 2) optimizes the inner model parameters for a given $\omega$, initialization $\theta_0$, and data $D_{\text{in}}$, using $M$ gradient updates. It returns the optimized parameters $\theta_M$ and the corresponding inner model $\hat h_\omega = \tau(\theta_M)$, approximating the inner-level solution. Similarly, AdjointOpt (defined in Algorithm 3) optimizes the adjoint model parameters with $K$ gradient updates, producing the approximate adjoint function $\hat a_\omega = \nu(\xi_K)$. Other optimization procedures may also be used, especially when closed-form solutions are available, as exploited in some experiments in Section 4. Operations requiring differentiation can be implemented using standard optimization procedures with automatic differentiation packages like PyTorch [Paszke et al., 2019] or JAX [Bradbury et al., 2018].

```
Algorithm 2: InnerOpt(ω, θ₀, D_in)
for m = 0, ..., M−1 do
    Sample batch B_in from D_in
    g_in ← ∇_θ [ L̂_in(ω, τ(θₘ), B_in) + R_in(θₘ) ]
    θₘ₊₁ ← update θₘ using g_in
end for
Return τ(θ_M), θ_M

Algorithm 3: AdjointOpt(ω, ξ₀, ĥ_ω, D)
for k = 0, ..., K−1 do
    Sample batch B from D
    g_adj ← ∇_ξ [ L̂_adj(ω, ν(ξₖ), ĥ_ω, B) + R_adj(ξₖ) ]
    ξₖ₊₁ ← update ξₖ using g_adj
end for
Return ν(ξ_K), ξ_K
```

### 3.3 Total gradient estimation

We exploit Proposition 2.3 to derive Algorithm 4, which allows us to approximate the total gradient $\nabla F(\omega)$ after computing the approximate solutions $\hat h_\omega$ and $\hat a_\omega$. There, we decompose the gradient into two terms: $g_{\text{Exp}}$, an empirical approximation of $g_\omega$ in Equation (8) representing the explicit dependence of $L_{\text{out}}$ on the outer variable $\omega$, and $g_{\text{Imp}}$, an approximation of the implicit gradient term $B_\omega a^*_\omega$ in Equation (8). Both terms are obtained by replacing the expectations with empirical averages over the batches $B_{\text{in}}$ and $B_{\text{out}}$, and using the approximations $\hat h_\omega$ and $\hat a_\omega$ instead of the exact solutions.

```
Algorithm 4: TotalGrad(ω, ĥ_ω, â_ω, B)
g_Exp ← ∇_ω L̂_out(ω, ĥ_ω, B_out)
g_Imp ← (1/|B_in|) Σ_{(x,y) ∈ B_in} ∂²_{ω,v} ℓ_in(ω, ĥ_ω(x), x, y) â_ω(x)
Return g_Exp + g_Imp
```
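A matching PyTorch sketch of Algorithm 4 (again ours, under the same per-sample loss conventions; `omega` is a parameter tensor with `requires_grad=True`) computes the implicit term by differentiating $\langle \partial_v \ell_{\text{in}}, \hat a_\omega(x)\rangle$ w.r.t. $\omega$ while keeping $v$ and $\hat a_\omega(x)$ fixed:

```python
import torch

def total_grad(omega, h_hat, a_hat, batch_in, batch_out, ell_in, ell_out):
    """Approximation of the total gradient in Eq. (8); all names are ours."""
    x_in, y_in = batch_in
    x_out, y_out = batch_out

    # Explicit term g_Exp: grad_omega of the empirical outer loss at h = h_hat.
    out_loss = ell_out(omega, h_hat(x_out).detach(), x_out, y_out).mean()
    g_exp = torch.autograd.grad(out_loss, omega, allow_unused=True)[0]
    if g_exp is None:  # ell_out may not depend explicitly on omega (e.g. 2SLS)
        g_exp = torch.zeros_like(omega)

    # Implicit term g_Imp = mean_i [d^2 ell_in / (d omega d v)] a_hat(x_i):
    # differentiate <d_v ell_in, a_hat(x)> w.r.t. omega with v and a_hat detached.
    v = h_hat(x_in).detach().requires_grad_(True)
    (grad_v,) = torch.autograd.grad(
        ell_in(omega, v, x_in, y_in).sum(), v, create_graph=True
    )
    inner_prod = (grad_v * a_hat(x_in).detach()).sum() / x_in.shape[0]
    (g_imp,) = torch.autograd.grad(inner_prod, omega)
    return g_exp + g_imp
```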
### 3.4 Convergence guarantees

Convergence of Algorithm 1 to stationary points of $F$ depends on approximation errors, which result from sub-optimal inner and adjoint solutions, as shown by the convergence result below.

**Theorem 3.1.** Assume that $F$ is $L$-smooth and admits a finite lower bound $F^\star$. Use the update rule $\omega_{n+1} = \omega_n - \eta\, g_{\text{out}}$ with step size $0 < \eta \leq \frac{1}{4L}$ in Algorithm 1. Under Assumption (a) on the sub-optimality of $\hat h_\omega$ and $\hat a_\omega$, Assumptions (b) to (h) on the smoothness of $\ell_{\text{in}}$ and $\ell_{\text{out}}$, all stated in Appendix E.1, and the assumptions in Proposition 2.3, the iterates $\{\omega_n\}_{n \geq 0}$ of Algorithm 1 satisfy:

$$\min_{0 \leq n \leq N-1} \mathbb{E}\left[\|\nabla F(\omega_n)\|^2\right] \leq \frac{4\left(F(\omega_0) - F^\star\right)}{\eta N} + 2\eta L \sigma^2_{\text{eff}} + \left(c_1 \epsilon_{\text{in}} + c_2 \epsilon_{\text{adj}}\right),$$

where $c_1, c_2, \sigma^2_{\text{eff}}$ are positive constants, and $\epsilon_{\text{in}}, \epsilon_{\text{adj}}$ are sub-optimality errors that result from the inner and adjoint optimization procedures.

Theorem 3.1 is proven in Appendix E and relies on the general convergence result of Demidovich et al. [2024, Theorem 3] for stochastic biased gradient methods. The key idea is to control both the bias and variance of the gradient estimator in terms of generalization errors $\epsilon_{\text{in}}$ and $\epsilon_{\text{adj}}$ when approximating the inner and adjoint solutions. These generalization errors can, in turn, be made smaller in the case of over-parameterized networks by increasing network capacity, number of steps, and sample size [Allen-Zhu et al., 2019, Du et al., 2019, Zou et al., 2020].

## 4 Applications

We consider two applications of the functional bilevel optimization problem: two-stage least squares regression (2SLS) and model-based reinforcement learning. To illustrate its effectiveness, we compare it with other bilevel optimization approaches like AID or ITD, as well as state-of-the-art methods for each application. We provide a versatile implementation of FuncID (https://github.com/inria-thoth/funcBO) in PyTorch [Paszke et al., 2019], which is compatible with standard optimizers (e.g., Adam [Kingma and Ba, 2015]) and supports common regularization techniques. For the reinforcement learning application, we extend an existing JAX [Bradbury et al., 2018] implementation of model-based RL from Nikishin et al. [2022] to apply FuncID. To ensure fairness, experiments are conducted with comparable computational budgets for hyperparameter tuning, using the MLXP experiment management tool [Arbel and Zouaoui, 2024]. Additionally, we maintain consistency by employing identical neural network architectures across all methods and repeating experiments multiple times with different random seeds.

### 4.1 Two-stage least squares regression (2SLS)

Two-stage least squares regression (2SLS) is commonly used in causal representation learning, including instrumental regression or proxy causal learning [Stock and Trebbi, 2003]. Recent studies have applied bilevel optimization approaches to address 2SLS, yielding promising results [Xu et al., 2021b,a, Hu et al., 2023].

Figure 2: Performance metrics for Instrumental Variable (IV) regression. All results are averaged over 20 runs with 5000 training samples and 588 test samples. (Left) Box plot of the test loss, with the dashed black line indicating the mean test error. (Middle) Outer loss vs. training iterations. (Right) Inner loss vs. training iterations. The bold lines in the middle and right plots indicate the mean loss; the shaded area corresponds to the standard deviation.
We particularly focus on 2SLS for instrumental variable (IV) regression, a widely used statistical framework for mitigating endogeneity in econometrics [Blundell et al., 2007, 2012], medical economics [Cawley and Meyerhoefer, 2012], sociology [Bollen, 2012], and more recently, for handling confounders in offline reinforcement learning [Fu et al., 2022].

**Problem formulation.** In an IV problem, the objective is to learn a model $f_\omega : t \mapsto o$ that approximates the structural function $f_{\text{struct}}$ using independent samples $(o, t, x)$ from a data distribution $P$, where $x$ is an instrumental variable. The structural function $f_{\text{struct}}$ delineates the true effect of a treatment $t$ on an outcome $o$. A significant challenge in IV is the presence of an unobserved confounder $\epsilon$, which influences both $t$ and $o$ additively, rendering standard regression ineffective for recovering $f_\omega$. However, if the instrumental variable $x$ impacts the outcome $o$ solely through the treatment $t$ and is independent of the confounder $\epsilon$, it can be employed to elucidate the direct relationship between the treatment $t$ and the outcome $o$ using the 2SLS framework, under mild assumptions on the confounder [Singh et al., 2019]. This adaptation replaces the regression problem with a variant that averages the effect of the treatment $t$ conditionally on $x$:

$$\min_{\omega \in \Omega} \mathbb{E}_P\left[\left(o - \mathbb{E}_P[f_\omega(t) \mid x]\right)^2\right]. \tag{10}$$

Directly estimating the conditional expectation $\mathbb{E}_P[f_\omega(t) \mid x]$ is hard in general. Instead, it is easier to express it, equivalently, as the solution of another regression problem predicting $f_\omega(t)$ from $x$:

$$h^*_\omega := \mathbb{E}_P[f_\omega(t) \mid x] = \arg\min_{h \in \mathcal{H}} \mathbb{E}_P\left[\|f_\omega(t) - h(x)\|^2\right]. \tag{11}$$

Both equations result in the bilevel formulation in Equation (6) with $y = (t, o)$, $Q = P$, and the point-wise losses $\ell_{\text{in}}$ and $\ell_{\text{out}}$ given by $\ell_{\text{in}}(\omega, v, x, (t, o)) = \|f_\omega(t) - v\|^2$ and $\ell_{\text{out}}(\omega, v, x, (t, o)) = \|o - v\|^2$. It is, therefore, possible to directly apply Algorithm 1 to learn $f_\omega$, as we illustrate below.
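Concretely, the two point-wise losses can be written in a few lines; this sketch is ours, with `f_omega` standing for the structural network $f_\omega$ (passed in place of the raw parameter $\omega$) and $y = (t, o)$:

```python
def ell_in(f_omega, v, x, y):
    """Stage 1 (inner): regress f_omega(t) on the instrument x, as in Eq. (11)."""
    t, o = y
    return ((f_omega(t) - v) ** 2).sum(dim=-1)   # per-sample ||f_omega(t) - v||^2

def ell_out(f_omega, v, x, y):
    """Stage 2 (outer): match the outcome o with v = h(x) ~ E[f_omega(t)|x], Eq. (10)."""
    t, o = y
    return ((o - v) ** 2).sum(dim=-1)            # per-sample ||o - v||^2
```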
**Experimental setup.** We study the IV problem using the dSprites dataset [Matthey et al., 2017], comprising synthetic images representing single objects generated from five latent parameters: shape, scale, rotation, and the posX, posY positions in image coordinates. Here, the treatment variable $t$ is the images, the hidden confounder $\epsilon$ is the posY coordinate, and the other four latent variables form the instrumental variable $x$. The outcome $o$ is an unknown structural function $f_{\text{struct}}$ of $t$, contaminated by the confounder $\epsilon$, as detailed in Appendix G.1. We follow the setup of the Deep Feature Instrumental Variable Regression (DFIV) dSprites experiment by Xu et al. [2021a, Section 4.2], which achieves state-of-the-art performance. In this setup, neural networks serve as the prediction function and structural model, optimized to solve the bilevel problem in Equations (10) and (11). We explore two versions of our method: FuncID, which optimizes all adjoint network parameters, and FuncID linear, which learns only the last layer in closed form while inheriting the hidden-layer parameters from the inner prediction function. We compare our method with DFIV, AID, ITD, and penalty-based methods: gradient penalty (GD penal.) and value function penalty (Val penal.) [Shen and Chen, 2023], using identical network architectures and computational budgets for hyperparameter selection. Full details on network architectures, hyperparameters, and training settings are provided in Appendix G.2.

**Results.** Figure 2 compares structural models learned by different methods using 5K training samples (refer to Figure 6 in Appendix G.3 for 10K-sample results). The left subplot illustrates the out-of-sample mean squared error of the learned structural models compared to ground-truth outcomes (uncontaminated by the noise $\epsilon$), while the middle and right subplots show the evolution of the outer and inner objectives over iterations. For the 5K dataset, FuncID outperforms DFIV (p-value = 0.003, one-sided paired t-test), while showing comparable performance on the 10K dataset (Figure 6). AID and ITD perform notably worse, indicating that their parametric approach fails to fully leverage the functional structure. FuncID outperforms the gradient penalty method and performs either better than or comparably to the value function penalty method, though the latter shows higher variance with some particularly bad outliers. While all methods achieve similar outer losses, this criterion alone is only reliable as an indicator of convergence when evaluated near the exact inner-level solution corresponding to the lowest inner-loss values. Interestingly, FuncID obtains the lowest inner-loss values, suggesting its outer loss is a more reliable indicator of convergence.

### 4.2 Model-based reinforcement learning

Model-based reinforcement learning (RL) naturally yields bilevel optimization formulations, since several components of an RL agent need to be learned using different objectives. Recently, Nikishin et al. [2022] showed that casting model-based RL as a bilevel problem can result in better tolerance to model misspecification. Our experiments show that the functional bilevel framework yields improved results even when the model is well-specified, suggesting a broader use of the bilevel formulation.

**Problem formulation.** In model-based RL, the Markov decision process (MDP) is approximated by a probabilistic model $q_\omega$ with parameters $\omega$ that can predict the next state $s_\omega(x)$ and reward $r_\omega(x)$, given a pair $x := (s, a)$ where $s$ is the current environment state and $a$ is the action of an agent. A second model $h$ can be used to approximate the action-value function $h(x)$ that computes the expected cumulative reward given the current state-action pair. Traditionally, the action-value function is learned using the current MDP model, while the latter is learned independently of the action-value function using maximum likelihood estimation (MLE) [Sutton, 1991]. In the bilevel formulation of model-based RL by Nikishin et al. [2022], the inner-level problem involves learning the optimal action-value function $h^*_\omega$ with the current MDP model $q_\omega$ by minimizing the Bellman error. The inner-level objective can be expressed as an expectation of a point-wise loss $f$ with samples $(x, r', s') \sim P$, derived from the agent-environment interaction:

$$h^*_\omega = \arg\min_{h \in \mathcal{H}} \mathbb{E}_P\left[f(h(x), r_\omega(x), s_\omega(x))\right]. \tag{12}$$

Here, the future state and reward $(r', s')$ are replaced by the MDP model predictions $r_\omega(x)$ and $s_\omega(x)$. In practice, samples from $P$ are obtained using a replay buffer. The buffer accumulates data over several episodes of interaction with the environment, and can therefore be considered independent of the agent's policy.
The point-wise loss function $f$ represents the error between the action-value function prediction and the expected cumulative reward given the current state-action pair:

$$f(v, r', s') := \frac{1}{2}\left(v - r' - \gamma \log \sum_{a'} e^{\bar h(s', a')}\right)^2,$$

with $\bar h$ a lagged version of $h$ (an exponentially averaged network) and $\gamma$ a discount factor. The MDP model is learned implicitly through the optimal function $h^*_\omega$, by minimizing the Bellman error w.r.t. $\omega$:

$$\min_{\omega \in \Omega} \mathbb{E}_P\left[f(h^*_\omega(x), r', s')\right]. \tag{13}$$

Equations (12) and (13) define a bilevel problem as in Equation (6), where $Q = P$, $y = (r', s')$, and the point-wise losses $\ell_{\text{in}}$ and $\ell_{\text{out}}$ are given by $\ell_{\text{in}}(\omega, v, x, y) = f(v, r_\omega(x), s_\omega(x))$ and $\ell_{\text{out}}(\omega, v, x, y) = f(v, r', s')$. Therefore, we can directly apply Algorithm 1 to learn both the MDP model $q_\omega$ and the optimal action-value function $h^*_\omega$.
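In code, the shared point-wise loss and its two instantiations look as follows (a sketch under our naming; `h_bar` is the lagged action-value network, and `r_omega`, `s_omega` stand for the MDP model heads $r_\omega$, $s_\omega$):

```python
import torch

def bellman_loss(v, r_next, s_next, h_bar, gamma=0.99):
    """Point-wise loss f of Section 4.2 (names are ours):
    f(v, r', s') = 0.5 * (v - r' - gamma * logsumexp_{a'} h_bar(s', a'))^2."""
    target = r_next + gamma * torch.logsumexp(h_bar(s_next), dim=-1)
    # No detach on `target`: in ell_in, gradients must flow into the MDP model
    # through r_next = r_omega(x) and s_next = s_omega(x); h_bar itself is frozen.
    return 0.5 * (v - target) ** 2

# ell_in(omega, v, x, y)  = bellman_loss(v, r_omega(x), s_omega(x), h_bar)  # model rollout
# ell_out(omega, v, x, y) = bellman_loss(v, r_next, s_next, h_bar)          # observed (r', s')
```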
**Experimental setup.** We apply FuncID to the CartPole control problem, a classic benchmark in reinforcement learning [Brockman et al., 2016, Nagendra et al., 2017]. The goal is to balance a pole attached to a cart by moving the cart horizontally. Following Nikishin et al. [2022], we use a model-based approach and consider two scenarios: one with a well-specified network accurately representing the MDP, and another with a misspecified model having fewer hidden-layer units, limiting its capacity.

Figure 3: Average reward on an evaluation environment vs. training iterations on the CartPole task. (Left) Well-specified model with 32 hidden units. (Right) Misspecified model with 3 hidden units. Both plots show the mean reward over 10 runs; the shaded region is the 95% confidence interval.

Using the bilevel formulation in Equations (12) and (13), we compare FuncID with the Optimal Model Design (OMD) algorithm [Nikishin et al., 2022], a variant of AID. Additionally, we compare against a standard single-level model-based RL formulation using maximum likelihood estimation (MLE) [Sutton, 1991]. For the adjoint function in FuncID, we derive a simple closed-form expression based on the structure of the adjoint objective (see Appendix H.1). We follow the experimental setup of Nikishin et al. [2022], providing full details and hyperparameters in Appendix H.2.

**Results.** Figure 3 illustrates the evolution of the training reward for FuncID, OMD, and MLE in both well-specified and misspecified scenarios. FuncID consistently performs well across settings. In the well-specified case, where OMD achieves a reward of 4, FuncID reaches the maximum reward of 5, matching MLE (left of Figure 3). In the misspecified scenario, FuncID performs comparably to OMD and significantly outperforms MLE (right of Figure 3). Moreover, FuncID tends to converge faster than MLE (see Figure 7 in Appendix H.3) and yields consistently better prediction error than OMD (see Figure 8 in Appendix H.3). These findings align with Nikishin et al. [2022], suggesting that MLE may prioritize minimizing prediction errors, potentially leading to overfitting on irrelevant features. In contrast, OMD and FuncID focus on maximizing expected returns, especially in the presence of model misspecification. Our results highlight the effectiveness of (FBO) even in well-specified settings, suggesting, for future work, further investigations into more general RL tasks.

## 5 Discussion and concluding remarks

This paper introduced a functional paradigm for bilevel optimization in machine learning, shifting the focus from parameter space to function space. The proposed approach specifically addresses the ambiguity challenge arising from using deep networks in bilevel optimization. The paper establishes the validity of the functional framework by developing a theory of functional implicit differentiation, proving convergence for the proposed FuncID method, and numerically comparing it with other bilevel optimization methods.

The theoretical foundations of our work rely on several key assumptions worth examining. While our convergence guarantees assume both inner and adjoint optimization problems are solved to some optimality, this assumption is supported by recent results on global convergence in over-parameterized networks [Allen-Zhu et al., 2019, Liu et al., 2022a]. However, quantifying these optimality errors more precisely and understanding their relationship to optimization procedures remains an open challenge. Additionally, like other bilevel methods, our approach requires careful hyperparameter selection, which can impact practical implementation.

Several promising directions emerge for future research. While we focus on $L^2$ spaces, exploring alternative function spaces (such as reproducing kernel Hilbert spaces or Sobolev spaces) could reveal additional advantages for specific applications. Furthermore, extending our framework to non-smooth objectives or constrained optimization problems, potentially building on existing work in non-smooth implicit differentiation [Bolte et al., 2022], would broaden its applicability.

## Acknowledgments

This work was supported by the ERC grant number 101087696 (APHELAIA project) and by ANR 3IA MIAI@Grenoble Alpes (ANR-19-P3IA-0003) and the ANR project BONSAI (grant ANR-23-CE23-0012-01). We thank Edouard Pauwels and Samuel Vaiter for their insightful discussions.

## References

P. Ablin, G. Peyré, and T. Moreau. Super-efficiency of automatic differentiation for functions defined as a minimum. International Conference on Machine Learning (ICML), 2020.

K. Ahuja, K. Shanmugam, K. Varshney, and A. Dhurandhar. Invariant risk minimization games. International Conference on Machine Learning (ICML), 2020.

Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning via over-parameterization. International Conference on Machine Learning (ICML), 2019.

B. Amos et al. Tutorial on amortized optimization. Foundations and Trends in Machine Learning, 16(5):592-732, 2023.

M. Arbel and J. Mairal. Amortized implicit differentiation for stochastic bilevel optimization. International Conference on Learning Representations (ICLR), 2022a.

M. Arbel and J. Mairal. Non-convex bilevel games with critical point selection maps. Advances in Neural Information Processing Systems (NeurIPS), 2022b.

M. Arbel and A. Zouaoui. MLXP: A framework for conducting replicable machine learning experiments in Python. arXiv preprint arXiv:2402.13831, 2024. URL https://github.com/inria-thoth/mlxp.

M. Arjovsky, L. Bottou, I. Gulrajani, and D. Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.

J. Bae and R. B. Grosse. Delta-STN: Efficient bilevel optimization for neural networks using structured response Jacobians. Advances in Neural Information Processing Systems (NeurIPS), 2020.

D. Bansal, R. T. Chen, M. Mukadam, and B. Amos. TaskMet: Task-driven metric learning for model learning. Advances in Neural Information Processing Systems (NeurIPS), 2023.

H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer Publishing Company, Incorporated, 1st edition, 2011. ISBN 1441994661.

A. G. Baydin, B. A. Pearlmutter, A. A. Radul, and J. M. Siskind. Automatic differentiation in machine learning: A survey. Journal of Machine Learning Research (JMLR), 18(153):1-43, 2017.
Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157-166, 1994.

K. P. Bennett, J. Hu, X. Ji, G. Kunapuli, and J.-S. Pang. Model selection via bilevel optimization. IEEE International Joint Conference on Neural Network Proceedings, 2006.

L. Bertinetto, J. F. Henriques, P. H. Torr, and A. Vedaldi. Meta-learning with differentiable closed-form solvers. International Conference on Learning Representations (ICLR), 2019.

M. Bińkowski, D. J. Sutherland, M. Arbel, and A. Gretton. Demystifying MMD GANs. arXiv preprint arXiv:1801.01401, 2018.

M. Blondel, Q. Berthet, M. Cuturi, R. Frostig, S. Hoyer, F. Llinares-López, F. Pedregosa, and J.-P. Vert. Efficient and modular implicit differentiation. Advances in Neural Information Processing Systems (NeurIPS), 2022.

R. Blundell, X. Chen, and D. Kristensen. Semi-nonparametric IV estimation of shape-invariant Engel curves. Econometrica, 75(6):1613-1669, 2007.

R. Blundell, J. L. Horowitz, and M. Parey. Measuring the price responsiveness of gasoline demand: Economic shape restrictions and nonparametric demand estimation. Quantitative Economics, 3(1):29-51, 2012.

K. A. Bollen. Instrumental variables in sociology and the social sciences. Annual Review of Sociology, 38:37-72, 2012.

J. Bolte, T. Le, E. Pauwels, and T. Silveti-Falls. Nonsmooth implicit differentiation for machine learning and optimization. Advances in Neural Information Processing Systems (NeurIPS), 2021.

J. Bolte, E. Pauwels, and S. Vaiter. Automatic differentiation of nonsmooth iterative algorithms. Advances in Neural Information Processing Systems (NeurIPS), 2022.

J. Bolte, E. Pauwels, and S. Vaiter. One-step differentiation of iterative algorithms. Advances in Neural Information Processing Systems (NeurIPS), 2024.

J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne, and Q. Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.

A. Brock, T. Lim, J. Ritchie, and N. Weston. SMASH: One-shot model architecture search through hypernetworks. International Conference on Learning Representations (ICLR), 2018.

G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

J. Cawley and C. Meyerhoefer. The medical care costs of obesity: an instrumental variables approach. Journal of Health Economics, 31(1):219-230, 2012.

R. T. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud. Neural ordinary differential equations. Advances in Neural Information Processing Systems (NeurIPS), 2018.

L. Debnath and P. Mikusinski. Introduction to Hilbert Spaces with Applications. Academic Press, 2005.

Y. Demidovich, G. Malinovsky, I. Sokolov, and P. Richtárik. A guide through the zoo of biased SGD. Advances in Neural Information Processing Systems (NeurIPS), 2024.

S. Dempe, J. Dutta, and B. Mordukhovich. New necessary optimality conditions in optimistic bilevel programming. Optimization, 56(5-6):577-604, 2007.

J. Domke. Generic methods for optimization-based modeling. International Conference on Artificial Intelligence and Statistics (AISTATS), 2012.

S. Du, J. Lee, H. Li, L. Wang, and X. Zhai. Gradient descent finds global minima of deep neural networks. International Conference on Machine Learning (ICML), 2019.

Z. Fang and A. Santos. Inference on directionally differentiable functions. The Review of Economic Studies, 86(1):377-412, 2018.
M. Feurer and F. Hutter. Hyperparameter optimization. Springer International Publishing, 2019.

L. Franceschi, M. Donini, P. Frasconi, and M. Pontil. Forward and reverse gradient-based hyperparameter optimization. International Conference on Machine Learning (ICML), 2017.

Z. Fu, Z. Qi, Z. Wang, Z. Yang, Y. Xu, and M. R. Kosorok. Offline reinforcement learning with instrumental variables in confounded Markov decision processes. arXiv preprint arXiv:2209.08666, 2022.

S. Ghadimi and M. Wang. Approximation methods for bilevel programming. Optimization and Control, 2018.

V. Goodman. Quasi-differentiable functions of Banach spaces. Proceedings of the American Mathematical Society, 30(2):367-370, 1971.

S. Gould, B. Fernando, A. Cherian, P. Anderson, R. S. Cruz, and E. Guo. On differentiating parameterized argmin and argmax problems with application to bi-level optimization. arXiv preprint arXiv:1607.05447, 2016.

R. Grazzi, L. Franceschi, M. Pontil, and S. Salzo. On the iteration complexity of hypergradient computation. International Conference on Machine Learning (ICML), 2020.

D. Ha, A. M. Dai, and Q. V. Le. Hypernetworks. International Conference on Learning Representations (ICLR), 2017.

G. Holler, K. Kunisch, and R. C. Barnard. A bilevel approach for parameter learning in inverse problems. Inverse Problems, 34(11):115012, 2018.

M. Hong, H. Wai, Z. Wang, and Z. Yang. A two-timescale stochastic algorithm framework for bilevel optimization: Complexity analysis and application to actor-critic. SIAM Journal on Optimization, 33(1):147-180, 2023.

Y. Hu, J. Wang, Y. Xie, A. Krause, and D. Kuhn. Contextual stochastic bilevel optimization. Advances in Neural Information Processing Systems (NeurIPS), 2023.

A. D. Ioffe and V. M. Tihomirov. Theory of Extremal Problems. Series: Studies in Mathematics and its Applications 6. Elsevier, 1979.

K. Ji, J. Yang, and Y. Liang. Bilevel optimization: Convergence analysis and enhanced design. International Conference on Machine Learning (ICML), 2021.

J. Jia and A. R. Benson. Neural jump stochastic differential equations. Advances in Neural Information Processing Systems (NeurIPS), 2019.

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR), 2015.

D. P. Kingma and M. Welling. Auto-encoding variational Bayes. International Conference on Learning Representations (ICLR), 2014.

G. Kunapuli, K. Bennett, J. Hu, and J.-S. Pang. Bilevel model selection for support vector machines. CRM Proceedings and Lecture Notes, 45:129-158, 2008.

J. Kwon, D. Kwon, S. Wright, and R. D. Nowak. On penalty methods for nonconvex bilevel optimization and first-order stochastic approximation. International Conference on Learning Representations (ICLR), 2024.

S. Li, Z. Wang, A. Narayan, R. Kirby, and S. Zhe. Meta-learning with adjoint methods. International Conference on Artificial Intelligence and Statistics (AISTATS), 2023.

X. Li, T.-K. L. Wong, R. T. Chen, and D. Duvenaud. Scalable gradients for stochastic differential equations. International Conference on Artificial Intelligence and Statistics (AISTATS), 2020.

R. Liao, Y. Xiong, E. Fetaya, L. Zhang, K. Yoon, X. Pitkow, R. Urtasun, and R. Zemel. Reviving and improving recurrent back-propagation. International Conference on Machine Learning (ICML), 2018.

C. Liu, L. Zhu, and M. Belkin. Loss landscapes and optimization in over-parameterized non-linear systems and neural networks. Applied and Computational Harmonic Analysis, 59:85-116, 2022a.
R. Liu, X. Liu, S. Zeng, J. Zhang, and Y. Zhang. Value-function-based sequential minimization for bi-level optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 45:15930-15948, 2021a.

R. Liu, Y. Liu, S. Zeng, and J. Zhang. Towards gradient-based bilevel optimization with non-convex followers and beyond. Advances in Neural Information Processing Systems (NeurIPS), 2021b.

R. Liu, J. Gao, J. Zhang, D. Meng, and Z. Lin. Investigating bi-level optimization for learning and vision from a unified perspective: a survey and beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 44:10045-10067, 2022b.

R. Liu, Y. Liu, W. Yao, S. Zeng, and J. Zhang. Averaged method of multipliers for bi-level optimization without lower-level strong convexity. International Conference on Machine Learning (ICML), pages 21839-21866. PMLR, 2023.

J. Lorraine, P. Vicol, and D. K. Duvenaud. Optimizing millions of hyperparameters by implicit differentiation. International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.

M. MacKay, P. Vicol, J. Lorraine, D. Duvenaud, and R. B. Grosse. Self-tuning networks: Bilevel optimization of hyperparameters using structured best-response functions. International Conference on Learning Representations (ICLR), 2019.

J. Mairal, F. Bach, and J. Ponce. Task-driven dictionary learning. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 34(4):791-804, 2012.

C. C. Margossian and M. Betancourt. Efficient automatic differentiation of implicit functions. arXiv preprint arXiv:2112.14217, 2021.

P. Marion, A. Korba, P. Bartlett, M. Blondel, V. De Bortoli, A. Doucet, F. Llinares-López, C. Paquette, and Q. Berthet. Implicit diffusion: Efficient optimization through stochastic sampling. arXiv preprint arXiv:2402.05468, 2024.

L. Matthey, I. Higgins, D. Hassabis, and A. Lerchner. dSprites: Disentanglement testing sprites dataset, 2017.

S. Nagendra, N. Podila, R. Ugarakhod, and K. George. Comparison of reinforcement learning algorithms applied to the cart-pole problem. International Conference on Advances in Computing, Communications and Informatics (ICACCI), 2017.

A. Navon, I. Achituve, H. Maron, G. Chechik, and E. Fetaya. Auxiliary learning by implicit differentiation. International Conference on Learning Representations (ICLR), 2021.

A. Nemirovski and S. Semenov. On polynomial approximation of functions on Hilbert space. Mathematics of the USSR-Sbornik, 21(2):255, 1973.

E. Nikishin, R. Abachi, R. Agarwal, and P.-L. Bacon. Control-oriented model-based reinforcement learning with implicit differentiation. AAAI Conference on Artificial Intelligence, 2022.

D. Noll. Second Order Differentiability of Integral Functionals on Sobolev Spaces and L2-Spaces. Walter de Gruyter, Berlin/New York, 1993.

R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. International Conference on Machine Learning (ICML), 2013.

A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems (NeurIPS), 2019.

F. Pedregosa. Hyperparameter optimization with approximate gradient. International Conference on Machine Learning (ICML), 2016.
L. S. Pontryagin. Mathematical Theory of Optimal Processes. Routledge, 2018.

D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. International Conference on Machine Learning (ICML), 2014.

H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400-407, 1951.

S. Rosset. Bi-level path following for cross validated solution of kernel quantile regression. International Conference on Machine Learning (ICML), 2008.

A. Shaban, C.-A. Cheng, N. Hatch, and B. Boots. Truncated back-propagation for bilevel optimization. International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.

H. Shen and T. Chen. On penalty-based bilevel gradient descent method. International Conference on Machine Learning (ICML), 2023.

R. Singh, M. Sahani, and A. Gretton. Kernel instrumental variable regression. Advances in Neural Information Processing Systems (NeurIPS), 2019.

J. H. Stock and F. Trebbi. Retrospectives: Who invented instrumental variable regression? Journal of Economic Perspectives, 17(3):177-194, 2003.

E. Suonperä and T. Valkonen. Linearly convergent bilevel optimization with single-step inner methods. Computational Optimization and Applications, 87(2):571-610, 2024.

R. S. Sutton. Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin, 2(4):160-163, 1991.

A. van der Vaart. Efficiency and Hadamard differentiability. Scandinavian Journal of Statistics, 18(1):63-75, 1991.

A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes with Applications to Statistics, pages 16-28. Springer New York, 1996.

P. Vicol, J. Lorraine, D. Duvenaud, and R. Grosse. Implicit regularization in overparameterized bilevel optimization. International Conference on Machine Learning (ICML), 2022.

C. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. Advances in Neural Information Processing Systems (NIPS), 2000.

L. Xu, Y. Chen, S. Srinivasan, N. de Freitas, A. Doucet, and A. Gretton. Learning deep features in instrumental variable regression. International Conference on Learning Representations (ICLR), 2021a.

L. Xu, H. Kanagawa, and A. Gretton. Deep proxy causal learning and its application to confounded bandit policy evaluation. Advances in Neural Information Processing Systems (NeurIPS), 2021b.

J. Ye and X. Ye. Necessary optimality conditions for optimization problems with variational inequality constraints. Mathematics of Operations Research, 22(4):977-997, 1997.

J. Ye and D. Zhu. Optimality conditions for bilevel programming problems. Optimization, 33(1):9-27, 1995.

J. Ye, D. Zhu, and Q. J. Zhu. Exact penalization and necessary optimality conditions for generalized bilevel programming problems. SIAM Journal on Optimization, 7(2):481-507, 1997.

C. Zhang, M. Ren, and R. Urtasun. Graph hypernetworks for neural architecture search. International Conference on Learning Representations (ICLR), 2019.

Y. D. Zhong, B. Dey, and A. Chakraborty. Symplectic ODE-Net: Learning Hamiltonian dynamics with control. arXiv preprint arXiv:1909.12077, 2019.

D. Zou, Y. Cao, D. Zhou, and Q. Gu. Gradient descent optimizes over-parameterized deep ReLU networks. Machine Learning, 109:467-492, 2020.
## A Examples of FBO Formulations

The functional bilevel setting applies to various practical bilevel problems where the objectives depend solely on model predictions, not on their parameterization. Below, we discuss a few examples.

**Auxiliary task learning.** As in Equation (1), consider a main prediction task with features $x$ and labels $y$, equipped with a loss function $f(y, h(x))$. The goal of auxiliary task learning is to learn how a set of auxiliary tasks, represented by a vector $f_{\text{aux}}(y, h(x))$, could help solve the main task. This problem is formulated by Navon et al. [2021] as a bilevel problem, which can be written as (FBO) with

$$L_{\text{out}}(\omega, h) = \mathbb{E}_{(y,x) \sim D_{\text{val}}}\left[f(y, h(x))\right],$$

where the loss is evaluated over a validation dataset $D_{\text{val}}$, and

$$L_{\text{in}}(\omega, h) = \mathbb{E}_{(y,x) \sim D_{\text{train}}}\left[f(y, h(x)) + g_\omega(f_{\text{aux}}(y, h(x)))\right],$$

where an independent training dataset $D_{\text{train}}$ is used, and $g_\omega$ is a function that combines the auxiliary losses into a scalar value.

**Task-driven metric learning.** Considering now a regression problem with features $x$ and labels $y$, the goal of task-driven metric learning, formulated by Bansal et al. [2023], is to learn a metric parameterized by $\omega$ for the regression task such that the corresponding predictor $h^*_\omega$ performs well on a downstream task $L_{\text{task}}$. This can be formulated as (FBO) with $L_{\text{out}}(\omega, h) = L_{\text{task}}(h)$ and

$$L_{\text{in}}(\omega, h) = \mathbb{E}_{(y,x)}\left[\|y - h(x)\|^2_{A_\omega(x)}\right],$$

where $\|\cdot\|^2_{A_\omega}$ is the squared Mahalanobis norm with parameters $\omega$ and $A_\omega(x)$ is a data-dependent metric that allows emphasizing features that are more important for the downstream task.
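For instance, the inner point-wise loss of the metric-learning example can be sketched as follows (ours; `A_omega` is assumed to map a batch of inputs to batched positive semi-definite matrices):

```python
import torch

def ell_in_metric(A_omega, v, x, y):
    """Per-sample squared Mahalanobis residual ||y - v||^2_{A_omega(x)}."""
    r = y - v                                  # residual, shape (batch, d_y)
    A = A_omega(x)                             # metric, shape (batch, d_y, d_y)
    return torch.einsum('bi,bij,bj->b', r, A, r)
```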
B Additional Related Work

Bilevel optimization in machine learning. Two families of bilevel methods are prevalent in machine learning due to their scalability: iterative (or "unrolled") differentiation (ITD, Baydin et al., 2017) and approximate implicit differentiation (AID, Ghadimi and Wang, 2018). ITD approximates the optimal inner-level solution using an unrolled function from a sequence of differentiable optimization steps, optimizing the outer variable via back-propagation [Shaban et al., 2019, Bolte et al., 2024]. The gradient approximation error decreases linearly with the number of steps when the inner level is strongly convex, though at increased computational and memory costs [Grazzi et al., 2020, Theorem 2.1]. ITD is popular for its simplicity and availability in major deep learning libraries [Bradbury et al., 2018], but can be unstable, especially with non-convex inner objectives [Pascanu et al., 2013, Bengio et al., 1994, Arbel and Mairal, 2022b]. AID uses the implicit function theorem (IFT) to derive the Jacobian of the inner-level solution with respect to the outer variable [Lorraine et al., 2019, Pedregosa, 2016], solving a finite-dimensional linear system for an adjoint vector representing optimality constraints. AID offers strong convergence guarantees for smooth, strongly convex inner objectives [Ji et al., 2021, Arbel and Mairal, 2022a]. However, without strong convexity, the linear system can become ill-posed due to a degenerate Hessian, leading to instabilities, especially with overparameterized deep neural networks. Our proposed approach avoids this issue, even when using deep networks for function approximation.

Adjoint sensitivity method. The adjoint sensitivity method [Pontryagin, 2018] efficiently differentiates a controlled variable with respect to a control parameter. In bilevel optimization, AID applies a finite-dimensional version of this method [Margossian and Betancourt, 2021, Section 2]. Infinite-dimensional versions have been used to differentiate solutions of ordinary differential equations (ODEs) with respect to their defining parameters [Margossian and Betancourt, 2021, Section 3], and have been used in machine learning to optimize the parameters of a vector field describing an ODE [Chen et al., 2018]. Here, the ODE's vector field, parameterized by a neural network, is optimized to match observations. The adjoint sensitivity method offers an efficient alternative to the unstable process of back-propagating through ODE solvers, requiring only the solution of an adjoint ODE to compute gradient updates, improving performance [Jia and Benson, 2019, Zhong et al., 2019, Li et al., 2020]. This method has been adapted to meta-learning [Li et al., 2023], viewing the inner optimization as an ODE evolution with gradients obtained via the adjoint ODE. Recently, Marion et al. [2024] employed the adjoint sensitivity method for optimizing diffusion models, where an adjoint SDE is solved to compute the total gradient. Unlike these works, which use the adjoint method for ODE/SDE solutions as functions of time, our work applies an infinite-dimensional version of the adjoint sensitivity method to general learning problems, where the solutions are functions of the input data.

Amortization. Recently, several methods have used amortization to approximately solve bilevel problems [MacKay et al., 2019, Bae and Grosse, 2020]. These methods employ a parametric model called a hypernetwork [Ha et al., 2017, Brock et al., 2018, Zhang et al., 2019], optimized to directly predict the inner-level solution given the outer-level parameter $\omega$. Amortized methods treat the two levels as independent optimization problems: (1) learning the hypernetwork for a range of $\omega$, and (2) performing first-order descent on $\omega$ using the hypernetwork as a proxy for the inner-level solution. Unlike ITD, AID, or our functional implicit differentiation method, amortized methods do not fully exploit the implicit dependence between the two levels. They are similar to amortized variational inference [Kingma and Welling, 2014, Rezende et al., 2014], where a parametric model produces approximate samples from a posterior distribution. Amortization methods perform well when the inner solution's dependence on $\omega$ is simple, but may fail otherwise [Amos et al., 2023, pages 71-72]. In contrast, our functional implicit differentiation framework adapts to complex implicit dependencies between the inner solution and $\omega$.

C Theory for Functional Implicit Differentiation

C.1 Preliminary results

We recall the definition of Hadamard differentiability and provide in Proposition C.2 a general property of Hadamard differentiable maps that we will exploit to prove Theorem 2.1 in Appendix C.2.

Definition C.1 (Hadamard differentiability). Let $\mathcal{A}$ and $\mathcal{B}$ be two separable Banach spaces. A function $L : \mathcal{A}\to\mathcal{B}$ is said to be Hadamard differentiable [van der Vaart, 1991, Fang and Santos, 2018] if for any $a\in\mathcal{A}$, there exists a continuous linear map $d_aL(a) : \mathcal{A}\to\mathcal{B}$ so that for any sequence $(u_n)_{n\ge1}$ in $\mathcal{A}$ converging to an element $u\in\mathcal{A}$, and any real-valued, non-vanishing sequence $(t_n)_{n\ge1}$ converging to 0, it holds that:
$$\left\Vert\frac{1}{t_n}\big(L(a + t_nu_n) - L(a)\big) - d_aL(a)u\right\Vert \xrightarrow[n\to+\infty]{} 0. \tag{14}$$

Proposition C.2. Let $\mathcal{A}$ and $\mathcal{B}$ be two separable Banach spaces. Let $L : \mathcal{A}\to\mathcal{B}$ be a Hadamard differentiable map with differential $d_aL$ at point $a$. Consider a bounded linear map $J : \mathbb{R}^n\to\mathcal{A}$ defined over a Euclidean space $\mathbb{R}^n$ of finite dimension $n$ and taking values in $\mathcal{A}$. Then, the following holds:
$$L(a + Ju) = L(a) + d_aL(a)Ju + o(\Vert u\Vert).$$
Proof. Consider a sequence $(u_k)_{k\ge1}$ in $\mathbb{R}^n$ such that $u_k$ converges to 0 with $\Vert u_k\Vert > 0$ for all $k\ge1$, and define the first-order error $E_k$ as follows:
$$E_k = \frac{1}{\Vert u_k\Vert}\big\Vert L(a + Ju_k) - L(a) - d_aL(a)Ju_k\big\Vert.$$
The goal is to show that $E_k$ converges to 0. We can write $u_k$ as $u_k = t_k\bar u_k$ with $t_k = \Vert u_k\Vert$ and $\Vert\bar u_k\Vert = 1$, so that:
$$E_k = \left\Vert\frac{1}{t_k}\big(L(a + t_kJ\bar u_k) - L(a)\big) - d_aL(a)J\bar u_k\right\Vert.$$
If $E_k$ were unbounded, then, by contradiction, there would exist a subsequence $(E_{\phi(k)})_{k\ge1}$ converging to $+\infty$, with $\phi(k)$ increasing and $\phi(k)\to+\infty$. Moreover, since $\bar u_k$ is bounded, one can further choose the subsequence $E_{\phi(k)}$ so that $\bar u_{\phi(k)}$ converges to some element $\bar u$. We can use the following upper-bound:
$$E_k \le \underbrace{\left\Vert\frac{1}{t_k}\big(L(a + t_kJ\bar u_k) - L(a)\big) - d_aL(a)J\bar u\right\Vert}_{\bar E_k} + \Vert d_aL(a)J\Vert\,\Vert\bar u_k - \bar u\Vert, \tag{15}$$
where we used that $d_aL(a)$ is bounded. Since $L$ is Hadamard differentiable, $\bar E_{\phi(k)}$ converges to 0. Moreover, $\Vert\bar u_{\phi(k)} - \bar u\Vert$ also converges to 0. Hence, $E_{\phi(k)}$ converges to 0, which contradicts $E_{\phi(k)}\to+\infty$. Therefore, $E_k$ is bounded. Consider now any convergent subsequence of $(E_k)_{k\ge1}$. It can be written as $(E_{\phi(k)})_{k\ge1}$ with $\phi(k)$ increasing, $\phi(k)\to+\infty$, and $E_{\phi(k)}\to e < +\infty$ by construction. Since $\bar u_k$ is bounded, one can further choose the subsequence so that $\bar u_{\phi(k)}$ converges to some element $\bar u$. Using again Equation (15) and the fact that $L$ is Hadamard differentiable, we deduce that $\bar E_{\phi(k)}$ must converge to 0 and, by definition of $\bar u_{\phi(k)}$, that $\Vert\bar u_{\phi(k)} - \bar u\Vert$ converges to 0. Therefore, it follows that $E_{\phi(k)}\to0$, so that $e = 0$. We have thus shown that $(E_k)_{k\ge1}$ is a bounded sequence whose every convergent subsequence converges to 0. Therefore, $E_k$ must converge to 0, which concludes the proof.

C.2 Proof of the functional implicit differentiation theorem

Proof of Theorem 2.1. The proof strategy consists in establishing the existence and uniqueness of the solution map $\omega\mapsto h^\star_\omega$, deriving a candidate Jacobian for it, and then proving that $\omega\mapsto h^\star_\omega$ is differentiable.

Existence and uniqueness of a solution map $\omega\mapsto h^\star_\omega$. Let $\omega\in\Omega$ be fixed. The map $h\mapsto\mathcal{L}_{in}(\omega, h)$ is lower semi-continuous since it is Fréchet differentiable by assumption. It is also strongly convex. Therefore, it admits a unique minimizer $h^\star_\omega$ [Bauschke and Combettes, 2011, Corollary 11.17]. We then conclude that the map $\omega\mapsto h^\star_\omega$ is well-defined on $\Omega$.

Strong convexity inequalities. We provide two inequalities that will be used for proving differentiability of the map $\omega\mapsto h^\star_\omega$. The map $h\mapsto\mathcal{L}_{in}(\omega, h)$ is Fréchet differentiable on $\mathcal{H}$ and $\mu$-strongly convex (with $\mu$ positive by assumption). Hence, for all $h_1, h_2$ in $\mathcal{H}$ the following quadratic lower-bound holds:
$$\mathcal{L}_{in}(\omega, h_2) \ge \mathcal{L}_{in}(\omega, h_1) + \langle\nabla_h\mathcal{L}_{in}(\omega, h_1), h_2 - h_1\rangle_{\mathcal H} + \frac{\mu}{2}\Vert h_2 - h_1\Vert^2_{\mathcal H}. \tag{16}$$
From the inequality above, we can also deduce that $h\mapsto\nabla_h\mathcal{L}_{in}(\omega, h)$ is a $\mu$-strongly monotone operator:
$$\langle\nabla_h\mathcal{L}_{in}(\omega, h_1) - \nabla_h\mathcal{L}_{in}(\omega, h_2), h_1 - h_2\rangle_{\mathcal H} \ge \mu\Vert h_1 - h_2\Vert^2_{\mathcal H}. \tag{17}$$
Finally, note that, since $h\mapsto\mathcal{L}_{in}(\omega, h)$ is Fréchet differentiable, its gradient must vanish at the optimum $h^\star_\omega$, i.e.:
$$\nabla_h\mathcal{L}_{in}(\omega, h^\star_\omega) = 0. \tag{18}$$

Candidate Jacobian for $\omega\mapsto h^\star_\omega$. Let $\omega$ be in $\Omega$. Using Equation (17) with $h_1 = h + tv$ and $h_2 = h$ for some $h, v\in\mathcal{H}$ and a non-zero real number $t$, we get:
$$\frac{1}{t}\langle\nabla_h\mathcal{L}_{in}(\omega, h + tv) - \nabla_h\mathcal{L}_{in}(\omega, h), v\rangle_{\mathcal H} \ge \mu\Vert v\Vert^2. \tag{19}$$
By assumption, $h\mapsto\nabla_h\mathcal{L}_{in}(\omega, h)$ is Hadamard differentiable and, a fortiori, directionally differentiable. Thus, by taking the limit as $t\to0$, it follows that:
$$\langle\nabla^2_h\mathcal{L}_{in}(\omega, h)v, v\rangle_{\mathcal H} \ge \mu\Vert v\Vert^2. \tag{20}$$
Hence, $\nabla^2_h\mathcal{L}_{in}(\omega, h) : \mathcal{H}\to\mathcal{H}$ defines a coercive quadratic form. By definition of Hadamard differentiability, it is also bounded. Therefore, it follows from the Lax-Milgram theorem [Debnath and Mikusinski, 2005, Theorem 4.3.16] that $\nabla^2_h\mathcal{L}_{in}(\omega, h)$ is invertible with a bounded inverse.
Moreover, recalling that $B_\omega = \nabla_{\omega,h}\mathcal{L}_{in}(\omega, h^\star_\omega)$ is a bounded operator, its adjoint $(B_\omega)^*$ is also a bounded operator from $\Omega$ to $\mathcal{H}$. Therefore, we can define $J = -C_\omega^{-1}(B_\omega)^*$, which is a bounded linear map from $\Omega$ to $\mathcal{H}$ and will be our candidate Jacobian.

Differentiability of $\omega\mapsto h^\star_\omega$. By the strong convexity assumption (locally in $\omega$), there exists an open ball $B$ centered at the origin that is small enough so that we can ensure the existence of $\mu>0$ for which $h\mapsto\mathcal{L}_{in}(\omega+\epsilon, h)$ is $\mu$-strongly convex for all $\epsilon\in B$. For a given $\epsilon\in B$, we use the $\mu$-strong monotonicity of $h\mapsto\nabla_h\mathcal{L}_{in}(\omega+\epsilon, h)$ (17) at the points $h^\star_\omega + J\epsilon$ and $h^\star_{\omega+\epsilon}$ to get:
$$\begin{aligned}
\mu\Vert h^\star_{\omega+\epsilon} - h^\star_\omega - J\epsilon\Vert^2 &\le \big\langle\nabla_h\mathcal{L}_{in}(\omega+\epsilon, h^\star_{\omega+\epsilon}) - \nabla_h\mathcal{L}_{in}(\omega+\epsilon, h^\star_\omega + J\epsilon),\, h^\star_{\omega+\epsilon} - h^\star_\omega - J\epsilon\big\rangle_{\mathcal H}\\
&= -\big\langle\nabla_h\mathcal{L}_{in}(\omega+\epsilon, h^\star_\omega + J\epsilon),\, h^\star_{\omega+\epsilon} - h^\star_\omega - J\epsilon\big\rangle_{\mathcal H}\\
&\le \Vert\nabla_h\mathcal{L}_{in}(\omega+\epsilon, h^\star_\omega + J\epsilon)\Vert_{\mathcal H}\,\Vert h^\star_{\omega+\epsilon} - h^\star_\omega - J\epsilon\Vert_{\mathcal H},
\end{aligned}$$
where the second line follows from the optimality of $h^\star_{\omega+\epsilon}$ (Equation (18)), and the last line uses the Cauchy-Schwarz inequality. The above inequality allows us to deduce that:
$$\Vert h^\star_{\omega+\epsilon} - h^\star_\omega - J\epsilon\Vert \le \frac{1}{\mu}\Vert\nabla_h\mathcal{L}_{in}(\omega+\epsilon, h^\star_\omega + J\epsilon)\Vert_{\mathcal H}. \tag{21}$$
Moreover, since $\nabla_h\mathcal{L}_{in}$ is Hadamard differentiable, by Proposition C.2 it follows that:
$$\nabla_h\mathcal{L}_{in}(\omega+\epsilon, h^\star_\omega + J\epsilon) = \underbrace{\nabla_h\mathcal{L}_{in}(\omega, h^\star_\omega)}_{=0} + d_{(\omega,h)}\nabla_h\mathcal{L}_{in}(\omega, h^\star_\omega)(\epsilon, J\epsilon) + o(\Vert\epsilon\Vert), \tag{22}$$
where the first term vanishes as a consequence of Equation (18), since $h^\star_\omega$ is a minimizer of $h\mapsto\mathcal{L}_{in}(\omega, h)$. Additionally, note that the differential $d_{(\omega,h)}\nabla_h\mathcal{L}_{in}(\omega, h) : \Omega\times\mathcal{H}\to\mathcal{H}$ acts on elements $(\epsilon, g)\in\Omega\times\mathcal{H}$ as follows:
$$d_{(\omega,h)}\nabla_h\mathcal{L}_{in}(\omega, h)(\epsilon, g) = \nabla^2_h\mathcal{L}_{in}(\omega, h)g + \big(\nabla_{\omega,h}\mathcal{L}_{in}(\omega, h)\big)^*\epsilon, \tag{23}$$
where $\nabla^2_h\mathcal{L}_{in}(\omega, h) : \mathcal{H}\to\mathcal{H}$ and $\nabla_{\omega,h}\mathcal{L}_{in}(\omega, h) : \mathcal{H}\to\Omega$ are bounded operators and $(\nabla_{\omega,h}\mathcal{L}_{in}(\omega, h))^*$ denotes the adjoint of $\nabla_{\omega,h}\mathcal{L}_{in}(\omega, h)$. By definition of $J$, and using Equation (23), it follows that:
$$d_{(\omega,h)}\nabla_h\mathcal{L}_{in}(\omega, h^\star_\omega)(\epsilon, J\epsilon) = C_\omega J\epsilon + (B_\omega)^*\epsilon = 0.$$
Therefore, combining Equation (22) with the above equality yields:
$$\nabla_h\mathcal{L}_{in}(\omega+\epsilon, h^\star_\omega + J\epsilon) = o(\Vert\epsilon\Vert). \tag{24}$$
Finally, combining Equation (21) with the above equality directly shows that $\Vert h^\star_{\omega+\epsilon} - h^\star_\omega - J\epsilon\Vert \le \frac{1}{\mu}\,o(\Vert\epsilon\Vert)$. We have shown that $\omega\mapsto h^\star_\omega$ is differentiable with differential $J = -C_\omega^{-1}(B_\omega)^*$, whose adjoint is the Jacobian map $\nabla_\omega h^\star_\omega = -B_\omega C_\omega^{-1}$.

C.3 Proof of the functional adjoint sensitivity in Proposition 2.2

Proof of Proposition 2.2. We use the assumptions and definitions from Proposition 2.2 and express the gradient $\nabla F(\omega)$ using the chain rule:
$$\nabla F(\omega) = \nabla_\omega\mathcal{L}_{out}(\omega, h^\star_\omega) + [\nabla_\omega h^\star_\omega]\,\nabla_h\mathcal{L}_{out}(\omega, h^\star_\omega).$$
The Jacobian $\nabla_\omega h^\star_\omega$ is the solution of a linear system obtained by applying Theorem 2.1: $\nabla_\omega h^\star_\omega = -B_\omega C_\omega^{-1}$. We write $g_\omega = \nabla_\omega\mathcal{L}_{out}(\omega, h^\star_\omega)$ and $d_\omega = \nabla_h\mathcal{L}_{out}(\omega, h^\star_\omega)$. It follows that the gradient $\nabla F(\omega)$ can be expressed as:
$$\nabla F(\omega) = g_\omega + [\nabla_\omega h^\star_\omega]\,d_\omega = g_\omega + B_\omega a^\star_\omega, \qquad a^\star_\omega := -C_\omega^{-1}d_\omega.$$
In other words, the implicit gradient $\nabla F(\omega)$ can be expressed using the adjoint function $a^\star_\omega$, which is an element of $\mathcal{H}$ and can be defined as the solution of the following functional regression problem:
$$a^\star_\omega = \arg\min_{a\in\mathcal{H}}\;\mathcal{L}_{adj}(\omega, a) := \frac{1}{2}\langle a, C_\omega a\rangle_{\mathcal H} + \langle a, d_\omega\rangle_{\mathcal H}.$$
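To make the adjoint formula above concrete, the following toy JAX sketch checks it numerically in a finite-dimensional setting where $\mathcal{H} = \mathbb{R}^n$, so that $C_\omega$ and the adjoint solve reduce to ordinary linear algebra. The quadratic inner objective, the particular choice of $b(\omega)$, and all names are illustrative assumptions introduced only for this check, not part of the paper's method.

```python
# Toy finite-dimensional check of Proposition 2.2 (H = R^n), assuming
# L_in(w, h) = 0.5 h^T C h + h^T b(w) with C positive definite, so that
# h*_w = -C^{-1} b(w), C_w = C, and grad_h L_in = C h + b(w).
import jax
import jax.numpy as jnp

n = 5
C = 2.0 * jnp.eye(n)                              # Hessian C_w (constant here)
def b(w):                                         # cross term, depends on w
    return jnp.sin(w) * jnp.arange(1.0, n + 1.0)

def h_star(w):                                    # inner solution
    return -jnp.linalg.solve(C, b(w))

def L_out(w, h):                                  # outer objective
    return 0.5 * jnp.sum((h - 1.0) ** 2) + 0.5 * w ** 2

def F(w):                                         # total objective
    return L_out(w, h_star(w))

w = 0.3
d_w = jax.grad(L_out, argnums=1)(w, h_star(w))    # d_w = grad_h L_out
a_star = -jnp.linalg.solve(C, d_w)                # adjoint: a* = -C^{-1} d_w
g_w = jax.grad(L_out, argnums=0)(w, h_star(w))    # g_w = grad_w L_out
grad_F = g_w + jax.jacfwd(b)(w) @ a_star          # B_w a* = (db/dw)^T a*
print(jnp.allclose(jax.grad(F)(w), grad_F))       # expect: True
```

Here $B_\omega$ reduces to $\partial_\omega b(\omega)$ since $\nabla_h\mathcal{L}_{in}(\omega, h) = Ch + b(\omega)$, and the adjoint route matches autodiff through the inner solution.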
D Functional Adjoint Sensitivity Results in L2 Spaces

In this section, we provide the full proofs of Proposition 2.3. We start by stating the assumptions needed on the data distributions and the point-wise losses in Appendix D.1, then provide some differentiation results in Appendix D.2, and conclude with the main proofs in Appendix D.3.

D.1 Assumptions

Assumptions on P and Q.
(A) P and Q admit finite second moments.
(B) The marginal of $X$ w.r.t. Q admits a Radon-Nikodym derivative $r(x)$ w.r.t. the marginal of $X$ w.r.t. P, i.e., $dQ(x, y) = r(x)\,dP(x, y)$. Additionally, $r(x)$ is upper-bounded by a positive constant $M$.

Assumptions on $\ell_{in}$.
(C) For any $\omega\in\Omega$, there exist a positive constant $\mu$ and a neighborhood $B$ of $\omega$ for which $\ell_{in}$ is $\mu$-strongly convex in its second argument for all $(\omega', x, y)\in B\times\mathcal{X}\times\mathcal{Y}$.
(D) For any $\omega\in\Omega$, $\mathbb{E}_P\big[|\ell_{in}(\omega, 0, x, y)| + \Vert\nabla_v\ell_{in}(\omega, 0, x, y)\Vert^2\big] < +\infty$.
(E) $v\mapsto\ell_{in}(\omega, v, x, y)$ is continuously differentiable for all $(\omega, x, y)\in\Omega\times\mathcal{X}\times\mathcal{Y}$.
(F) For any fixed $\omega\in\Omega$, there exist a constant $L$ and a neighborhood $B$ of $\omega$ s.t. $v\mapsto\ell_{in}(\omega', v, x, y)$ is $L$-smooth for all $(\omega', x, y)\in B\times\mathcal{X}\times\mathcal{Y}$.
(G) $(\omega, v)\mapsto\nabla_v\ell_{in}(\omega, v, x, y)$ is continuously differentiable on $\Omega\times\mathcal{V}$ for all $(x, y)\in\mathcal{X}\times\mathcal{Y}$.
(H) For any $\omega\in\Omega$, there exist a positive constant $C$ and a neighborhood $B$ of $\omega$ s.t. for all $(\omega', x, y)\in B\times\mathcal{X}\times\mathcal{Y}$:
$$\Vert\nabla_{\omega,v}\ell_{in}(\omega', 0, x, y)\Vert \le C(1 + \Vert x\Vert + \Vert y\Vert). \tag{25}$$
(I) For any $\omega\in\Omega$, there exist a positive constant $C$ and a neighborhood $B$ of $\omega$ s.t. for all $(\omega', v_1, v_2, x, y)\in B\times\mathcal{V}\times\mathcal{V}\times\mathcal{X}\times\mathcal{Y}$:
$$\Vert\nabla_{\omega,v}\ell_{in}(\omega', v_1, x, y) - \nabla_{\omega,v}\ell_{in}(\omega', v_2, x, y)\Vert \le C\Vert v_1 - v_2\Vert. \tag{26}$$

Assumptions on $\ell_{out}$.
(J) For any $\omega\in\Omega$, $\mathbb{E}_Q[|\ell_{out}(\omega, 0, x, y)|] < +\infty$.
(K) $(\omega, v)\mapsto\ell_{out}(\omega, v, x, y)$ is jointly continuously differentiable on $\Omega\times\mathcal{V}$ for all $(x, y)\in\mathcal{X}\times\mathcal{Y}$.
(L) For any $\omega\in\Omega$, there exist a neighborhood $B$ of $\omega$ and a positive constant $C$ s.t. for all $(\omega', v, v', x, y)\in B\times\mathcal{V}\times\mathcal{V}\times\mathcal{X}\times\mathcal{Y}$:
$$\begin{aligned}
\Vert\nabla_\omega\ell_{out}(\omega', v, x, y) - \nabla_\omega\ell_{out}(\omega', v', x, y)\Vert &\le C(1 + \Vert v\Vert + \Vert v'\Vert + \Vert x\Vert + \Vert y\Vert)\Vert v - v'\Vert,\\
\Vert\nabla_v\ell_{out}(\omega', v, x, y) - \nabla_v\ell_{out}(\omega', v', x, y)\Vert &\le C\Vert v - v'\Vert,\\
\Vert\nabla_v\ell_{out}(\omega', v, x, y)\Vert &\le C(1 + \Vert v\Vert + \Vert x\Vert + \Vert y\Vert),\\
\Vert\nabla_\omega\ell_{out}(\omega', v, x, y)\Vert &\le C(1 + \Vert v\Vert^2 + \Vert x\Vert^2 + \Vert y\Vert^2).
\end{aligned}$$

Example. Here we consider the squared error between two vectors $v, z$ in $\mathcal{V}$. Given a map $(\omega, x, y)\mapsto f_\omega(x, y)$ defined over $\Omega\times\mathcal{X}\times\mathcal{Y}$ and taking values in $\mathcal{V}$, we define the following point-wise objective:
$$\ell(\omega, v, x, y) := \frac{1}{2}\Vert v - z\Vert^2, \qquad z = f_\omega(x, y). \tag{27}$$
We assume that for any $\omega\in\Omega$, there exists a constant $C>0$ such that for all $\omega'$ in a neighborhood of $\omega$ and all $(x, y)\in\mathcal{X}\times\mathcal{Y}$, the following growth assumption holds:
$$\Vert f_{\omega'}(x, y)\Vert + \Vert\nabla_\omega f_{\omega'}(x, y)\Vert \le C(1 + \Vert x\Vert + \Vert y\Vert). \tag{28}$$
This growth assumption is weak in the context of neural networks with smooth activations, as discussed by Bińkowski et al. [2018, Appendix C.4].

Proposition D.1. Assume that the map $\omega\mapsto f_\omega(x, y)$ is continuously differentiable for any $(x, y)\in\mathcal{X}\times\mathcal{Y}$, and that Equation (28) holds. Additionally, assume that P and Q admit finite second-order moments. Then the point-wise objective $\ell$ in Equation (27) satisfies Assumptions (C) to (L).

Proof. We show that each of the assumptions is satisfied by the classical squared error objective.

Assumption (C): the squared error is 1-strongly convex in $v$, since $\nabla^2_v\ell = I$. Hence, the strong convexity assumption holds with $\mu = 1$.

Assumption (D): For any $\omega\in\Omega$, we have
$$\mathbb{E}_P\big[|\ell(\omega, 0, x, y)| + \Vert\nabla_v\ell(\omega, 0, x, y)\Vert^2\big] = \mathbb{E}_P\Big[\tfrac{1}{2}\Vert f_\omega(x, y)\Vert^2 + \Vert f_\omega(x, y)\Vert^2\Big] < +\infty,$$
which holds by the growth assumption on $f_\omega(x, y)$ and P having finite second moments.

Assumption (E): With a perturbation $u\in\mathcal{V}$ we have:
$$\ell(\omega, v + u, x, y) = \frac{1}{2}\Vert v - z\Vert^2 + \langle v - z, u\rangle + \frac{1}{2}\Vert u\Vert^2, \qquad z = f_\omega(x, y).$$
The mapping $v\mapsto v - z$ is continuous, thus the assumption holds.

Assumption (F): For any two points $v_1, v_2\in\mathcal{V}$, using the expression $\nabla_v\ell(\omega, v, x, y) = v - z$ with $z = f_\omega(x, y)$, we have:
$$\Vert\nabla_v\ell(\omega, v_1, x, y) - \nabla_v\ell(\omega, v_2, x, y)\Vert = \Vert(v_1 - z) - (v_2 - z)\Vert = \Vert v_1 - v_2\Vert.$$
We see that $\ell$ is $L$-smooth with $L = 1$ and the assumption holds.

Assumption (K): By the differentiability assumption on $f_\omega(x, y)$, with a perturbation $\epsilon\in\Omega$ we can write:
$$f_{\omega+\epsilon}(x, y) = f_\omega(x, y) + \nabla_\omega f_\omega(x, y)\epsilon + o(\Vert\epsilon\Vert).$$
With a perturbation $(\epsilon, u)\in\Omega\times\mathcal{V}$ and substituting $f_{\omega+\epsilon}(x, y)$ with the expression above, we have:
$$\begin{aligned}
\ell(\omega+\epsilon, v+u, x, y) &= \frac{1}{2}\big\Vert(v+u) - \big(f_\omega(x, y) + \nabla_\omega f_\omega(x, y)\epsilon + o(\Vert\epsilon\Vert)\big)\big\Vert^2\\
&= \frac{1}{2}\Vert v - f_\omega(x, y)\Vert^2 + \big\langle\epsilon, \nabla_\omega f_\omega(x, y)^\top(f_\omega(x, y) - v)\big\rangle + \langle u, v - f_\omega(x, y)\rangle + o(\Vert\epsilon\Vert + \Vert u\Vert),
\end{aligned}$$
which allows us to conclude that $(\omega, v)\mapsto\ell(\omega, v, x, y)$ is continuously differentiable on $\Omega\times\mathcal{V}$ for all $(x, y)\in\mathcal{X}\times\mathcal{Y}$, and the assumption holds.

Assumption (G): With a perturbation $(\epsilon, u)\in\Omega\times\mathcal{V}$, using the expression of $\nabla_v\ell(\omega, v, x, y)$, we can write:
$$\nabla_v\ell(\omega+\epsilon, v+u, x, y) = (v+u) - f_{\omega+\epsilon}(x, y) = (v - f_\omega(x, y)) + u - \nabla_\omega f_\omega(x, y)\epsilon + o(\Vert\epsilon\Vert).$$
Since $f_\omega(x, y)$ is continuously differentiable, the assumption holds.

Assumptions (H) and (I): From the expression of $\nabla_v\ell(\omega, v, x, y)$:
$$\nabla_{\omega,v}\ell(\omega, v, x, y) = \nabla_\omega(v - f_\omega(x, y)) = -\nabla_\omega f_\omega(x, y);$$
using the expression above and the growth assumption on $f_{\omega'}(x, y)$, the two assumptions hold.

Assumption (J): For any $\omega\in\Omega$ we have:
$$\mathbb{E}_Q[|\ell(\omega, 0, x, y)|] = \mathbb{E}_Q\Big[\tfrac{1}{2}\Vert f_\omega(x, y)\Vert^2\Big] < +\infty,$$
by the growth assumption on $f_\omega(x, y)$ and Q having finite second moments (Assumption (A)); thus the assumption is verified.

Assumption (L): Using the growth assumption on $f_{\omega'}(x, y)$, we have the following inequalities:
$$\begin{aligned}
\Vert\nabla_\omega\ell(\omega', v, x, y) - \nabla_\omega\ell(\omega', v', x, y)\Vert &= \Vert\nabla_\omega f_{\omega'}(x, y)^\top(v' - v)\Vert \le \Vert\nabla_\omega f_{\omega'}(x, y)\Vert\,\Vert v - v'\Vert\\
&\le C(1 + \Vert v\Vert + \Vert v'\Vert + \Vert x\Vert + \Vert y\Vert)\Vert v - v'\Vert,\\
\Vert\nabla_v\ell(\omega', v, x, y)\Vert &= \Vert v - f_{\omega'}(x, y)\Vert \le \Vert v\Vert + \Vert f_{\omega'}(x, y)\Vert \le C(1 + \Vert v\Vert + \Vert x\Vert + \Vert y\Vert),\\
\Vert\nabla_\omega\ell(\omega', v, x, y)\Vert &= \Vert\nabla_\omega f_{\omega'}(x, y)^\top(f_{\omega'}(x, y) - v)\Vert \le \Vert\nabla_\omega f_{\omega'}(x, y)\Vert\big(\Vert f_{\omega'}(x, y)\Vert + \Vert v\Vert\big)\\
&\le C\big(1 + \Vert v\Vert^2 + \Vert x\Vert^2 + \Vert y\Vert^2\big).
\end{aligned}$$
Combining the above with the $L$-smoothness of $\ell$, we conclude that the assumption holds.

D.2 Differentiability results

The next lemmas show the differentiability of $\mathcal{L}_{out}$, $\mathcal{L}_{in}$ and $\nabla_h\mathcal{L}_{in}$, and will be used to prove Proposition 2.3.

Lemma D.2 (Differentiability of $\mathcal{L}_{in}$ in its second argument). Under Assumptions (D) to (F), the function $h\mapsto\mathcal{L}_{in}(\omega, h)$ is differentiable in $\mathcal{H}$, with partial gradient $\nabla_h\mathcal{L}_{in}(\omega, h)\in\mathcal{H}$ given by:
$$\nabla_h\mathcal{L}_{in}(\omega, h) : \mathcal{X}\to\mathcal{V}, \qquad x\mapsto\mathbb{E}_P[\nabla_v\ell_{in}(\omega, h(x), x, y)\,|\,x].$$

Proof. We decompose the proof into three parts: verifying that $\mathcal{L}_{in}$ is well-defined, identifying a bounded map as a candidate for the differential, and showing that it is the Fréchet differential of $\mathcal{L}_{in}$.

Well-defined objective. Consider $(\omega, h)$ in $\Omega\times\mathcal{H}$. To show that $\mathcal{L}_{in}(\omega, h)$ is well-defined, we need to prove that $\ell_{in}(\omega, h(x), x, y)$ is integrable under P. We use the following inequalities to control $\ell_{in}(\omega, h(x), x, y)$:
$$\begin{aligned}
|\ell_{in}(\omega, h(x), x, y)| &\le |\ell_{in}(\omega, h(x), x, y) - \ell_{in}(\omega, 0, x, y)| + |\ell_{in}(\omega, 0, x, y)|\\
&\le \int_0^1\Vert h(x)\Vert\,\Vert\nabla_v\ell_{in}(\omega, th(x), x, y)\Vert\,dt + |\ell_{in}(\omega, 0, x, y)|\\
&\le \Vert h(x)\Vert\int_0^1\Vert\nabla_v\ell_{in}(\omega, th(x), x, y) - \nabla_v\ell_{in}(\omega, 0, x, y)\Vert\,dt + \Vert h(x)\Vert\,\Vert\nabla_v\ell_{in}(\omega, 0, x, y)\Vert + |\ell_{in}(\omega, 0, x, y)|\\
&\le \frac{L}{2}\Vert h(x)\Vert^2 + \frac{1}{2}\big(\Vert h(x)\Vert^2 + \Vert\nabla_v\ell_{in}(\omega, 0, x, y)\Vert^2\big) + |\ell_{in}(\omega, 0, x, y)|,
\end{aligned}$$
where the first line follows by the triangle inequality, the second by application of the fundamental theorem of calculus since $\ell_{in}$ is differentiable by Assumption (E), the third uses the Cauchy-Schwarz inequality along with a triangle inequality, and the last line follows from $\ell_{in}$ being $L$-smooth in its second argument, locally in $\omega$ and uniformly in $x$ and $y$, by Assumption (F), together with Young's inequality. Taking the expectation under P yields:
$$|\mathcal{L}_{in}(\omega, h)| \le \mathbb{E}_P[|\ell_{in}(\omega, h(x), x, y)|] \le \frac{L+1}{2}\Vert h\Vert^2_{\mathcal H} + \mathbb{E}_P\big[\Vert\nabla_v\ell_{in}(\omega, 0, x, y)\Vert^2 + |\ell_{in}(\omega, 0, x, y)|\big] < +\infty,$$
where $\Vert h\Vert_{\mathcal H}$ is finite since $h\in\mathcal{H}$, and the expectations under P of $\Vert\nabla_v\ell_{in}(\omega, 0, x, y)\Vert^2$ and $|\ell_{in}(\omega, 0, x, y)|$ are finite by Assumption (D). This shows that $\mathcal{L}_{in}(\omega, h)$ is well-defined on $\Omega\times\mathcal{H}$.

Candidate differential. Fix $(\omega, h)$ in $\Omega\times\mathcal{H}$ and consider the following linear form $d_{in}$ on $\mathcal{H}$:
$$d_{in}g := \mathbb{E}_P\big[g(x)^\top\nabla_v\ell_{in}(\omega, h(x), x, y)\big], \qquad g\in\mathcal{H}.$$
We need to show that it is a bounded form.
To this end, we will show that $d_{in}$ is a scalar product with some vector $D_{in}$ in $\mathcal{H}$. The following equalities hold:
$$d_{in}g = \mathbb{E}_P\big[g(x)^\top\nabla_v\ell_{in}(\omega, h(x), x, y)\big] = \mathbb{E}_P\big[g(x)^\top\mathbb{E}_P[\nabla_v\ell_{in}(\omega, h(x), x, y)|x]\big] = \mathbb{E}_P\big[g(x)^\top D_{in}(x)\big],$$
where the second equality follows by the tower property of conditional expectations, and where we define $D_{in}(x) := \mathbb{E}_P[\nabla_v\ell_{in}(\omega, h(x), x, y)|x]$ in the last one. $D_{in}$ is the candidate representation of $d_{in}$ in $\mathcal{H}$. We simply need to check that $D_{in}$ is an element of $\mathcal{H}$. To see this, we use the following upper-bounds:
$$\begin{aligned}
\mathbb{E}_P\big[\Vert D_{in}(x)\Vert^2\big] &\le \mathbb{E}_P\big[\mathbb{E}_P[\Vert\nabla_v\ell_{in}(\omega, h(x), x, y)\Vert^2\,|\,x]\big] = \mathbb{E}_P\big[\Vert\nabla_v\ell_{in}(\omega, h(x), x, y)\Vert^2\big]\\
&\le 2\mathbb{E}_P\big[\Vert\nabla_v\ell_{in}(\omega, h(x), x, y) - \nabla_v\ell_{in}(\omega, 0, x, y)\Vert^2\big] + 2\mathbb{E}_P\big[\Vert\nabla_v\ell_{in}(\omega, 0, x, y)\Vert^2\big]\\
&\le 2L^2\mathbb{E}_P\big[\Vert h(x)\Vert^2\big] + 2\mathbb{E}_P\big[\Vert\nabla_v\ell_{in}(\omega, 0, x, y)\Vert^2\big] < +\infty.
\end{aligned}$$
The first inequality is an application of Jensen's inequality by convexity of the squared norm; the equality follows by the tower property of conditional expectations; the next inequality follows by the triangle inequality and Jensen's inequality applied to the square function; and the last line uses that $\ell_{in}$ is $L$-smooth in its second argument, locally in $\omega$ and uniformly in $(x, y)$, by Assumption (F). Since $h$ is square-integrable under P by construction and $\nabla_v\ell_{in}(\omega, 0, x, y)$ is also square-integrable by Assumption (D), we deduce from the above upper-bounds that $D_{in}(x)$ must also be square-integrable, and thus an element of $\mathcal{H}$. Therefore, we have shown that $d_{in}$ is a continuous linear form admitting the following representation:
$$d_{in}g = \langle D_{in}, g\rangle_{\mathcal H}. \tag{29}$$

Differentiability of $h\mapsto\mathcal{L}_{in}(\omega, h)$. To prove differentiability, we simply control the first-order error $E(g)$ defined as:
$$E(g) := |\mathcal{L}_{in}(\omega, h+g) - \mathcal{L}_{in}(\omega, h) - d_{in}g|. \tag{30}$$
For a given $g\in\mathcal{H}$, the following inequalities hold:
$$\begin{aligned}
E(g) &= \left|\mathbb{E}_P\int_0^1 g(x)^\top\big(\nabla_v\ell_{in}(\omega, h(x)+tg(x), x, y) - \nabla_v\ell_{in}(\omega, h(x), x, y)\big)\,dt\right|\\
&\le \mathbb{E}_P\int_0^1\big|g(x)^\top\big(\nabla_v\ell_{in}(\omega, h(x)+tg(x), x, y) - \nabla_v\ell_{in}(\omega, h(x), x, y)\big)\big|\,dt \le \frac{L}{2}\mathbb{E}_P\big[\Vert g(x)\Vert^2\big] = \frac{L}{2}\Vert g\Vert^2_{\mathcal H},
\end{aligned}$$
where the first equality follows by application of the fundamental theorem of calculus, since $\ell_{in}$ is differentiable in its second argument by Assumption (E); the first inequality follows by Jensen's inequality; and the last step uses the Cauchy-Schwarz inequality together with the fact that $\nabla_v\ell_{in}(\omega, \cdot, x, y)$ is $L$-Lipschitz locally in $\omega$ and uniformly in $x$ and $y$ by Assumption (F). Therefore, we have shown that $E(g) = o(\Vert g\Vert_{\mathcal H})$, which precisely means that $h\mapsto\mathcal{L}_{in}(\omega, h)$ is differentiable with differential $d_{in}$. Moreover, $D_{in}$ is the partial gradient of $\mathcal{L}_{in}(\omega, h)$ in the second variable:
$$\nabla_h\mathcal{L}_{in}(\omega, h) = D_{in} = \big(x\mapsto\mathbb{E}_P[\nabla_v\ell_{in}(\omega, h(x), x, y)|x]\big).$$

Lemma D.3 (Differentiability of $\mathcal{L}_{out}$). Under Assumptions (A), (B) and (J) to (L), $\mathcal{L}_{out}$ is jointly differentiable in $\omega$ and $h$. Moreover, its partial derivatives $\nabla_\omega\mathcal{L}_{out}(\omega, h)$ and $\nabla_h\mathcal{L}_{out}(\omega, h)$ are elements of $\Omega$ and $\mathcal{H}$ given by:
$$\nabla_\omega\mathcal{L}_{out}(\omega, h) = \mathbb{E}_Q[\nabla_\omega\ell_{out}(\omega, h(x), x, y)], \qquad \nabla_h\mathcal{L}_{out}(\omega, h) = \big(x\mapsto r(x)\,\mathbb{E}_Q[\nabla_v\ell_{out}(\omega, h(x), x, y)|x]\big). \tag{31}$$

Proof. We follow a procedure similar to Lemma D.2, decomposing the proof into three steps: verifying that the objective $\mathcal{L}_{out}$ is well-defined, identifying a candidate for the differential, and proving that it is the differential of $\mathcal{L}_{out}$.

Well-definiteness of the objective. Let $(\omega, h)$ be in $\Omega\times\mathcal{H}$. First, note that by Assumption (B), we have that
$$\mathbb{E}_Q\big[\Vert h(x)\Vert^2\big] = \mathbb{E}_P\big[\Vert h(x)\Vert^2 r(x)\big] \le M\Vert h\Vert^2_{\mathcal H} < +\infty. \tag{32}$$
The next inequalities control the growth of $\ell_{out}$:
$$\begin{aligned}
|\ell_{out}(\omega, h(x), x, y)| &\le |\ell_{out}(\omega, 0, x, y)| + |\ell_{out}(\omega, h(x), x, y) - \ell_{out}(\omega, 0, x, y)|\\
&\le |\ell_{out}(\omega, 0, x, y)| + \int_0^1\Vert h(x)\Vert\,\Vert\nabla_v\ell_{out}(\omega, th(x), x, y)\Vert\,dt\\
&\le |\ell_{out}(\omega, 0, x, y)| + \Vert h(x)\Vert\int_0^1\Vert\nabla_v\ell_{out}(\omega, th(x), x, y)\Vert\,dt\\
&\le |\ell_{out}(\omega, 0, x, y)| + C\Vert h(x)\Vert(1 + \Vert h(x)\Vert + \Vert x\Vert + \Vert y\Vert)\\
&\le |\ell_{out}(\omega, 0, x, y)| + C\big(1 + 3\Vert h(x)\Vert^2 + \Vert x\Vert^2 + \Vert y\Vert^2\big).
\end{aligned}$$
The first line is due to the triangle inequality, while the second follows by differentiability of $\ell_{out}$ in its second argument (Assumption (K)) together with the fundamental theorem of calculus. The third line follows by the Cauchy-Schwarz inequality, while the fourth uses that $\nabla_v\ell_{out}$ has at most linear growth in its last three arguments by Assumption (L). Using the above inequalities, we get the following upper-bound on $\mathcal{L}_{out}$:
$$|\mathcal{L}_{out}(\omega, h)| \le \mathbb{E}_Q[|\ell_{out}(\omega, 0, x, y)|] + C\big(1 + 3\mathbb{E}_Q[\Vert h(x)\Vert^2] + \mathbb{E}_Q[\Vert x\Vert^2 + \Vert y\Vert^2]\big) < +\infty.$$
In the above upper-bound, $\mathbb{E}_Q[\Vert h(x)\Vert^2]$ is finite by Equation (32). Additionally, $\mathbb{E}_Q[\Vert x\Vert^2 + \Vert y\Vert^2]$ is finite since Q has finite second moments by Assumption (A), while $\mathbb{E}_Q[|\ell_{out}(\omega, 0, x, y)|]$ is also finite by Assumption (J). Therefore, $\mathcal{L}_{out}$ is well-defined over $\Omega\times\mathcal{H}$.

Candidate differential. Fix $(\omega, h)$ in $\Omega\times\mathcal{H}$ and define the following linear form:
$$d_{out}(\epsilon, g) := \epsilon^\top\mathbb{E}_Q[\nabla_\omega\ell_{out}(\omega, h(x), x, y)] + \mathbb{E}_Q\big[g(x)^\top\nabla_v\ell_{out}(\omega, h(x), x, y)\big].$$
Define $D_{out} = (D_\omega, D_h)$ to be:
$$D_\omega := \mathbb{E}_Q[\nabla_\omega\ell_{out}(\omega, h(x), x, y)], \qquad D_h := \big(x\mapsto r(x)\,\mathbb{E}_Q[\nabla_v\ell_{out}(\omega, h(x), x, y)|x]\big).$$
By an argument similar to the one in Lemma D.2, we see that $d_{out}(\epsilon, g) = \langle g, D_h\rangle_{\mathcal H} + \epsilon^\top D_\omega$. We now need to show that $D_\omega$ and $D_h$ are well-defined elements of $\Omega$ and $\mathcal{H}$.

Square integrability of $D_h$. We use the following upper-bounds:
$$\begin{aligned}
\mathbb{E}_P\big[\Vert D_h(x)\Vert^2\big] &\le \mathbb{E}_P\big[r(x)^2\,\mathbb{E}_Q[\Vert\nabla_v\ell_{out}(\omega, h(x), x, y)\Vert\,|x]^2\big] \le \mathbb{E}_P\big[r(x)^2\,\mathbb{E}_Q[\Vert\nabla_v\ell_{out}(\omega, h(x), x, y)\Vert^2|x]\big]\\
&\le M\,\mathbb{E}_P\big[r(x)\,\mathbb{E}_Q[\Vert\nabla_v\ell_{out}(\omega, h(x), x, y)\Vert^2|x]\big] = M\,\mathbb{E}_Q\big[\Vert\nabla_v\ell_{out}(\omega, h(x), x, y)\Vert^2\big]\\
&\le 4MC^2\big(1 + \mathbb{E}_Q[\Vert h(x)\Vert^2] + \mathbb{E}_Q[\Vert x\Vert^2 + \Vert y\Vert^2]\big).
\end{aligned}$$
The first inequality is an application of Jensen's inequality by convexity of the norm, while the second one is an application of the Cauchy-Schwarz inequality. The third line uses that $r(x)$ is upper-bounded by a constant $M$ by Assumption (B), and the equality follows from the tower property of conditional expectations together with $dQ = r\,dP$. Finally, the last line follows by Assumption (L), which ensures that $\nabla_v\ell_{out}$ has at most linear growth in its last three arguments. By Equation (32), we have that $\mathbb{E}_Q[\Vert h(x)\Vert^2] < +\infty$. Moreover, since Q has finite second-order moments by Assumption (A), we also have that $\mathbb{E}_Q[\Vert x\Vert^2 + \Vert y\Vert^2] < +\infty$. We therefore conclude that $\mathbb{E}_P[\Vert D_h(x)\Vert^2]$ is finite, which ensures that $D_h$ belongs to $\mathcal{H}$.

Well-definiteness of $D_\omega$. To show that $D_\omega$ is well-defined, we need to prove that $(x, y)\mapsto\nabla_\omega\ell_{out}(\omega, h(x), x, y)$ is integrable under Q. By Assumption (L), we know that $\nabla_\omega\ell_{out}$ has at most quadratic growth in its last three arguments, so that the following inequality holds:
$$\Vert\nabla_\omega\ell_{out}(\omega, h(x), x, y)\Vert \le C\big(1 + \Vert h(x)\Vert^2 + \Vert x\Vert^2 + \Vert y\Vert^2\big).$$
We can directly conclude by taking the expectation under Q in the above inequality, recalling that $\mathbb{E}_Q[\Vert h(x)\Vert^2]$ is finite by Equation (32) and that Q has finite second-order moments by Assumption (A).

Differentiability of $\mathcal{L}_{out}$. Since differentiability is a local notion, we may assume without loss of generality that $\Vert\epsilon\Vert^2 + \Vert g\Vert^2_{\mathcal H} \le 1$. Introduce the functions $\Delta_1, \bar\Delta_1, \Delta_2, \bar\Delta_2$ defined over $\Omega\times\mathcal{H}\times\mathcal{X}\times\mathcal{Y}\times[0,1]$ as follows:
$$\begin{aligned}
\Delta_1(\epsilon, g, x, y, t) &:= \nabla_v\ell_{out}(\omega+t\epsilon, h(x)+tg(x), x, y) - \nabla_v\ell_{out}(\omega+t\epsilon, h(x), x, y),\\
\bar\Delta_1(\epsilon, g, x, y, t) &:= \nabla_v\ell_{out}(\omega+t\epsilon, h(x), x, y) - \nabla_v\ell_{out}(\omega, h(x), x, y),\\
\Delta_2(\epsilon, g, x, y, t) &:= \nabla_\omega\ell_{out}(\omega+t\epsilon, h(x)+tg(x), x, y) - \nabla_\omega\ell_{out}(\omega+t\epsilon, h(x), x, y),\\
\bar\Delta_2(\epsilon, g, x, y, t) &:= \nabla_\omega\ell_{out}(\omega+t\epsilon, h(x), x, y) - \nabla_\omega\ell_{out}(\omega, h(x), x, y).
\end{aligned}$$
We consider the first-order error $E(\epsilon, g)$, which admits the following upper-bounds:
$$\begin{aligned}
E(\epsilon, g) &:= |\mathcal{L}_{out}(\omega+\epsilon, h+g) - \mathcal{L}_{out}(\omega, h) - d_{out}(\epsilon, g)|\\
&\le \mathbb{E}_Q\int_0^1\big|g(x)^\top(\Delta_1+\bar\Delta_1)(\epsilon, g, x, y, t) + \epsilon^\top(\Delta_2+\bar\Delta_2)(\epsilon, g, x, y, t)\big|\,dt\\
&\le \mathbb{E}_Q\int_0^1\Vert g(x)\Vert\big(\Vert\Delta_1\Vert + \Vert\bar\Delta_1\Vert\big)\,dt + \mathbb{E}_Q\int_0^1\Vert\epsilon\Vert\big(\Vert\Delta_2\Vert + \Vert\bar\Delta_2\Vert\big)\,dt\\
&\le \mathbb{E}_Q[\Vert g(x)\Vert^2]^{\frac12}\Big(\underbrace{\mathbb{E}_Q\Big[\int_0^1\Vert\Delta_1\Vert^2 dt\Big]^{\frac12}}_{A_1(\epsilon, g)} + \underbrace{\mathbb{E}_Q\Big[\int_0^1\Vert\bar\Delta_1\Vert^2 dt\Big]^{\frac12}}_{A_2(\epsilon, g)}\Big) + \Vert\epsilon\Vert\Big(\underbrace{\mathbb{E}_Q\int_0^1\Vert\Delta_2\Vert\,dt}_{A_3(\epsilon, g)} + \underbrace{\mathbb{E}_Q\int_0^1\Vert\bar\Delta_2\Vert\,dt}_{A_4(\epsilon, g)}\Big)\\
&\le M^{\frac12}\Vert g\Vert_{\mathcal H}\big(A_1(\epsilon, g) + A_2(\epsilon, g)\big) + \Vert\epsilon\Vert\big(A_3(\epsilon, g) + A_4(\epsilon, g)\big).
\end{aligned}$$
The second line uses the differentiability of $\ell_{out}$ (Assumption (K)), the third uses the triangle inequality, and the fourth line uses the Cauchy-Schwarz inequality. Finally, the last line uses Equation (32). We simply need to show that each of the terms $A_1, A_2, A_3$ and $A_4$ converges to 0 as $\epsilon$ and $g$ converge to 0. We treat each term separately.

Controlling $A_1$ and $A_3$. For $\epsilon$ small enough so that Assumption (L) holds, the following upper-bounds on $A_1$ and $A_3$ hold:
$$\begin{aligned}
A_1(\epsilon, g) &\le C\,\mathbb{E}_Q[\Vert g(x)\Vert^2]^{\frac12} \le CM^{\frac12}\Vert g\Vert_{\mathcal H},\\
A_3(\epsilon, g) &\le C\,\mathbb{E}_Q\big[(1 + \Vert h(x)+tg(x)\Vert + \Vert h(x)\Vert + \Vert x\Vert + \Vert y\Vert)\Vert g(x)\Vert\big]\\
&\le C\,\mathbb{E}_Q[\Vert g(x)\Vert^2]^{\frac12}\Big(1 + 2\mathbb{E}_Q[\Vert h(x)\Vert^2]^{\frac12} + \mathbb{E}_Q[\Vert g(x)\Vert^2]^{\frac12} + \mathbb{E}_Q[\Vert x\Vert^2 + \Vert y\Vert^2]^{\frac12}\Big)\\
&\le CM^{\frac12}\Vert g\Vert_{\mathcal H}\Big(1 + M^{\frac12} + 2\mathbb{E}_Q[\Vert h(x)\Vert^2]^{\frac12} + \mathbb{E}_Q[\Vert x\Vert^2 + \Vert y\Vert^2]^{\frac12}\Big).
\end{aligned}$$
For $A_1$, we used that $\nabla_v\ell_{out}$ is Lipschitz continuous in its second argument for any $(x, y)\in\mathcal{X}\times\mathcal{Y}$ and locally in $\omega$ by Assumption (L). The second upper-bound on $A_1$ uses Equation (32). For $A_3$, we used the locally Lipschitz property of $\nabla_\omega\ell_{out}$ from Assumption (L), followed by the Cauchy-Schwarz inequality and Equation (32). For the last line, we also used that $\Vert g\Vert_{\mathcal H}\le1$ by assumption. The above upper-bounds on $A_1$ and $A_3$ ensure that these quantities converge to 0 as $\epsilon$ and $g$ approach 0.

Controlling $A_2$ and $A_4$. To show that $A_2$ and $A_4$ converge to 0, we use the dominated convergence theorem. It is easy to see that $\bar\Delta_1(\epsilon, g, x, y, t)$ and $\bar\Delta_2(\epsilon, g, x, y, t)$ converge point-wise to 0 when $\epsilon$ and $g$ converge to 0, since $(\omega, v)\mapsto\nabla_v\ell_{out}(\omega, v, x, y)$ and $(\omega, v)\mapsto\nabla_\omega\ell_{out}(\omega, v, x, y)$ are continuous by Assumption (K). It remains to dominate these functions. For $\epsilon$ small enough so that Assumption (L) holds, we have that:
$$\Vert\bar\Delta_1(\epsilon, g, x, y, t)\Vert^2 \le 16C^2\big(1 + \Vert h(x)\Vert^2 + \Vert x\Vert^2 + \Vert y\Vert^2\big), \qquad \Vert\bar\Delta_2(\epsilon, g, x, y, t)\Vert \le 2C\big(1 + \Vert h(x)\Vert^2 + \Vert x\Vert^2 + \Vert y\Vert^2\big).$$
Both upper-bounds are integrable under Q, since $\mathbb{E}_Q[\Vert h(x)\Vert^2] < +\infty$ by Equation (32) and Q has finite second-order moments by Assumption (A). Therefore, by the dominated convergence theorem, we deduce that $A_2$ and $A_4$ converge to 0 as $\epsilon$ and $g$ approach 0. Finally, we have shown that $E(\epsilon, g) = o(\Vert\epsilon\Vert + \Vert g\Vert_{\mathcal H})$, which allows us to conclude that $\mathcal{L}_{out}$ is differentiable with the partial derivatives given by Equation (31).

Lemma D.4 (Differentiability of $\nabla_h\mathcal{L}_{in}$). Under Assumptions (A) and (E) to (I), the differential map $(\omega, h)\mapsto\nabla_h\mathcal{L}_{in}(\omega, h)$ defined in Lemma D.2 is differentiable on $\Omega\times\mathcal{H}$ in the sense of Definition C.1. Its differential $d_{(\omega,h)}\nabla_h\mathcal{L}_{in}(\omega, h) : \Omega\times\mathcal{H}\to\mathcal{H}$ acts on elements $(\epsilon, g)\in\Omega\times\mathcal{H}$ as follows:
$$d_{(\omega,h)}\nabla_h\mathcal{L}_{in}(\omega, h)(\epsilon, g) = \nabla^2_h\mathcal{L}_{in}(\omega, h)g + \big(\nabla_{\omega,h}\mathcal{L}_{in}(\omega, h)\big)^*\epsilon, \tag{34}$$
where $\nabla^2_h\mathcal{L}_{in}(\omega, h) : \mathcal{H}\to\mathcal{H}$ is a linear symmetric operator representing the partial derivative of $\nabla_h\mathcal{L}_{in}(\omega, h)$ w.r.t. $h$, and $(\nabla_{\omega,h}\mathcal{L}_{in}(\omega, h))^*$ is the adjoint of $\nabla_{\omega,h}\mathcal{L}_{in}(\omega, h) : \mathcal{H}\to\Omega$, which represents the partial derivative of $\nabla_h\mathcal{L}_{in}(\omega, h)$ w.r.t. $\omega$. Moreover, $\nabla^2_h\mathcal{L}_{in}(\omega, h)$ and $\nabla_{\omega,h}\mathcal{L}_{in}(\omega, h)$ are given by:
$$\nabla^2_h\mathcal{L}_{in}(\omega, h)g = \big(x\mapsto\mathbb{E}_P[\nabla^2_v\ell_{in}(\omega, h(x), x, y)\,|\,x]\,g(x)\big), \tag{35}$$
$$\nabla_{\omega,h}\mathcal{L}_{in}(\omega, h)g = \mathbb{E}_P[\nabla_{\omega,v}\ell_{in}(\omega, h(x), x, y)\,g(x)]. \tag{36}$$

Proof. Let $(\omega, h)$ be in $\Omega\times\mathcal{H}$.
To show that $\nabla_h\mathcal{L}_{in}$ is Hadamard differentiable, we proceed in two steps: we first identify a candidate differential and show that it is a bounded operator, then we prove Hadamard differentiability.

Candidate differential. For a given $(\omega, h)\in\Omega\times\mathcal{H}$, we consider the following linear operators $C_{\omega,h} : \mathcal{H}\to\mathcal{H}$ and $B_{\omega,h} : \mathcal{H}\to\Omega$:
$$C_{\omega,h}g = \big(x\mapsto\mathbb{E}_P[\nabla^2_v\ell_{in}(\omega, h(x), x, y)\,|\,x]\,g(x)\big), \qquad B_{\omega,h}g = \mathbb{E}_P[\nabla_{\omega,v}\ell_{in}(\omega, h(x), x, y)\,g(x)],$$
where the conditional expectation defining $C_{\omega,h}$ is over $y$ given $x$. Next, we show that $C_{\omega,h}$ and $B_{\omega,h}$ are well-defined and bounded.

Well-definiteness of the operator $C_{\omega,h}$. The first step is to show that the image $C_{\omega,h}g$ of any element $g\in\mathcal{H}$ by $C_{\omega,h}$ is also an element of $\mathcal{H}$. To this end, we simply need to find a finite upper-bound on $\Vert C_{\omega,h}g\Vert_{\mathcal H}$ for a given $g\in\mathcal{H}$:
$$\begin{aligned}
\Vert C_{\omega,h}g\Vert^2_{\mathcal H} &= \mathbb{E}_P\big[\Vert\mathbb{E}_P[\nabla^2_v\ell_{in}(\omega, h(x), x, y)|x]\,g(x)\Vert^2\big] \le \mathbb{E}_P\big[\Vert\mathbb{E}_P[\nabla^2_v\ell_{in}(\omega, h(x), x, y)|x]\Vert^2_{op}\,\Vert g(x)\Vert^2\big]\\
&\le \mathbb{E}_P\big[\mathbb{E}_P[\Vert\nabla^2_v\ell_{in}(\omega, h(x), x, y)\Vert_{op}|x]^2\,\Vert g(x)\Vert^2\big] \le \mathbb{E}_P\big[\Vert\nabla^2_v\ell_{in}(\omega, h(x), x, y)\Vert^2_{op}\,\Vert g(x)\Vert^2\big] \le L^2\Vert g\Vert^2_{\mathcal H}.
\end{aligned}$$
The first inequality follows from the operator norm inequality, the second by Jensen's inequality applied to the norm, the third by Jensen's inequality for the square function together with the tower property of conditional expectations, and the last uses that $\Vert\nabla^2_v\ell_{in}\Vert_{op}$ is upper-bounded by $L$, uniformly in $x$ and $y$, by Assumption (F). Therefore, we conclude that $C_{\omega,h}g$ belongs to $\mathcal{H}$. Moreover, the inequality $\Vert C_{\omega,h}g\Vert_{\mathcal H}\le L\Vert g\Vert_{\mathcal H}$ also establishes the continuity of the operator $C_{\omega,h}$.

Well-definiteness of the operator $B_{\omega,h}$. We first show that the image $B_{\omega,h}g$ is bounded. For a given $g$ in $\mathcal{H}$, we write:
$$\begin{aligned}
\Vert B_{\omega,h}g\Vert &= \Vert\mathbb{E}_P[\nabla_{\omega,v}\ell_{in}(\omega, h(x), x, y)g(x)]\Vert \le \mathbb{E}_P\big[\Vert\nabla_{\omega,v}\ell_{in}(\omega, h(x), x, y)g(x)\Vert\big]\\
&\le \mathbb{E}_P\big[\Vert\nabla_{\omega,v}\ell_{in}(\omega, h(x), x, y)\Vert_{op}\Vert g(x)\Vert\big] \le \Vert g\Vert_{\mathcal H}\,\mathbb{E}_P\big[\Vert\nabla_{\omega,v}\ell_{in}(\omega, h(x), x, y)\Vert^2_{op}\big]^{\frac12}\\
&\le \Vert g\Vert_{\mathcal H}\Big(\mathbb{E}_P\big[\Vert\nabla_{\omega,v}\ell_{in}(\omega, h(x), x, y) - \nabla_{\omega,v}\ell_{in}(\omega, 0, x, y)\Vert^2_{op}\big]^{\frac12} + \mathbb{E}_P\big[\Vert\nabla_{\omega,v}\ell_{in}(\omega, 0, x, y)\Vert^2_{op}\big]^{\frac12}\Big)\\
&\le C\Vert g\Vert_{\mathcal H}\Big(\mathbb{E}_P[\Vert h(x)\Vert^2]^{\frac12} + \mathbb{E}_P\big[(1 + \Vert x\Vert + \Vert y\Vert)^2\big]^{\frac12}\Big) \le C\Vert g\Vert_{\mathcal H}\Big(\Vert h\Vert_{\mathcal H} + 2\,\mathbb{E}_P\big[1 + \Vert x\Vert^2 + \Vert y\Vert^2\big]^{\frac12}\Big) < +\infty.
\end{aligned}$$
In the above expression, the second inequality is due to Jensen's inequality applied to the norm function, the third follows from the operator norm inequality, and the fourth follows by Cauchy-Schwarz. The fifth is due to the triangle inequality in $L_2(P)$. Finally, the sixth relies on two facts: 1) $v\mapsto\nabla_{\omega,v}\ell_{in}(\omega, v, x, y)$ is Lipschitz uniformly in $x$ and $y$ and locally in $\omega$ by Assumption (I), and 2) $\nabla_{\omega,v}\ell_{in}(\omega, 0, x, y)$ has at most linear growth in $\Vert x\Vert$ and $\Vert y\Vert$ locally in $\omega$ by Assumption (H). Since P has finite second-order moments by Assumption (A) and both $h$ and $g$ are square-integrable, we conclude that $\Vert B_{\omega,h}g\Vert$ is finite. Moreover, the last inequality establishes that $B_{\omega,h}$ is a continuous linear operator from $\mathcal{H}$ to $\Omega$. One can then see that the adjoint of $B_{\omega,h}$ admits a representation of the form:
$$(B_{\omega,h})^*\epsilon := \big(\nabla_{\omega,h}\mathcal{L}_{in}(\omega, h)\big)^*\epsilon = \Big(x\mapsto\mathbb{E}_P\big[(\nabla_{\omega,v}\ell_{in}(\omega, h(x), x, y))^\top\,\big|\,x\big]\,\epsilon\Big).$$
Therefore, we can consider the following candidate operator $d^2_{in}$ for the differential of $\nabla_h\mathcal{L}_{in}$:
$$d^2_{in}(\epsilon, g) := C_{\omega,h}g + (B_{\omega,h})^*\epsilon.$$

Differentiability of $\nabla_h\mathcal{L}_{in}$. We will show that $\nabla_h\mathcal{L}_{in}$ is jointly Hadamard differentiable at $(\omega, h)$ with differential operator given by:
$$d_{(\omega,h)}\nabla_h\mathcal{L}_{in}(\omega, h)(\epsilon, g) = C_{\omega,h}g + (B_{\omega,h})^*\epsilon. \tag{37}$$
To this end, we consider a sequence $(\epsilon_k, g_k)_{k\ge1}$ converging in $\Omega\times\mathcal{H}$ towards an element $(\epsilon, g)\in\Omega\times\mathcal{H}$, and a non-vanishing real-valued sequence $t_k$ converging to 0. Define the first-order error $E_k$ as follows:
$$E_k := \left\Vert\frac{1}{t_k}\big(\nabla_h\mathcal{L}_{in}(\omega+t_k\epsilon_k, h+t_kg_k) - \nabla_h\mathcal{L}_{in}(\omega, h)\big) - C_{\omega,h}g - (B_{\omega,h})^*\epsilon\right\Vert_{\mathcal H}.$$
Introduce the functions $P_1$ and $P_2$ defined over $\mathbb{N}\times\mathcal{X}\times\mathcal{Y}\times[0,1]$ as follows:
$$P_1(k, x, y, s) = \begin{cases}\nabla^2_v\ell_{in}(\omega+st_k\epsilon_k, h(x)+st_kg_k(x), x, y), & k\ge1,\\ \nabla^2_v\ell_{in}(\omega, h(x), x, y), & k=0,\end{cases}$$
$$P_2(k, x, y, s) = \begin{cases}\big(\nabla_{\omega,v}\ell_{in}(\omega+st_k\epsilon_k, h(x)+st_kg_k(x), x, y)\big)^\top, & k\ge1,\\ \big(\nabla_{\omega,v}\ell_{in}(\omega, h(x), x, y)\big)^\top, & k=0.\end{cases}$$
$$\Delta_1(k, x, y, s) = P_1(k, x, y, s) - P_1(0, x, y, s), \qquad \Delta_2(k, x, y, s) = P_2(k, x, y, s) - P_2(0, x, y, s).$$
By joint differentiability of $(\omega, v)\mapsto\nabla_v\ell_{in}(\omega, v, x, y)$ (Assumption (G)), we use the fundamental theorem of calculus to express $E_k$ in terms of $\Delta_1$ and $\Delta_2$:
$$\begin{aligned}
E_k^2 &\le \mathbb{E}_P\left[\left\Vert\int_0^1\big(P_1(k, x, y, s)g_k(x) - P_1(0, x, y, s)g(x) + P_2(k, x, y, s)\epsilon_k - P_2(0, x, y, s)\epsilon\big)\,ds\right\Vert^2\right]\\
&\le \mathbb{E}_P\int_0^1\big\Vert P_1(k, x, y, s)g_k(x) - P_1(0, x, y, s)g(x) + P_2(k, x, y, s)\epsilon_k - P_2(0, x, y, s)\epsilon\big\Vert^2\,ds\\
&\le 4\underbrace{\mathbb{E}_P\int_0^1\Vert\Delta_1(k, x, y, s)\Vert^2_{op}\Vert g(x)\Vert^2\,ds}_{A^{(1)}_k} + 4\underbrace{\mathbb{E}_P\int_0^1\Vert P_1(k, x, y, s)\Vert^2_{op}\Vert g(x) - g_k(x)\Vert^2\,ds}_{A^{(2)}_k}\\
&\quad + 4\Vert\epsilon\Vert^2\underbrace{\mathbb{E}_P\int_0^1\Vert\Delta_2(k, x, y, s)\Vert^2_{op}\,ds}_{B^{(1)}_k} + 4\Vert\epsilon - \epsilon_k\Vert^2\underbrace{\mathbb{E}_P\int_0^1\Vert P_2(k, x, y, s)\Vert^2_{op}\,ds}_{B^{(2)}_k}.
\end{aligned}$$
The second line uses Jensen's inequality applied to the squared norm; the last two lines decompose $P_1(k)g_k - P_1(0)g = \Delta_1 g + P_1(k)(g_k - g)$ and $P_2(k)\epsilon_k - P_2(0)\epsilon = \Delta_2\epsilon + P_2(k)(\epsilon_k - \epsilon)$, and then use Jensen's inequality for the square function followed by the operator norm inequality. It remains to show that $A^{(1)}_k$, $B^{(1)}_k$ and $A^{(2)}_k$ converge to 0 and that $B^{(2)}_k$ is bounded.

Upper-bound on $A^{(1)}_k$. We use the dominated convergence theorem. Assumption (F) ensures the existence of a positive constant $L$ and a neighborhood $B$ of $\omega$ such that $\Vert\nabla^2_v\ell_{in}(\omega', v, x, y)\Vert_{op}$ is bounded by $L$ for any $(\omega', x, y)\in B\times\mathcal{X}\times\mathcal{Y}$. Since $\omega + t_k\epsilon_k\to\omega$, there exists some $K_0$ so that, for any $k\ge K_0$, we can ensure that $\omega + t_k\epsilon_k\in B$. This allows us to deduce that:
$$\Vert\Delta_1(k, x, y, s)\Vert^2_{op}\Vert g(x)\Vert^2 \le 4L^2\Vert g(x)\Vert^2, \tag{38}$$
for any $k\ge K_0$ and any $(x, y)\in\mathcal{X}\times\mathcal{Y}$, with $\Vert g(x)\Vert^2$ integrable under P. Moreover, we also have the following point-wise convergence for P-almost all $x\in\mathcal{X}$:
$$\Vert\Delta_1(k, x, y, s)\Vert^2_{op}\Vert g(x)\Vert^2 \to 0. \tag{39}$$
Equation (39) follows by noting that $\omega + t_k\epsilon_k\to\omega$ and that $h(x) + t_kg_k(x)\to h(x)$ for P-almost all $x\in\mathcal{X}$, since $t_k$ converges to 0, $\epsilon_k$ converges to $\epsilon$ and $g_k$ converges to $g$ in $\mathcal{H}$ (and thus, up to extraction of a subsequence, point-wise for P-almost all $x\in\mathcal{X}$). Additionally, the map $(\omega, v)\mapsto\Vert\nabla^2_v\ell_{in}(\omega, v, x, y)\Vert_{op}$ is continuous by Assumption (G), which allows us to establish Equation (39). From Equations (38) and (39), we can apply the dominated convergence theorem, which allows us to deduce that $A^{(1)}_k\to0$.

Upper-bound on $A^{(2)}_k$. By a similar argument as for $A^{(1)}_k$ and using Assumption (F), we know that there exists $K_0>0$ so that for any $k\ge K_0$:
$$\Vert P_1(k, x, y, s)\Vert^2_{op} \le L^2. \tag{40}$$
Therefore, we directly get that:
$$A^{(2)}_k \le L^2\Vert g - g_k\Vert^2_{\mathcal H}\to0, \tag{41}$$
where we used that $g_k\to g$ by construction.

Upper-bound on $B^{(2)}_k$. We will show that $\Vert P_2(k, x, y, s)\Vert_{op}$ is upper-bounded by a square-integrable function under P. By Assumptions (H) and (I), there exist a neighborhood $B$ and a positive constant $C$ such that, for all $(\omega', v_1, v_2, x, y)\in B\times\mathcal{V}\times\mathcal{V}\times\mathcal{X}\times\mathcal{Y}$:
$$\Vert\nabla_{\omega,v}\ell_{in}(\omega', 0, x, y)\Vert \le C(1 + \Vert x\Vert + \Vert y\Vert), \tag{42}$$
$$\Vert\nabla_{\omega,v}\ell_{in}(\omega', v_1, x, y) - \nabla_{\omega,v}\ell_{in}(\omega', v_2, x, y)\Vert \le C\Vert v_1 - v_2\Vert. \tag{43}$$
By a similar argument as for $A^{(1)}_k$, there exists $K_0$ so that for any $k\ge K_0$, the above inequalities hold when choosing $\omega' = \omega + st_k\epsilon_k$. Using this fact, we obtain the following upper-bound on $\Vert P_2(k, x, y, s)\Vert_{op}$ for $k\ge K_0$:
$$\begin{aligned}
\Vert P_2(k, x, y, s)\Vert_{op} &\le \Vert\nabla_{\omega,v}\ell_{in}(\omega+st_k\epsilon_k, h(x)+st_kg_k(x), x, y) - \nabla_{\omega,v}\ell_{in}(\omega+st_k\epsilon_k, 0, x, y)\Vert_{op} + \Vert\nabla_{\omega,v}\ell_{in}(\omega+st_k\epsilon_k, 0, x, y)\Vert_{op}\\
&\le C(1 + \Vert h(x) + st_kg_k(x)\Vert + \Vert x\Vert + \Vert y\Vert) \le C(1 + \Vert h(x)\Vert + |t_k|\Vert g_k(x)\Vert + \Vert x\Vert + \Vert y\Vert).
\end{aligned}$$
Therefore, by taking expectations and integrating over $s$, it follows that:
$$B^{(2)}_k \le C^2\,\mathbb{E}_P\big[(1 + \Vert h(x)\Vert + |t_k|\Vert g_k(x)\Vert + \Vert x\Vert + \Vert y\Vert)^2\big] \le 4C^2\,\mathbb{E}_P\big[1 + \Vert h(x)\Vert^2 + t_k^2\Vert g_k(x)\Vert^2 + \Vert x\Vert^2 + \Vert y\Vert^2\big].$$
By construction, $t_k^2\Vert g_k\Vert^2_{\mathcal H}\to0$ and is therefore a bounded sequence. Moreover, $\mathbb{E}_P[\Vert h(x)\Vert^2]$ is finite since $h$ belongs to $\mathcal{H}$. Finally, $\mathbb{E}_P[\Vert x\Vert^2 + \Vert y\Vert^2] < +\infty$ by Assumption (A). Therefore, we have shown that $B^{(2)}_k$ is bounded.

Upper-bound on $B^{(1)}_k$.
By a similar argument as for $B^{(2)}_k$, and using again Assumptions (H) and (I), there exists $K_0$ so that for any $k\ge K_0$:
$$\begin{aligned}
\Vert\Delta_2(k, x, y, s)\Vert_{op} &\le \Vert\nabla_{\omega,v}\ell_{in}(\omega+st_k\epsilon_k, h(x)+st_kg_k(x), x, y) - \nabla_{\omega,v}\ell_{in}(\omega+st_k\epsilon_k, h(x), x, y)\Vert_{op}\\
&\quad + \Vert\nabla_{\omega,v}\ell_{in}(\omega+st_k\epsilon_k, h(x), x, y) - \nabla_{\omega,v}\ell_{in}(\omega, h(x), x, y)\Vert_{op}\\
&\le C|t_k|\Vert g_k(x)\Vert + \Vert\nabla_{\omega,v}\ell_{in}(\omega+st_k\epsilon_k, h(x), x, y) - \nabla_{\omega,v}\ell_{in}(\omega, h(x), x, y)\Vert_{op},
\end{aligned}$$
where we used Equation (43) to upper-bound the first term. By squaring the above inequality and taking the expectation under P, we get:
$$B^{(1)}_k \le 2C^2t_k^2\Vert g_k\Vert^2_{\mathcal H} + 2\,\mathbb{E}_P\big[\underbrace{\Vert\nabla_{\omega,v}\ell_{in}(\omega+st_k\epsilon_k, h(x), x, y) - \nabla_{\omega,v}\ell_{in}(\omega, h(x), x, y)\Vert^2_{op}}_{e_k(x, y)}\big].$$
We only need to show that $\mathbb{E}_P[e_k(x, y)]$ converges to 0, since the first term $2C^2t_k^2\Vert g_k\Vert^2_{\mathcal H}$ already converges to 0 by construction of $t_k$ and $g_k$. To achieve this, we will use the dominated convergence theorem. It is easy to see that $e_k(x, y)$ converges to 0 point-wise by continuity of $\omega\mapsto\nabla_{\omega,v}\ell_{in}(\omega, v, x, y)$ (Assumption (G)). Therefore, we only need to show that $e_k(x, y)$ is dominated by an integrable function. Provided that $k\ge K_0$, we can use Equations (42) and (43) to get the following upper-bounds:
$$\begin{aligned}
\frac{1}{4}e_k(x, y) &\le \Vert\nabla_{\omega,v}\ell_{in}(\omega+st_k\epsilon_k, h(x), x, y) - \nabla_{\omega,v}\ell_{in}(\omega+st_k\epsilon_k, 0, x, y)\Vert^2_{op} + \Vert\nabla_{\omega,v}\ell_{in}(\omega, h(x), x, y) - \nabla_{\omega,v}\ell_{in}(\omega, 0, x, y)\Vert^2_{op}\\
&\quad + \Vert\nabla_{\omega,v}\ell_{in}(\omega, 0, x, y)\Vert^2_{op} + \Vert\nabla_{\omega,v}\ell_{in}(\omega+st_k\epsilon_k, 0, x, y)\Vert^2_{op} \le 6C^2\big(1 + \Vert h(x)\Vert^2 + \Vert x\Vert^2 + \Vert y\Vert^2\big).
\end{aligned}$$
The right-hand side of the last line is an integrable function that is independent of $k$, since $h$ is square-integrable by definition and $\Vert x\Vert^2 + \Vert y\Vert^2$ is integrable by Assumption (A). Therefore, by application of the dominated convergence theorem, it follows that $\mathbb{E}_P[e_k(x, y)]\to0$, and we have shown that $B^{(1)}_k\to0$. To conclude, we have shown that the first-order error $E_k$ converges to 0, which means that $(\omega, h)\mapsto\nabla_h\mathcal{L}_{in}(\omega, h)$ is jointly differentiable on $\Omega\times\mathcal{H}$, with differential given by Equations (34) and (35).

D.3 Proof of Proposition 2.3

Proof. The strategy is to show that the conditions on $\mathcal{L}_{in}$ and $\mathcal{L}_{out}$ stated in Proposition 2.2 hold. By Assumption (C), for any $\omega\in\Omega$, there exist a positive constant $\mu$ and a neighborhood $B$ of $\omega$ on which the function $\ell_{in}(\omega', v, x, y)$ is $\mu$-strongly convex in $v$ for any $(\omega', x, y)\in B\times\mathcal{X}\times\mathcal{Y}$. Therefore, by integration, we directly deduce that $h\mapsto\mathcal{L}_{in}(\omega', h)$ is $\mu$-strongly convex for any $\omega'\in B$. By Lemmas D.2 and D.4, $h\mapsto\mathcal{L}_{in}(\omega, h)$ is differentiable on $\mathcal{H}$ for all $\omega\in\Omega$, and $\nabla_h\mathcal{L}_{in}$ is Hadamard differentiable on $\Omega\times\mathcal{H}$. Additionally, $\mathcal{L}_{out}$ is jointly differentiable in $\omega$ and $h$ by Lemma D.3. Therefore, the conditions on $\mathcal{L}_{in}$ and $\mathcal{L}_{out}$ for applying Proposition 2.2 hold. Using the notations from Proposition 2.2, the total gradient $\nabla F(\omega)$ can be expressed as:
$$\nabla F(\omega) = g_\omega + B_\omega a^\star_\omega, \tag{45}$$
where $g_\omega = \nabla_\omega\mathcal{L}_{out}(\omega, h^\star_\omega)$, $B_\omega = \nabla_{\omega,h}\mathcal{L}_{in}(\omega, h^\star_\omega)$, and $a^\star_\omega$ is the minimizer of the adjoint objective $\mathcal{L}_{adj}$:
$$\mathcal{L}_{adj}(\omega, a) := \frac{1}{2}\langle a, C_\omega a\rangle_{\mathcal H} + \langle a, d_\omega\rangle_{\mathcal H}, \qquad \text{with } C_\omega = \nabla^2_h\mathcal{L}_{in}(\omega, h^\star_\omega) \text{ and } d_\omega = \nabla_h\mathcal{L}_{out}(\omega, h^\star_\omega).$$
Recalling the expressions of the first and second-order differential operators from Lemmas D.2 and D.4, we deduce the expression of the adjoint objective as a sum of two expectations under P and Q, given the optimal prediction function:
$$\mathcal{L}_{adj}(\omega, a) = \frac{1}{2}\mathbb{E}_{(x,y)\sim P}\big[a(x)^\top\nabla^2_v\ell_{in}(\omega, h^\star_\omega(x), x, y)\,a(x)\big] + \mathbb{E}_{(x,y)\sim Q}\big[a(x)^\top\nabla_v\ell_{out}(\omega, h^\star_\omega(x), x, y)\big].$$
Furthermore, the vectors $g_\omega$ and $B_\omega a^\star_\omega$ appearing in Equation (45) can also be expressed as expectations:
$$g_\omega = \mathbb{E}_{(x,y)\sim Q}[\nabla_\omega\ell_{out}(\omega, h^\star_\omega(x), x, y)], \qquad B_\omega a^\star_\omega = \mathbb{E}_{(x,y)\sim P}[\nabla_{\omega,v}\ell_{in}(\omega, h^\star_\omega(x), x, y)\,a^\star_\omega(x)].$$
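Since the adjoint objective above is a sum of point-wise expectations, it can be estimated from mini-batches and minimized over a parametric adjoint model. The following self-contained JAX sketch illustrates this; the linear models and squared losses are illustrative stand-ins for the neural networks and point-wise losses used by the actual method, and all names are assumptions made for this example.

```python
# Sketch of the empirical adjoint objective L_adj in the L2 setting: the
# adjoint model a (parameters q) is trained on samples from P and Q given
# a learned predictor h (parameters p). Linear models and squared losses
# below are illustrative assumptions.
import jax
import jax.numpy as jnp

def h_apply(p, x):  return p @ x             # predictor h(x), values in R^dv
def a_apply(q, x):  return q @ x             # adjoint function a(x) in R^dv
def l_in(w, v, x, y):   return 0.5 * jnp.sum((y - v) ** 2) + 0.5 * w * (v @ v)
def l_out(w, v, x, y):  return 0.5 * jnp.sum((y - v) ** 2)

def adjoint_loss(q, w, p, batch_P, batch_Q):
    xP, yP = batch_P                          # samples (x, y) ~ P
    xQ, yQ = batch_Q                          # samples (x, y) ~ Q
    def quad(x, y):                           # 0.5 a(x)^T Hess_v l_in a(x)
        a = a_apply(q, x)
        H = jax.hessian(l_in, argnums=1)(w, h_apply(p, x), x, y)
        return 0.5 * a @ H @ a
    def lin(x, y):                            # a(x)^T grad_v l_out
        a = a_apply(q, x)
        return a @ jax.grad(l_out, argnums=1)(w, h_apply(p, x), x, y)
    return jnp.mean(jax.vmap(quad)(xP, yP)) + jnp.mean(jax.vmap(lin)(xQ, yQ))
```

Minimizing this loss in `q` with any stochastic optimizer plays the role of the functional regression defining $a^\star_\omega$, with $h^\star_\omega$ replaced by the current predictor.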
E Convergence Analysis

We provide a convergence result for Algorithm 1 to stationary points of $F$. Our analysis uses the framework of biased stochastic gradient descent (biased SGD) [Demidovich et al., 2024], where the bias arises from suboptimality errors when solving the inner-level and adjoint problems.

E.1 Setup and assumptions

Gradient estimators. Recall that the total gradient $\nabla F(\omega)$ admits the following expression under Assumptions (A) to (L):
$$\nabla F(\omega) = \mathbb{E}_Q[\nabla_\omega\ell_{out}(\omega, h^\star_\omega(x), x, y)] + \mathbb{E}_P[\nabla_{\omega,v}\ell_{in}(\omega, h^\star_\omega(x), x, y)\,a^\star_\omega(x)].$$
We denote by $\hat g(\omega)$ the gradient estimator, i.e., the mapping $\hat g : \Omega\to\Omega$ computed by Algorithm 4, which admits the following expression:
$$\hat g(\omega) = \frac{1}{|\mathcal{B}_{out}|}\sum_{(\tilde x, \tilde y)\in\mathcal{B}_{out}}\nabla_\omega\ell_{out}\big(\omega, \hat h_\omega(\tilde x), \tilde x, \tilde y\big) + \frac{1}{|\mathcal{B}_{in}|}\sum_{(x, y)\in\mathcal{B}_{in}}\nabla_{\omega,v}\ell_{in}\big(\omega, \hat h_\omega(x), x, y\big)\,\hat a_\omega(x),$$
where $\mathcal{B}_{in}$ and $\mathcal{B}_{out}$ are samples from P and Q, independent from $\hat h_\omega$ and $\hat a_\omega$ and independent from each other (i.e., $\mathcal{B}_{in}\perp\mathcal{B}_{out}$). Here, there are three independent sources of randomness when computing $\hat g(\omega)$: the estimation of $\hat h_\omega$ and $\hat a_\omega$ in Algorithms 2 and 3, as well as the random batches $\mathcal{B}_{in}$ and $\mathcal{B}_{out}$. We denote by $\mathbb{E}[\cdot]$ the expectation with respect to all random variables appearing in the expression of $\hat g(\omega)$, and by $\mathbb{E}[\cdot|\hat h_\omega]$ and $\mathbb{E}[\cdot|\hat h_\omega, \hat a_\omega]$ the conditional expectations knowing $\hat h_\omega$ only and both $\hat h_\omega$ and $\hat a_\omega$, respectively. $\hat g(\omega)$ is a biased estimator of $\nabla F(\omega)$ (i.e., $\mathbb{E}[\hat g(\omega)]$ is not equal to $\nabla F(\omega)$), as the bias is due to using the sub-optimal solutions $\hat h_\omega$ and $\hat a_\omega$ instead of $h^\star_\omega$ and $a^\star_\omega$ in the expression of $\hat g(\omega)$. Furthermore, we define $G(\omega) := \mathbb{E}[\hat g(\omega)|\hat h_\omega, \hat a_\omega]$ to be the conditional expectation of $\hat g(\omega)$ at $\omega$ given the estimates $\hat h_\omega$ and $\hat a_\omega$ of $h^\star_\omega$ and $a^\star_\omega$. By independence of $\mathcal{B}_{in}$ and $\mathcal{B}_{out}$ from $\hat h_\omega$ and $\hat a_\omega$, $G(\omega)$ admits the following expression:
$$G(\omega) = \mathbb{E}_Q\big[\nabla_\omega\ell_{out}(\omega, \hat h_\omega(x), x, y)\big] + \mathbb{E}_P\big[\nabla_{\omega,v}\ell_{in}(\omega, \hat h_\omega(x), x, y)\,\hat a_\omega(x)\big].$$

Approximate adjoint objective. We introduce the approximate adjoint objective $\bar{\mathcal{L}}_{adj}(\omega, a)$, where $h^\star_\omega$ is replaced by $\hat h_\omega$:
$$\bar{\mathcal{L}}_{adj}(\omega, a) = \frac{1}{2}\big\langle a, \nabla^2_h\mathcal{L}_{in}(\omega, \hat h_\omega)a\big\rangle_{\mathcal H} + \big\langle a, \nabla_h\mathcal{L}_{out}(\omega, \hat h_\omega)\big\rangle_{\mathcal H}. \tag{46}$$
By independence of the estimator $\hat h_\omega$ and the samples $\mathcal{B}$ used for computing $\hat{\mathcal{L}}_{adj}(\omega, a, \hat h_\omega, \mathcal{B})$, it is easy to see that $\mathbb{E}[\hat{\mathcal{L}}_{adj}(\omega, a, \hat h_\omega, \mathcal{B})|\hat h_\omega] = \bar{\mathcal{L}}_{adj}(\omega, a)$. Hence, it is natural to think of $\hat a_\omega$ as an approximation to the minimizer $\bar a_\omega$ of $\bar{\mathcal{L}}_{adj}(\omega, a)$ in $\mathcal{H}$.

Sub-optimality assumption. The following assumption quantifies the sub-optimality errors made by $\hat h_\omega$ and $\hat a_\omega$.
(a) For some positive constants $\epsilon_{in}$ and $\epsilon_{adj}$ and for all $\omega\in\Omega$, $\hat h_\omega$ and $\hat a_\omega$ satisfy:
$$\mathbb{E}\big[\mathcal{L}_{in}(\omega, \hat h_\omega) - \mathcal{L}_{in}(\omega, h^\star_\omega)\big] \le \epsilon_{in}, \qquad \mathbb{E}\big[\bar{\mathcal{L}}_{adj}(\omega, \hat a_\omega) - \bar{\mathcal{L}}_{adj}(\omega, \bar a_\omega)\big] \le \epsilon_{adj}.$$

Assumptions on $\ell_{in}$.
(b) (Strong convexity) $\ell_{in}(\omega, v, x, y)$ is $\mu$-strongly convex in $v\in\mathcal{V}$.
(c) $v\mapsto\nabla^2_v\ell_{in}(\omega, v, x, y)$ is $C_1$-Lipschitz for all $(\omega, x, y)\in\Omega\times\mathcal{X}\times\mathcal{Y}$.
(d) $v\mapsto\nabla_{\omega,v}\ell_{in}(\omega, v, x, y)$ is differentiable, $C_2$-Lipschitz, and bounded by $B_2\in\mathbb{R}$.

Assumptions on $\ell_{out}$.
(e) $\Vert\nabla_h\mathcal{L}_{out}(\omega, h^\star_\omega)\Vert_{\mathcal H}$ is bounded by a positive constant $B_3$.
(f) $v\mapsto\nabla_v\ell_{out}(\omega, v, x, y)$ is $C_4$-Lipschitz for all $(\omega, x, y)\in\Omega\times\mathcal{X}\times\mathcal{Y}$.
(g) $v\mapsto\nabla_\omega\ell_{out}(\omega, v, x, y)$ is differentiable and $C_3$-Lipschitz for all $(\omega, x, y)\in\Omega\times\mathcal{X}\times\mathcal{Y}$.
(h) $(x, y)\mapsto\nabla_\omega\ell_{out}(\omega, h^\star_\omega(x), x, y)$ has a variance bounded by $\sigma^2_{out}$ for all $\omega\in\Omega$.

Smoothness of the total objective.
(i) The function $F(\omega)$ is $L$-smooth on $\Omega$ and bounded from below by $F^\star\in\mathbb{R}$.

Remark E.1. Assumption (a) reflects the generalization errors made when optimizing $\mathcal{L}_{in}$ and $\bar{\mathcal{L}}_{adj}$ using Algorithms 2 and 3. In the case of over-parameterized networks, these errors can be made smaller by increasing the network capacity, the number of steps and the number of samples [Allen-Zhu et al., 2019, Du et al., 2019, Zou et al., 2020].
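To make the estimator $\hat g(\omega)$ above concrete, the following JAX sketch evaluates its two mini-batch terms, reusing the illustrative definitions (`h_apply`, `a_apply`, `l_in`, `l_out`) from the adjoint-loss snippet of Appendix D.3; those names, and the choice of a scalar outer variable, are assumptions made only for this example.

```python
# Sketch of the total-gradient estimator \hat g(w). It reuses the
# illustrative losses and linear models from the previous snippet and
# assumes a scalar outer parameter w for simplicity.
import jax
import jax.numpy as jnp

def total_grad_estimate(w, p, q, batch_P, batch_Q):
    xP, yP = batch_P                           # mini-batch B_in  ~ P
    xQ, yQ = batch_Q                           # mini-batch B_out ~ Q
    def outer_term(x, y):                      # grad_w l_out(w, h(x), x, y)
        return jax.grad(l_out, argnums=0)(w, h_apply(p, x), x, y)
    def cross_term(x, y):                      # grad_{w,v} l_in(...) a(x)
        grad_v = lambda w_: jax.grad(l_in, argnums=1)(w_, h_apply(p, x), x, y)
        # cross second derivative (Jacobian of the v-gradient w.r.t. w),
        # contracted against the adjoint value a(x):
        return jax.jacfwd(grad_v)(w) @ a_apply(q, x)
    return (jnp.mean(jax.vmap(outer_term)(xQ, yQ))
            + jnp.mean(jax.vmap(cross_term)(xP, yP)))
```

Note that the two sums use independent batches, matching the independence of $\mathcal{B}_{in}$ and $\mathcal{B}_{out}$ assumed in the analysis.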
Remark E.2. Assumptions (b) to (h) are similar in spirit to those used for analyzing bilevel optimization algorithms (e.g., [Arbel and Mairal, 2022a, Assumptions 1 to 5]). In particular, Assumption (e) is even weaker than [Arbel and Mairal, 2022a, Assumption 2], where $\nabla_h\mathcal{L}_{out}(\omega, h)$ needs to be bounded for all $(\omega, h)$ in $\Omega\times\mathcal{H}$. For instance, Assumption (e) trivially holds when $\nabla_h\mathcal{L}_{out}(\omega, h)$ is a linear transformation of $\nabla_h\mathcal{L}_{in}(\omega, h)$, as is the case for min-max problems. There, $\nabla_h\mathcal{L}_{in}(\omega, h^\star_\omega) = 0$, so that $\nabla_h\mathcal{L}_{out}(\omega, h^\star_\omega) = 0$ is bounded, while $\nabla_h\mathcal{L}_{out}(\omega, h)$ might not be bounded in general.

E.2 Proof of the main result

The general strategy of the proof is to first show that the conditions for applying the general convergence result for biased SGD of Demidovich et al. [2024, Theorem 3] hold. We start with Proposition E.3, which shows that the biased gradient estimate $\hat g(\omega)$ satisfies the conditions of Demidovich et al. [2024, Assumption 9].

Proposition E.3. Let Assumptions (a) to (h) and Assumptions (A) to (L) hold. Then:
$$\langle\nabla F(\omega), \mathbb{E}[\hat g(\omega)]\rangle \ge \frac{1}{2}\Vert\nabla F(\omega)\Vert^2 - \frac{1}{2}(c_1\epsilon_{in} + c_2\epsilon_{adj}), \tag{47}$$
$$\mathbb{E}\big[\Vert\hat g(\omega)\Vert^2\big] \le \underbrace{\sigma_0^2 + 2(c_1\epsilon_{in} + c_2\epsilon_{adj})}_{\sigma^2_{eff}} + 2\Vert\nabla F(\omega)\Vert^2, \tag{48}$$
where $c_1$ and $c_2$ are constants defined in Equation (50) and $\sigma_0$ is a positive constant defined in Equation (51).

Proof. We prove each bound separately.

Lower-bound on $\langle\nabla F(\omega), \mathbb{E}[\hat g(\omega)]\rangle$. First note that $\mathbb{E}[\hat g(\omega)] = \mathbb{E}[G(\omega)]$. Hence, by direct calculation, we have that:
$$\begin{aligned}
\langle\nabla F(\omega), \mathbb{E}[\hat g(\omega)]\rangle &= \langle\nabla F(\omega), \mathbb{E}[G(\omega)]\rangle = \frac{1}{2}\Vert\nabla F(\omega)\Vert^2 + \frac{1}{2}\Vert\mathbb{E}[G(\omega)]\Vert^2 - \frac{1}{2}\Vert\mathbb{E}[G(\omega)] - \nabla F(\omega)\Vert^2\\
&\ge \frac{1}{2}\Vert\nabla F(\omega)\Vert^2 - \frac{1}{2}\mathbb{E}\big[\Vert G(\omega) - \nabla F(\omega)\Vert^2\big] \ge \frac{1}{2}\Vert\nabla F(\omega)\Vert^2 - \frac{1}{2}(c_1\epsilon_{in} + c_2\epsilon_{adj}),
\end{aligned}$$
where we use Jensen's inequality and Lemma E.4 to get the last lower-bound.

Upper-bound on $\mathbb{E}[\Vert\hat g(\omega)\Vert^2]$.
$$\begin{aligned}
\mathbb{E}\big[\Vert\hat g(\omega)\Vert^2\big] &= \mathbb{E}\big[\Vert\hat g(\omega) - G(\omega)\Vert^2\big] + \mathbb{E}\big[\Vert G(\omega) - \nabla F(\omega) + \nabla F(\omega)\Vert^2\big]\\
&\le \mathbb{E}\big[\Vert\hat g(\omega) - G(\omega)\Vert^2\big] + 2\mathbb{E}\big[\Vert G(\omega) - \nabla F(\omega)\Vert^2\big] + 2\Vert\nabla F(\omega)\Vert^2 \le \sigma_0^2 + 2(c_1\epsilon_{in} + c_2\epsilon_{adj}) + 2\Vert\nabla F(\omega)\Vert^2,
\end{aligned}$$
where the first equality uses that $G(\omega)$ is the conditional expectation of $\hat g(\omega)$ given $(\hat h_\omega, \hat a_\omega)$, and the last line uses Lemmas E.4 and E.5.

We can now directly use Proposition E.3 and Assumption (i) on $F$ to prove Theorem 3.1 using the biased SGD convergence result of [Demidovich et al., 2024, Theorem 3].

Proof of Theorem 3.1. The proof is a direct application of [Demidovich et al., 2024, Theorem 3], given that the variance and bias conditions on the estimator $\hat g(\omega)$ are satisfied by Proposition E.3 and that $F$ is $L$-smooth with a finite lower-bound $F^\star\in\mathbb{R}$ by Assumption (i).

E.3 Bias-variance decomposition

Lemma E.4 (Bias control). Let Assumptions (a) to (g) and Assumptions (A) to (L) hold. Then:
$$\mathbb{E}\big[\Vert G(\omega) - \nabla F(\omega)\Vert^2\big] \le c_1\epsilon_{in} + c_2\epsilon_{adj}, \tag{49}$$
where $c_1$ and $c_2$ are non-negative constants defined in Equation (50).

Proof. We first decompose the difference:
$$\begin{aligned}
G(\omega) - \nabla F(\omega) &= \underbrace{\mathbb{E}_Q\big[\nabla_\omega\ell_{out}(\omega, \hat h_\omega(x), x, y) - \nabla_\omega\ell_{out}(\omega, h^\star_\omega(x), x, y)\big]}_{A_1}\\
&\quad + \underbrace{\mathbb{E}_P\big[\big(\nabla_{\omega,v}\ell_{in}(\omega, \hat h_\omega(x), x, y) - \nabla_{\omega,v}\ell_{in}(\omega, h^\star_\omega(x), x, y)\big)\,a^\star_\omega(x)\big]}_{A_2}\\
&\quad + \underbrace{\mathbb{E}_P\big[\nabla_{\omega,v}\ell_{in}(\omega, \hat h_\omega(x), x, y)\big(\hat a_\omega(x) - a^\star_\omega(x)\big)\big]}_{A_3}.
\end{aligned}$$
We have the following bounds:
$$\begin{aligned}
\Vert A_1\Vert &\le MC_3\Vert\hat h_\omega - h^\star_\omega\Vert_{\mathcal H},\\
\Vert A_2\Vert &\le C_2\Vert a^\star_\omega\Vert_{\mathcal H}\Vert\hat h_\omega - h^\star_\omega\Vert_{\mathcal H} \le C_2B_3\mu^{-1}\Vert\hat h_\omega - h^\star_\omega\Vert_{\mathcal H},\\
\Vert A_3\Vert &\le B_2\,\mathbb{E}_P[\Vert\hat a_\omega(x) - a^\star_\omega(x)\Vert] \le B_2\big(\mathbb{E}_P[\Vert\hat a_\omega(x) - \bar a_\omega(x)\Vert] + \mathbb{E}_P[\Vert\bar a_\omega(x) - a^\star_\omega(x)\Vert]\big) \le B_2\big(\Vert\hat a_\omega - \bar a_\omega\Vert_{\mathcal H} + \Vert\bar a_\omega - a^\star_\omega\Vert_{L_1}\big).
\end{aligned}$$
The first inequality holds since $v\mapsto\nabla_\omega\ell_{out}(\omega, v, x, y)$ is $C_3$-Lipschitz by Assumption (g) and $dQ(\cdot, y)$ admits a density w.r.t. $dP(\cdot, y)$ bounded by a positive constant $M$ by Assumption (B). The second inequality holds since $v\mapsto\nabla_{\omega,v}\ell_{in}(\omega, v, x, y)$ is $C_2$-Lipschitz by Assumption (d) and $\Vert a^\star_\omega\Vert_{\mathcal H}$ is upper-bounded by $\mu^{-1}B_3$ as a consequence of Lemma E.6. Finally, the inequality on $A_3$ holds since $\nabla_{\omega,v}\ell_{in}(\omega, v, x, y)$ is bounded by a constant $B_2$ by Assumption (d). Therefore, it holds that the difference between $G(\omega)$ and $\nabla F(\omega)$ satisfies:
$$\Vert G(\omega) - \nabla F(\omega)\Vert^2 \le 3\big(MC_3 + C_2B_3\mu^{-1}\big)^2\Vert\hat h_\omega - h^\star_\omega\Vert^2_{\mathcal H} + 3B_2^2\Vert\hat a_\omega - \bar a_\omega\Vert^2_{\mathcal H} + 3B_2^2\Vert\bar a_\omega - a^\star_\omega\Vert^2_{L_1}.$$
Taking the expectation over $\hat h_\omega$ and $\hat a_\omega$ and using the bounds in Lemma E.6 yields:
$$\mathbb{E}\big[\Vert G(\omega) - \nabla F(\omega)\Vert^2\big] \le 6\big(MC_3 + C_2B_3\mu^{-1}\big)^2\mu^{-1}\epsilon_{in} + 6B_2^2\mu^{-1}\epsilon_{adj} + 6B_2^2\mu^{-1}\big(\mu^{-1}C_4M + \mu^{-2}C_1B_3\big)^2\epsilon_{in}.$$
Finally, the upper bound on the bias holds with $c_1$ and $c_2$ defined as:
$$c_1 := 6\big(MC_3 + C_2B_3\mu^{-1}\big)^2\mu^{-1} + 6B_2^2\mu^{-1}\big(\mu^{-1}C_4M + \mu^{-2}B_3C_1\big)^2, \qquad c_2 := 6B_2^2\mu^{-1}. \tag{50}$$

Lemma E.5 (Variance control). Let Assumptions (a) to (h) and Assumptions (A) to (L) hold. Then the variance is upper-bounded as follows:
$$\mathbb{E}\big[\Vert\hat g(\omega) - G(\omega)\Vert^2\big] \le \sigma_0^2,$$
where $\sigma_0$ is a positive constant given by:
$$\sigma_0^2 := \frac{2}{|\mathcal{B}_{out}|}\big(2C_3^2\mu^{-1}\epsilon_{in} + \sigma^2_{out}\big) + \frac{4B_2^2}{|\mathcal{B}_{in}|}\big(\mu^{-1}\epsilon_{adj} + 2\mu^{-3}C_4^2M^2\epsilon_{in} + \mu^{-2}B_3^2\big). \tag{51}$$

Proof. By definition of $\hat g(\omega)$ and $G(\omega)$, we have that:
$$\begin{aligned}
\mathbb{E}\big[\Vert\hat g(\omega) - G(\omega)\Vert^2\big] &= \frac{1}{|\mathcal{B}_{out}|}\mathbb{E}\Big[\big\Vert\nabla_\omega\ell_{out}(\omega, \hat h_\omega(x), x, y) - \mathbb{E}_Q\big[\nabla_\omega\ell_{out}(\omega, \hat h_\omega(x'), x', y')\,\big|\,\hat h_\omega\big]\big\Vert^2\Big]\\
&\quad + \frac{1}{|\mathcal{B}_{in}|}\mathbb{E}\Big[\big\Vert\nabla_{\omega,v}\ell_{in}(\omega, \hat h_\omega(x), x, y)\hat a_\omega(x)\big\Vert^2\Big] - \frac{1}{|\mathcal{B}_{in}|}\mathbb{E}\Big[\big\Vert\mathbb{E}_P\big[\nabla_{\omega,v}\ell_{in}(\omega, \hat h_\omega(x), x, y)\hat a_\omega(x)\,\big|\,\hat h_\omega, \hat a_\omega\big]\big\Vert^2\Big]\\
&\le \frac{V_1}{|\mathcal{B}_{out}|} + \frac{V_2}{|\mathcal{B}_{in}|},
\end{aligned}$$
where the first equality is a direct consequence of the independence of $\mathcal{B}_{in}$ and $\mathcal{B}_{out}$. Moreover, we can bound $V_1$ and $V_2$ as follows:
$$\begin{aligned}
V_1 &\le 2\mathbb{E}\big[\Vert\nabla_\omega\ell_{out}(\omega, \hat h_\omega(x), x, y) - \nabla_\omega\ell_{out}(\omega, h^\star_\omega(x), x, y)\Vert^2\big] + 2\mathbb{E}_Q\big[\Vert\nabla_\omega\ell_{out}(\omega, h^\star_\omega(x), x, y) - \mathbb{E}_Q[\nabla_\omega\ell_{out}(\omega, h^\star_\omega(x'), x', y')]\Vert^2\big]\\
&\le 2C_3^2\,\mathbb{E}\big[\Vert\hat h_\omega - h^\star_\omega\Vert^2\big] + 2\sigma^2_{out} \le 4C_3^2\mu^{-1}\epsilon_{in} + 2\sigma^2_{out},\\
V_2 &\le B_2^2\,\mathbb{E}\big[\Vert\hat a_\omega\Vert^2_{\mathcal H}\big] \le 2B_2^2\big(\mathbb{E}[\Vert\hat a_\omega - \bar a_\omega\Vert^2_{\mathcal H}] + \mathbb{E}[\Vert\bar a_\omega\Vert^2_{\mathcal H}]\big) \le 4B_2^2\big(\mu^{-1}\epsilon_{adj} + 2\mu^{-3}C_4^2M^2\epsilon_{in} + \mu^{-2}B_3^2\big).
\end{aligned}$$
For $V_1$, we used that $v\mapsto\nabla_\omega\ell_{out}(\omega, v, x, y)$ is $C_3$-Lipschitz uniformly in $\omega, x$ and $y$ by Assumption (g), and that $(x, y)\mapsto\nabla_\omega\ell_{out}(\omega, h^\star_\omega(x), x, y)$ has a variance uniformly bounded by $\sigma^2_{out}$ as a consequence of Assumption (h). For $V_2$, we use Assumption (d), where $\nabla_{\omega,v}\ell_{in}$ is uniformly bounded by a constant $B_2$, and apply the bounds on $\mathbb{E}[\Vert\bar a_\omega\Vert^2_{\mathcal H}]$ and $\mathbb{E}[\Vert\hat a_\omega - \bar a_\omega\Vert^2_{\mathcal H}]$ from Lemma E.6.

Lemma E.6. Consider the optimal prediction and adjoint functions $h^\star_\omega, a^\star_\omega$ defined in Equations (6) and (7), as well as their estimated versions $\hat h_\omega, \hat a_\omega$ given in Algorithm 1. Under Assumptions (a) to (f) and Assumptions (A) to (L), the following holds:
$$\mathbb{E}\big[\Vert\hat h_\omega - h^\star_\omega\Vert^2\big] \le 2\mu^{-1}\epsilon_{in}, \qquad \mathbb{E}\big[\Vert\hat a_\omega - \bar a_\omega\Vert^2_{\mathcal H}\big] \le 2\mu^{-1}\epsilon_{adj},$$
$$\mathbb{E}\big[\Vert\bar a_\omega - a^\star_\omega\Vert^2_{L_1}\big] \le 2\big(\mu^{-2}C_1B_3 + \mu^{-1}C_4M\big)^2\mu^{-1}\epsilon_{in},$$
$$\mathbb{E}\big[\Vert\bar a_\omega\Vert^2_{\mathcal H}\big] \le 4\mu^{-3}C_4^2M^2\epsilon_{in} + 2\mu^{-2}B_3^2, \qquad \Vert a^\star_\omega\Vert_{\mathcal H} \le \mu^{-1}B_3.$$

Proof. We show each of the upper bounds separately.

Upper-bound on $\mathbb{E}[\Vert\hat h_\omega - h^\star_\omega\Vert^2]$. We use the strong convexity Assumption (b) and Assumption (a) to show the first bound:
$$\mathbb{E}\big[\Vert\hat h_\omega - h^\star_\omega\Vert^2\big] \le \mathbb{E}\Big[\frac{2}{\mu}\big(\mathcal{L}_{in}(\omega, \hat h_\omega) - \mathcal{L}_{in}(\omega, h^\star_\omega)\big)\Big] \le \frac{2\epsilon_{in}}{\mu}. \tag{52}$$

Upper-bound on $\mathbb{E}[\Vert\hat a_\omega - \bar a_\omega\Vert^2_{\mathcal H}]$. The second bound can be proven in the same way as the first, using that, by definition, $\bar{\mathcal{L}}_{adj}$ is continuous and $\mu$-strongly convex in $a$, together with Assumption (a).

Upper-bound on $\mathbb{E}[\Vert\bar a_\omega - a^\star_\omega\Vert^2_{L_1}]$. We exploit the closed-form expressions of $\bar a_\omega$ and $a^\star_\omega$:
$$\bar a_\omega = -\hat H_\omega^{-1}\hat b_\omega, \qquad a^\star_\omega = -H_\omega^{-1}b_\omega,$$
where $\hat H_\omega := \nabla^2_h\mathcal{L}_{in}(\omega, \hat h_\omega)$, $\hat b_\omega := \nabla_h\mathcal{L}_{out}(\omega, \hat h_\omega)$, $H_\omega := \nabla^2_h\mathcal{L}_{in}(\omega, h^\star_\omega)$ and $b_\omega := \nabla_h\mathcal{L}_{out}(\omega, h^\star_\omega)$. By standard linear algebra, the difference $\bar a_\omega - a^\star_\omega$ can be expressed as:
$$\bar a_\omega - a^\star_\omega = H_\omega^{-1}\big(\hat H_\omega - H_\omega\big)\hat H_\omega^{-1}b_\omega + \hat H_\omega^{-1}\big(b_\omega - \hat b_\omega\big).$$
By taking the $L_1(P)$ norm of the above and using the upper-bounds from Lemma E.7, we can write:
$$\begin{aligned}
\Vert\bar a_\omega - a^\star_\omega\Vert_{L_1} &\le \big\Vert H_\omega^{-1}\big(\hat H_\omega - H_\omega\big)\hat H_\omega^{-1}b_\omega\big\Vert_{L_1} + \big\Vert\hat H_\omega^{-1}\big(b_\omega - \hat b_\omega\big)\big\Vert_{L_1}\\
&\le \mu^{-1}\big\Vert\big(\hat H_\omega - H_\omega\big)\hat H_\omega^{-1}b_\omega\big\Vert_{L_1} + \mu^{-1}\Vert b_\omega - \hat b_\omega\Vert_{L_1}\\
&\le \mu^{-1}C_1\Vert\hat h_\omega - h^\star_\omega\Vert_{\mathcal H}\Vert\hat H_\omega^{-1}b_\omega\Vert_{\mathcal H} + \mu^{-1}\Vert b_\omega - \hat b_\omega\Vert_{\mathcal H}\\
&\le \mu^{-2}C_1\Vert\hat h_\omega - h^\star_\omega\Vert_{\mathcal H}\Vert b_\omega\Vert_{\mathcal H} + \mu^{-1}C_4M\Vert\hat h_\omega - h^\star_\omega\Vert_{\mathcal H}\\
&\le \big(\mu^{-2}C_1B_3 + \mu^{-1}C_4M\big)\Vert\hat h_\omega - h^\star_\omega\Vert_{\mathcal H}.
\end{aligned}$$
In the first line we used the triangle inequality; the second line follows from Equation (53) in Lemma E.7. The third line applies Equation (54) from Lemma E.7 to the first term and uses $\Vert b_\omega - \hat b_\omega\Vert_{L_1} \le \Vert b_\omega - \hat b_\omega\Vert_{\mathcal H}$ (by Cauchy-Schwarz) for the second. The fourth line uses $\Vert\hat H_\omega^{-1}b_\omega\Vert_{\mathcal H} \le \mu^{-1}\Vert b_\omega\Vert_{\mathcal H}$ for the first term, since $\Vert\hat H_\omega^{-1}\Vert_{op}\le\mu^{-1}$ by Assumption (b), and uses Equation (55) from Lemma E.7 for the second term.
The final bound is obtained using Assumption (e) to upper-bound $\Vert b_\omega\Vert_{\mathcal H}$ by $B_3$. By taking the expectation w.r.t. $\hat h_\omega$ and using Equation (52), we get:
$$\mathbb{E}\big[\Vert\bar a_\omega - a^\star_\omega\Vert^2_{L_1}\big] \le \big(\mu^{-2}C_1B_3 + \mu^{-1}C_4M\big)^2\,\mathbb{E}\big[\Vert\hat h_\omega - h^\star_\omega\Vert^2\big] \le 2\big(\mu^{-2}C_1B_3 + \mu^{-1}C_4M\big)^2\mu^{-1}\epsilon_{in}.$$

Upper-bound on $\mathbb{E}[\Vert\bar a_\omega\Vert^2_{\mathcal H}]$. We use the closed-form expression of $\bar a_\omega$:
$$\Vert\bar a_\omega\Vert_{\mathcal H} = \Vert\hat H_\omega^{-1}\hat b_\omega\Vert_{\mathcal H} \le \mu^{-1}\Vert\hat b_\omega\Vert_{\mathcal H} \le \mu^{-1}\big(\Vert\hat b_\omega - b_\omega\Vert_{\mathcal H} + \Vert b_\omega\Vert_{\mathcal H}\big) \le \mu^{-1}\big(C_4M\Vert\hat h_\omega - h^\star_\omega\Vert_{\mathcal H} + B_3\big),$$
where the first inequality uses that $\Vert\hat H_\omega^{-1}\Vert_{op}\le\mu^{-1}$ by Assumption (b), the second follows by the triangle inequality, and the last uses Equation (55) from Lemma E.7 for the first term and Assumption (e) for the second. By squaring the above bound and taking the expectation w.r.t. $\hat h_\omega$, we get:
$$\mathbb{E}\big[\Vert\bar a_\omega\Vert^2_{\mathcal H}\big] \le \mathbb{E}\Big[\big(\mu^{-1}C_4M\Vert\hat h_\omega - h^\star_\omega\Vert + \mu^{-1}B_3\big)^2\Big] \le 2\mu^{-2}C_4^2M^2\,\mathbb{E}\big[\Vert\hat h_\omega - h^\star_\omega\Vert^2\big] + 2\mu^{-2}B_3^2 \le 4\mu^{-3}C_4^2M^2\epsilon_{in} + 2\mu^{-2}B_3^2,$$
where the last inequality uses Equation (52).

Upper-bound on $\Vert a^\star_\omega\Vert_{\mathcal H}$. Using the closed-form expression of $a^\star_\omega$, it holds that:
$$\Vert a^\star_\omega\Vert_{\mathcal H} = \Vert H_\omega^{-1}b_\omega\Vert_{\mathcal H} \le \mu^{-1}B_3,$$
where we used that $\Vert H_\omega^{-1}\Vert_{op}\le\mu^{-1}$ by Assumption (b) and that $\Vert b_\omega\Vert_{\mathcal H}\le B_3$ by Assumption (e).

Lemma E.7. Consider the operators $H_\omega, \hat H_\omega$ and the vectors $b_\omega, \hat b_\omega$ defined in Lemma E.6. Under Assumptions (b), (c) and (f) and Assumptions (A) to (L), the following holds for any $(\omega, s)\in\Omega\times\mathcal{H}$:
$$\Vert H_\omega^{-1}s\Vert_{L_1} \le \mu^{-1}\Vert s\Vert_{L_1}, \qquad \Vert\hat H_\omega^{-1}s\Vert_{L_1} \le \mu^{-1}\Vert s\Vert_{L_1}, \tag{53}$$
$$\Vert(\hat H_\omega - H_\omega)s\Vert_{L_1} \le C_1\Vert\hat h_\omega - h^\star_\omega\Vert_{\mathcal H}\Vert s\Vert_{\mathcal H}, \tag{54}$$
$$\Vert b_\omega - \hat b_\omega\Vert_{\mathcal H} \le C_4M\Vert\hat h_\omega - h^\star_\omega\Vert_{\mathcal H}. \tag{55}$$

Proof. We show each of the upper bounds separately.

Upper-bound on $\Vert H_\omega^{-1}s\Vert_{L_1}$ and $\Vert\hat H_\omega^{-1}s\Vert_{L_1}$. Using the strong convexity Assumption (b), we have the following inequality for the Hessian operator $H_\omega$ acting on a function $c\in\mathcal{H}\subset L_1(P)$:
$$\Vert H_\omega c\Vert_{L_1} = \mathbb{E}_P\big[\Vert\mathbb{E}_P[\nabla^2_v\ell_{in}(\omega, h^\star_\omega(x), x, y)|x]\,c(x)\Vert\big] \ge \mu\Vert c\Vert_{L_1}.$$
By positive-definiteness of $H_\omega$, we can take $c = H_\omega^{-1}s$ for some $s\in\mathcal{H}$, which yields $\Vert H_\omega^{-1}s\Vert_{L_1} \le \mu^{-1}\Vert s\Vert_{L_1}$. By the same arguments, the above bound applies to $\Vert\hat H_\omega^{-1}s\Vert_{L_1}$.

Upper-bound on $\Vert(\hat H_\omega - H_\omega)s\Vert_{L_1}$. We can express the operator $\hat H_\omega - H_\omega$ acting on some $s\in\mathcal{H}$ as follows:
$$(\hat H_\omega - H_\omega)s = \Big(x\mapsto\mathbb{E}_P\big[\nabla^2_v\ell_{in}(\omega, \hat h_\omega(x), x, y) - \nabla^2_v\ell_{in}(\omega, h^\star_\omega(x), x, y)\,\big|\,x\big]\,s(x)\Big).$$
Using Assumption (c), we upper-bound the $L_1$ norm of the above quantity as follows:
$$\Vert(\hat H_\omega - H_\omega)s\Vert_{L_1} \le C_1\,\mathbb{E}_P\big[\Vert\hat h_\omega(x) - h^\star_\omega(x)\Vert\,\Vert s(x)\Vert\big] \le C_1\Vert\hat h_\omega - h^\star_\omega\Vert_{\mathcal H}\Vert s\Vert_{\mathcal H},$$
where we used the Cauchy-Schwarz inequality to get the last inequality.

Upper-bound on $\Vert b_\omega - \hat b_\omega\Vert_{\mathcal H}$. Using Lemma D.3 to get expressions for $b_\omega$ and $\hat b_\omega$, we obtain the following upper-bound:
$$\Vert b_\omega - \hat b_\omega\Vert^2_{\mathcal H} = \mathbb{E}_P\Big[\big\Vert r(x)\,\mathbb{E}_Q\big[\nabla_v\ell_{out}(\omega, h^\star_\omega(x), x, y) - \nabla_v\ell_{out}(\omega, \hat h_\omega(x), x, y)\,\big|\,x\big]\big\Vert^2\Big] \le C_4^2M^2\Vert\hat h_\omega - h^\star_\omega\Vert^2_{\mathcal H},$$
where we used that the density $r(x)$ is upper-bounded by a positive constant $M$ (Assumption (B)) and that $\nabla_v\ell_{out}$ is $C_4$-Lipschitz in its second argument (Assumption (f)).

F Connection with Parametric Implicit Differentiation

F.1 Parametric approximation of the functional bilevel problem

In this section, we approximate the functional problem in Equation (FBO) with a parametric bilevel problem where the inner-level functions are parameterized as $h(x) = \tau(\theta)(x)$ with parameters $\theta$. Here, the inner-level variable is $\theta$ instead of the function $h$. Standard bilevel optimization algorithms like AID can be applied, which involve differentiating twice with respect to the parametric model. However, for models like deep neural networks, the inner objective may not be strongly convex in $\theta$, leading to a non-positive or degenerate Hessian (Proposition F.1). This can cause numerical instabilities and divergence from the gradient in Equation (4) (Proposition F.2), especially when using AID, which relies on solving a quadratic problem defined by the Hessian. If the model admits multiple solutions, the Hessian may be degenerate, making the implicit function theorem inapplicable.
In contrast, functional implicit differentiation requires solving a positive definite quadratic problem in $\mathcal{H}$ to find an adjoint function $a^\star_\omega$, ensuring a solution even when $h^\star_\omega$ is sub-optimal, thanks to the strong convexity of $\mathcal{L}_{in}(\omega, h)$. This stability with sub-optimal solutions is crucial for practical algorithms like the one in Section 3, where the optimal prediction function is approximated within a parametric family, such as neural networks. Formally, to establish a connection with parametric implicit differentiation, let us consider $\tau : \Theta\to\mathcal{H}$, a map from a finite-dimensional set of parameters $\Theta$ to the functional Hilbert space $\mathcal{H}$, and define a parametric version of the outer and inner objectives in Equation (FBO) restricted to functions in $\mathcal{H}_\Theta := \{\tau(\theta)\,|\,\theta\in\Theta\}$:
$$G_{out}(\omega, \theta) := \mathcal{L}_{out}(\omega, \tau(\theta)), \qquad G_{in}(\omega, \theta) := \mathcal{L}_{in}(\omega, \tau(\theta)). \tag{56}$$
The map $\tau$ can typically be a neural network parameterization and allows one to obtain a more tractable approximation to the abstract solution $h^\star_\omega$ in $\mathcal{H}$, where the function space $\mathcal{H}$ is often too large to perform optimization. This is typically the case when $\mathcal{H}$ is an $L_2$-space of functions, as we discuss in more detail in Section 3. When $\mathcal{H}$ is a reproducing kernel Hilbert space (RKHS), $\tau$ may also correspond to the Nyström approximation [Williams and Seeger, 2000], which performs the optimization on a finite-dimensional subspace of an RKHS spanned by a few data points. The corresponding parametric version of the problem (FBO) is then formally defined as:
$$\min_{\omega\in\Omega}\; G_{tot}(\omega) := G_{out}(\omega, \theta^\star_\omega) \quad \text{s.t.} \quad \theta^\star_\omega\in\arg\min_{\theta\in\Theta} G_{in}(\omega, \theta). \tag{PBO}$$
The resulting bilevel problem in Equation (PBO) often arises in machine learning but is generally ambiguously defined without further assumptions on the map $\tau$, as the inner-level problem might admit multiple solutions [Arbel and Mairal, 2022b]. Under the assumption that $\tau$ is twice continuously differentiable and the rather strong assumption that the parametric Hessian $\nabla^2_\theta G_{in}(\omega, \theta^\star_\omega)$ is invertible for a given $\omega$, the expression for the total gradient $\nabla_\omega G_{tot}(\omega)$ follows by direct application of the parametric implicit function theorem [Pedregosa, 2016]:
$$\nabla_\omega G_{tot}(\omega) = \nabla_\omega G_{out}(\omega, \theta^\star_\omega) + \nabla_{\omega,\theta}G_{in}(\omega, \theta^\star_\omega)\,u^\star_\omega, \qquad u^\star_\omega = -\nabla^2_\theta G_{in}(\omega, \theta^\star_\omega)^{-1}\nabla_\theta G_{out}(\omega, \theta^\star_\omega), \tag{57}$$
where $u^\star_\omega$ is the adjoint vector in $\Theta$. Without further assumptions, the expression of the gradient in Equation (57) is generally different from the one obtained in Proposition 2.2 using the functional point of view. Nevertheless, a precise connection between the functional and parametric implicit gradients can be obtained under expressiveness assumptions on the parameterization $\tau$, as discussed in the next two propositions.

Proposition F.1. Under the same assumptions as in Proposition 2.2, and assuming that $\tau$ is twice continuously differentiable, the following expression holds for any $(\omega, \theta)\in\Omega\times\Theta$:
$$\nabla^2_\theta G_{in}(\omega, \theta) = \nabla_\theta\tau(\theta)\,\nabla^2_h\mathcal{L}_{in}(\omega, \tau(\theta))\,\nabla_\theta\tau(\theta)^* + \nabla^2_\theta\tau(\theta)\big[\nabla_h\mathcal{L}_{in}(\omega, \tau(\theta))\big], \tag{58}$$
where $\nabla_\theta\tau(\theta) : \mathcal{H}\to\Theta$ denotes the adjoint of the differential of $\tau$, and $\nabla^2_\theta\tau(\theta)$ is a linear operator measuring the distortion induced by the parameterization, which acts on functions in $\mathcal{H}$ by mapping them to $p\times p$ matrices, with $p$ the dimension of the parameter space $\Theta$. If, in addition, $\tau$ is expressive enough so that $\tau(\theta^\star_\omega) = h^\star_\omega$, then the above expression simplifies to:
$$\nabla^2_\theta G_{in}(\omega, \theta^\star_\omega) = \nabla_\theta\tau(\theta^\star_\omega)\,C_\omega\,\nabla_\theta\tau(\theta^\star_\omega)^*. \tag{59}$$
Proposition F.1 follows by direct application of the chain rule, noting that the distortion term on the right-hand side of (58) vanishes when $\theta = \theta^\star_\omega$, since $\nabla_h\mathcal{L}_{in}(\omega, \tau(\theta^\star_\omega)) = \nabla_h\mathcal{L}_{in}(\omega, h^\star_\omega) = 0$ by optimality of $h^\star_\omega$.
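The decomposition in Equation (58) can be checked numerically. The following toy JAX sketch does so in a finite-dimensional setting where $\mathcal{H} = \mathbb{R}^m$ (functions evaluated on $m$ grid points); the specific two-parameter map $\tau$ and quadratic functional objective are illustrative assumptions made only for this check.

```python
# Numerical check of Equation (58) for a toy parameterization, assuming
# H = R^m (functions on m grid points). All concrete choices are
# illustrative assumptions.
import jax
import jax.numpy as jnp

m = 4
xs = jnp.linspace(-1.0, 1.0, m)
def tau(theta):                       # parameterization Theta -> H = R^m
    return theta[0] * jnp.tanh(theta[1] * xs)

def L_in(h):                          # strongly convex functional objective
    return 0.5 * h @ h - h @ jnp.ones(m)

G_in = lambda theta: L_in(tau(theta))
theta = jnp.array([0.7, -0.4])

J = jax.jacfwd(tau)(theta).T          # (2, m): adjoint Jacobian of tau
C = jax.hessian(L_in)(tau(theta))     # (m, m): functional Hessian
g = jax.grad(L_in)(tau(theta))        # functional gradient, held fixed
distortion = jax.hessian(lambda t: tau(t) @ g)(theta)  # second-order term
lhs = jax.hessian(G_in)(theta)        # parametric Hessian of G_in
print(jnp.allclose(lhs, J @ C @ J.T + distortion, atol=1e-5))  # True
```

Away from an inner optimum, `g` is non-zero and the `distortion` term can make the parametric Hessian indefinite, which is precisely the failure mode discussed next.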
A consequence is that, for an optimal parameter $\theta^\star_\omega$, the parametric Hessian is necessarily symmetric positive semi-definite. However, for an arbitrary parameter $\theta$, the distortion term does not vanish in general, making the Hessian possibly non-positive. This can result in numerical instability when using algorithms such as AID, for which an adjoint vector is obtained by solving a quadratic problem defined by the Hessian matrix $\nabla^2_\theta G_{in}$ evaluated at approximate minimizers of the inner-level problem. Moreover, if the model admits multiple solutions $\theta^\star_\omega$, the Hessian is likely to be degenerate, making the implicit function theorem inapplicable and the bilevel problem in Equation (PBO) ambiguously defined.¹ On the other hand, functional implicit differentiation requires finding an adjoint function $a^\star_\omega$ by solving a positive definite quadratic problem in $\mathcal{H}$, which is always guaranteed to have a solution, even when the inner-level prediction function is only approximately optimal.

Proposition F.2. Assume that $\tau$ is twice continuously differentiable, that for a fixed $\omega\in\Omega$ we have $\tau(\theta^\star_\omega) = h^\star_\omega$, and that $J_\omega := \nabla_\theta\tau(\theta^\star_\omega)$ has full rank. Then, under the same assumptions as in Proposition 2.2, $\nabla_\omega G_{tot}(\omega)$ is given by:
$$\nabla_\omega G_{tot}(\omega) = g_\omega + B_\omega P_\omega a^\star_\omega, \tag{60}$$
where $P_\omega : \mathcal{H}\to\mathcal{H}$ is a projection operator of rank $\dim(\Theta)$. If, in addition, the equality $\tau(\theta^\star_{\omega'}) = h^\star_{\omega'}$ holds for all $\omega'$ in a neighborhood of $\omega$, then $\nabla_\omega G_{tot}(\omega) = \nabla F(\omega) = g_\omega + B_\omega a^\star_\omega$.

Proposition F.2, which is proven below, shows that, even when the parametric family is expressive enough to recover the optimal prediction function $h^\star_\omega$ at a single value $\omega$, the expression of the total gradient in Equation (60) using parametric implicit differentiation might generally differ from the one obtained using its functional counterpart. Indeed, the projector $P_\omega$, which has rank equal to $\dim(\Theta)$, biases the adjoint function by projecting it onto a finite-dimensional space before applying the cross-derivative operator. Only under a much stronger assumption on $\tau$, requiring it to recover the optimal prediction function $h^\star_{\omega'}$ in a neighborhood of the outer-level variable $\omega$, do both parametric and functional implicit differentiation recover the same expression for the total gradient. In this case, the projection operator aligns with the cross-derivative operator, so that $B_\omega P_\omega = B_\omega$. Finally, note that the expressiveness assumptions on $\tau$ made in Propositions F.1 and F.2 are only used here to discuss the connection with the parametric implicit gradient and are not required by the method we introduce in Section 3.

Proof of Proposition F.2. Here we want to show the connection between the parametric gradient of the outer variable, $\nabla_\omega G_{tot}(\omega)$, usually used in approximate implicit differentiation methods, and the functional gradient $\nabla F(\omega)$ derived from the functional bilevel problem definition in Equation (FBO). Recall the definition of the parametric inner objective $G_{in}(\omega, \theta) := \mathcal{L}_{in}(\omega, \tau(\theta))$. According to Proposition F.1, we have the following relation:
$$\nabla^2_\theta G_{in}(\omega, \theta^\star_\omega) = J_\omega C_\omega J_\omega^*, \qquad \text{with } J_\omega := \nabla_\theta\tau(\theta^\star_\omega).$$
By assumption, $J_\omega$ has full rank, which matches the dimension of the parameter space $\Theta$. Recall from the assumptions of Theorem 2.1 that the Hessian operator $C_\omega$ is positive definite by the strong convexity of the inner objective $\mathcal{L}_{in}$ in its second argument. We deduce that $\nabla^2_\theta G_{in}(\omega, \theta^\star_\omega)$ must be invertible since, by construction, the dimension of $\Theta$ is smaller than that of the Hilbert space $\mathcal{H}$, which is possibly infinite-dimensional. Recall from Theorem 2.1 that $B_\omega := \nabla_{\omega,h}\mathcal{L}_{in}(\omega, h^\star_\omega)$, and recall the assumption that $\tau(\theta^\star_\omega) = h^\star_\omega$.
We apply the parametric implicit function theorem to get the following expression for the Jacobian $\nabla_\omega\theta^\star_\omega$:
$$\nabla_\omega\theta^\star_\omega = -B_\omega J_\omega^\star \left(J_\omega C_\omega J_\omega^\star\right)^{-1}.$$
Hence, differentiating the total objective $G_{tot}(\omega) := G_{out}(\omega, \theta^\star_\omega) = L_{out}(\omega, \tau(\theta^\star_\omega))$ and applying the chain rule directly yields:
$$\nabla_\omega G_{tot}(\omega) = g_\omega - B_\omega J_\omega^\star \left(J_\omega C_\omega J_\omega^\star\right)^{-1} J_\omega d_\omega, \quad (61)$$
with the previously defined $g_\omega := \nabla_\omega L_{out}(\omega, h^\star_\omega)$ and $d_\omega := \nabla_h L_{out}(\omega, h^\star_\omega)$. We now introduce the operator $P_\omega := J_\omega^\star \left(J_\omega C_\omega J_\omega^\star\right)^{-1} J_\omega C_\omega$. The operator $P_\omega$ is a projector, as it satisfies $P_\omega^2 = P_\omega$. Hence, using the fact that the Hessian operator is invertible, and recalling that the adjoint function is given by $a^\star_\omega = -C_\omega^{-1} d_\omega$, we directly get from Equation (61) that:
$$\nabla_\omega G_{tot}(\omega) = g_\omega + B_\omega P_\omega a^\star_\omega.$$
If we further assume that $\tau(\theta^\star_{\omega'}) = h^\star_{\omega'}$ holds for all $\omega'$ in a neighborhood of $\omega$, then differentiating this identity with respect to $\omega$ yields $\nabla_\omega\theta^\star_\omega J_\omega = \nabla_\omega h^\star_\omega$. Using the expression of $\nabla_\omega h^\star_\omega$ from Equation (3), we obtain the identity $\nabla_\omega\theta^\star_\omega J_\omega C_\omega = -B_\omega$. In other words, $B_\omega$ is of the form $B_\omega = D J_\omega C_\omega$ for some finite dimensional matrix $D$ of size $\dim(\Omega) \times \dim(\Theta)$. Recalling the expression of the total gradient, we can deduce the equality between parametric and functional gradients:
$$\begin{aligned}
\nabla_\omega G_{tot}(\omega) &= g_\omega - B_\omega J_\omega^\star \left(J_\omega C_\omega J_\omega^\star\right)^{-1} J_\omega d_\omega \\
&= g_\omega - D J_\omega C_\omega J_\omega^\star \left(J_\omega C_\omega J_\omega^\star\right)^{-1} J_\omega d_\omega \\
&= g_\omega - D J_\omega d_\omega \\
&= g_\omega - D J_\omega C_\omega C_\omega^{-1} d_\omega \\
&= g_\omega + B_\omega a^\star_\omega = \nabla F(\omega).
\end{aligned}$$
The first equality is the general expression of the total gradient $\nabla_\omega G_{tot}(\omega)$. The second line uses the expression of $B_\omega$, which allows the simplification in the third line. Then, recalling that the Hessian operator $C_\omega$ is invertible, we obtain the fourth line. Finally, the result follows by using the expression of $B_\omega$ once more and recalling the definition of the adjoint function $a^\star_\omega$.

F.2 Computational Cost and Scalability

The optimization of the prediction function $\hat h_\omega$ in the inner-level optimization loop is similar to AID, although the total gradient computation differs significantly. Unlike AID, Algorithm 1 does not require differentiating through the parameters of the prediction model when estimating the total gradient $\nabla F(\omega)$. This property results in an improved cost in time and memory in most practical cases, as shown in Table 1 and Figure 4. More precisely, AID requires computing Hessian-vector products of size $p_{in}$, which corresponds to the number of hidden layer weights of the neural network $\hat h_\omega$, while FuncID only requires Hessian-vector products of size $d_v$, i.e., the output dimension of $\hat h_\omega$. In many practical cases, the network's parameter dimension $p_{in}$ is much larger than its output size $d_v$, which results in considerable memory savings when using FuncID rather than AID, as shown in Figure 4 (left). Furthermore, unlike for AID, the overhead of evaluating Hessian-vector products in FuncID is not affected by the time cost of evaluating the prediction network. When $\hat h_\omega$ is a deep network, this overhead increases significantly with the network size, making AID significantly slower (Figure 4 (right)).

Method | Time cost | Memory cost
AID | $\gamma(T_{L_{in}} + T_h)$ | $\beta p_{in} + M_h$
FuncID | $\gamma T_{L_{in}} + (2 + \delta)T_a + T_h$ | $\beta d_v + M_a$

Table 1: Cost in time and memory for performing a single total gradient estimation using either AID or FuncID, assuming the prediction model has already been learned. Time cost: $T_h$ and $T_a$ denote the time cost of evaluating the prediction and adjoint models $h$ and $a$, while $T_{L_{in}}$ is the time cost for evaluating the inner objective once the outputs of $h$ are computed. The factors $\gamma$ and $\delta$ are multiplicative overheads for evaluating Hessian-vector products and gradients. Memory cost: $M_h$ and $M_a$ denote the memory cost of storing the intermediate outputs of $h$ and $a$; $p_{in}$ and $d_v$ are the memory costs of storing the Hessian-vector product for AID and FuncID respectively, and $\beta$ is a multiplicative constant that depends on the particular implementation.
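As an illustration of the cost difference summarized in Table 1, the following JAX sketch contrasts the two kinds of Hessian-vector products; shapes and function names are illustrative assumptions, not taken from the released code.

```python
# Sketch of the two kinds of Hessian-vector products compared in Table 1:
# AID needs HVPs with respect to all p_in network parameters, whereas
# FuncID only needs HVPs of the point-wise loss with respect to the
# d_v-dimensional outputs.
import jax
import jax.numpy as jnp

def hvp(f, x, v):
    # Hessian-vector product of a scalar function f at x in direction v.
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

X = jnp.ones((128, 10))                  # batch of inputs
Y = jnp.zeros((128, 2))                  # targets, so d_v = 2

def net(params, x):
    return jnp.tanh(x @ params["W"]) @ params["V"]

params = {"W": jnp.ones((10, 64)), "V": jnp.ones((64, 2))}  # p_in = 768

# AID-style HVP: differentiates the loss twice through the network parameters.
loss_params = lambda p: 0.5 * jnp.mean((net(p, X) - Y) ** 2)
v = jax.tree_util.tree_map(jnp.ones_like, params)
hvp_aid = hvp(loss_params, params, v)            # pytree of total size p_in

# FuncID-style HVP: differentiates the loss through the outputs only.
loss_outputs = lambda out: 0.5 * jnp.mean((out - Y) ** 2)
out = net(params, X)
hvp_funcid = hvp(loss_outputs, out, jnp.ones_like(out))  # batch x d_v array
```

The AID product propagates second-order information through every layer of $\hat h_\omega$, while the FuncID product only touches the point-wise inner loss, which is what yields the $p_{in}$ versus $d_v$ scaling in Table 1.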
Figure 4: Memory and time comparison of a single total gradient approximation using FuncID vs AID. (Left) Memory usage ratio of FuncID over AID vs inner model parameter dimension $p_{in}$, for various values of the output dimension $d_v$. (Right) Time ratio of FuncID over AID vs inner model parameter dimension $p_{in}$, averaged over several values of $d_v$ and $10^4$ evaluations. The continuous lines are experimental results obtained using a JAX implementation [Bradbury et al., 2018] running on a GPU. The dashed lines correspond to theoretical estimates obtained using the algorithmic costs given in Table 1 with $\gamma = 12$, $\delta = 2$ for time, and the constant factors in the memory cost fitted to the data.

G Additional Details about 2SLS Experiments

We closely follow the experimental setting of the state-of-the-art method DFIV [Xu et al., 2021a]. The goal of this experiment is to learn a model $f_\omega$ approximating the structural function $f_{struct}$ that accurately describes the effect of the treatment $t$ on the outcome $o$ with the help of an instrument $x$, as illustrated in Figure 5.

Figure 5: The causal relationships between all variables in an Instrumental Variable (IV) causal graph, where $t$ is the treatment variable (dsprites image), $o$ is the outcome (label in $\mathbb{R}$), $x$ is the instrument, and $\epsilon$ is the unobserved confounder.

G.1 Dsprites data

We follow the exact same data generation procedure as in Xu et al. [2021a, Appendix E.3]. From the dsprites dataset [Matthey et al., 2017], we generate the treatment $t$ and outcome $o$ as follows:
1. Uniformly sample the latent parameters scale, rotation, posX, posY from dsprites.
2. Generate the treatment variable $t$ as $t = \mathrm{Fig}(\text{scale}, \text{rotation}, \text{posX}, \text{posY}) + \lambda$.
3. Generate the outcome variable $o$ as $o = \frac{\|At\|_2^2 - 5000}{1000} + 32(\text{posY} - 0.5) + \varepsilon$.
Here, the function Fig returns the image corresponding to the latent parameters, and $\lambda, \varepsilon$ are noise variables generated from $\lambda \sim \mathcal{N}(0.0,\, 0.1 I)$ and $\varepsilon \sim \mathcal{N}(0.0,\, 0.5)$. Each element of the matrix $A \in \mathbb{R}^{10 \times 4096}$ is generated from $\mathrm{Unif}(0.0, 1.0)$ and fixed throughout the experiment. From the data generation process, we can see that $t$ and $o$ are confounded by posY. We use the instrumental variable $x = (\text{scale}, \text{rotation}, \text{posX}) \in \mathbb{R}^3$, and figures with random noise as the treatment variable $t$. The variable posY is not revealed to the model, and there is no observable confounder. The structural function for this setting is $f_{struct}(t) = \frac{\|At\|_2^2 - 5000}{1000}$. Test data points are generated from grid points of the latent variables. The grid consists of 7 evenly spaced values for posX and posY, 3 evenly spaced values for scale, and 4 evenly spaced values for orientation.
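To fix ideas, here is a sketch of the generation process described above, with jax.numpy standing in for the actual data pipeline; `generate` and its arguments are hypothetical names, `images` stands in for the flattened 4096-dimensional dsprites figures, and we interpret the second argument of $\mathcal{N}(\cdot, \cdot)$ as a variance (an assumption).

```python
# Sketch of the dsprites treatment/outcome generation (illustrative names;
# `images` replaces the actual dsprites loader).
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
k_A, k_lam, k_eps, k_img = jax.random.split(key, 4)
A = jax.random.uniform(k_A, (10, 4096))  # fixed throughout the experiment

def generate(images, pos_y, k_lam, k_eps):
    """images: (n, 4096) flattened dsprites figures; pos_y: (n,) latents."""
    lam = jnp.sqrt(0.1) * jax.random.normal(k_lam, images.shape)   # lambda ~ N(0, 0.1 I)
    eps = jnp.sqrt(0.5) * jax.random.normal(k_eps, pos_y.shape)    # eps ~ N(0, 0.5)
    t = images + lam                                               # treatment
    f_struct = (jnp.sum((t @ A.T) ** 2, axis=1) - 5000.0) / 1000.0
    o = f_struct + 32.0 * (pos_y - 0.5) + eps                      # outcome
    return t, o

# Example with placeholder images in lieu of real dsprites figures:
images = jax.random.uniform(k_img, (8, 4096))
pos_y = jnp.linspace(0.0, 1.0, 8)
t, o = generate(images, pos_y, k_lam, k_eps)
```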
G.2 Experimental details

All results are reported as averages over 20 runs with different seeds on 24GB NVIDIA RTX A5000 GPUs.

Feature maps. As in the DFIV setting, we approximate the true structural function $f_{struct}$ with $f_\omega = u^\top \psi_\chi(t)$, where $\psi_\chi$ is a feature map of the treatment $t$, $u$ is a vector in $\mathbb{R}^{d_2}$, and $f_\omega$ is parameterized by $\omega = (u, \chi)$. To solve the inner problem of the bilevel formulation in Section 4.1, the inner prediction function $h_\omega$ is optimized over functions of the form $h(x) = V\phi(x)$, where $\phi$ denotes the feature map of the instrument $x$ and $V$ is a matrix in $\mathbb{R}^{d_2 \times d_1}$. The feature maps $\psi_\chi$ and $\phi$ are neural networks (Table 2) that are optimized using the empirical objectives from Section 3.1 and the synthetic dsprites data; the linear weights $V$ and $u$ are fitted exactly at each iteration.

Choice of the adjoint function in FuncID. In the dsprites experiment, we call linear FuncID the functional implicit differentiation method with a linear choice of the adjoint function. Linear FuncID uses an adjoint function of the form $a^\star_\omega(x) = W\phi(x)$ with $W \in \mathbb{R}^{d_2 \times d_1}$. In other words, to find $a^\star_\omega$, the features $\phi$ are fixed and only the optimal linear weight $W$ is computed in closed form (a sketch of such closed-form fits is given after Table 2). In the FuncID method, the adjoint function lives in the same function space as $h_\omega$; this is achieved by approximating $a^\star_\omega$ with a separate neural network with the same architecture as $h_\omega$.

Layer | Instrument feature map $\phi$ | Treatment feature map $\psi_\chi$
1 | Input(x) | Input(t)
2 | FC(3, 256), SN, ReLU | FC(4096, 1024), SN, ReLU
3 | FC(256, 128), SN, ReLU, LN | FC(1024, 512), SN, ReLU, LN
4 | FC(128, 128), SN, ReLU, LN | FC(512, 128), SN, ReLU
5 | FC(128, 32), SN, LN, ReLU | FC(128, 32), SN, LN, Tanh

Table 2: Neural network architectures used in the dsprites experiment for all models. The FuncID model has an extra fully-connected layer FC(32, 1) in both networks. LN corresponds to LayerNorm and SN to SpectralNorm.
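The closed-form fits mentioned above amount to (regularized) linear least-squares problems on fixed features. Below is a minimal sketch, assuming a ridge penalty `lam` for numerical stability and mean-squared losses; the regularizer and all names are illustrative assumptions, not the paper's exact choices.

```python
# Closed-form fit of a linear head on fixed features, as used for the
# stage-1 weights V and the linear FuncID adjoint weights W.
import jax.numpy as jnp

def fit_linear_head(Phi, T, lam=1e-3):
    """Solve min_V mean ||T - Phi V^T||^2 + lam ||V||^2 in closed form.

    Phi: (n, d1) instrument features; T: (n, d2) regression targets.
    Returns V of shape (d2, d1)."""
    n, d1 = Phi.shape
    gram = Phi.T @ Phi / n + lam * jnp.eye(d1)   # regularized Gram matrix
    V_T = jnp.linalg.solve(gram, Phi.T @ T / n)  # (d1, d2) normal equations
    return V_T.T

# Usage: fit V so that h(x) = V phi(x) predicts the treatment features,
# e.g. V = fit_linear_head(phi(X), psi(T_treatment)).
```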
Hyper-parameter tuning. As in the setup of DFIV, for training all methods we use 100 outer iterations ($N$ in Algorithm 1) and 20 inner iterations ($M$ in Algorithm 1) per outer iteration, with full batches. We select the hyper-parameters based on the best validation loss, which we obtain using a validation set with instances of all three variables $(t, o, x)$ [Xu et al., 2021a, Appendix A]. Because of the number of linear solvers, the grid search performed for AID is very large, so we only run it with one seed. For the other methods, we run the grid search on 4 different seeds and take the configuration with the best average validation loss. For the hyper-parameters that are not tuned, we take the values reported in Xu et al. [2021a].

Deep Feature Instrumental Variable Regression: All DFIV hyper-parameters are set to the best ones reported in Xu et al. [2021a].

Approximate Implicit Differentiation: We perform a grid search over 5 linear solvers (two variants of gradient descent, two variants of conjugate gradient, and an identity heuristic solver), linear solver learning rate $10^{-n}$ with $n \in \{3, 4, 5\}$, linear solver number of iterations in $\{2, 10, 20\}$, inner optimizer learning rate $10^{-n}$ with $n \in \{2, 3, 4\}$, inner optimizer weight decay $10^{-n}$ with $n \in \{1, 2, 3\}$, and outer optimizer learning rate $10^{-n}$ with $n \in \{2, 3, 4\}$.

Iterative Differentiation: We perform a grid search over the number of unrolled inner iterations $\{2, 5\}$ (chosen because of memory constraints, since unrolling an iteration is memory-heavy), the number of warm-start inner iterations $\{18, 15\}$, inner optimizer learning rate $10^{-n}$ with $n \in \{2, 3, 4\}$, inner optimizer weight decay $10^{-n}$ with $n \in \{1, 2, 3\}$, and outer optimizer learning rate $10^{-n}$ with $n \in \{2, 3, 4\}$.

Gradient Penalty: The method is based on Eq. 5.1 in Shen and Chen [2023]. For this single-level method, we perform a grid search over the learning rate $10^{-n}$ with $n \in \{2, 3, 4, 5, 6\}$, weight decay $10^{-n}$ with $n \in \{1, 2, 3\}$, and the penalty weight $10^{-n}$ with $n \in \{0, 1, 2, 3\}$. Since the method has only a single optimization loop, we increase the number of total iterations to 2000, compared to the other methods (100 outer iterations with 20 inner iterations each).

Value Function Penalty: The method is based on Eq. 3.2 in Shen and Chen [2023]; we perform the same grid search as for the Gradient Penalty method. However, since this method has an inner loop, we perform 100 outer iterations and run a grid search over the number of inner iterations in $\{10, 20\}$.

FuncID: We perform a grid search over the number of iterations for learning the adjoint network $\{10, 20\}$, the adjoint optimizer learning rate $10^{-n}$ with $n \in \{2, 3, 4, 5, 6\}$, and the adjoint optimizer weight decay $10^{-n}$ with $n \in \{1, 2, 3\}$. The remaining parameters are the same as for DFIV, since the inner and outer models are almost identical to the treatment and instrumental networks used in their experiments.

G.3 Additional results

We run an additional experiment with 10k training points, using the same setting described above, to illustrate the effect of the sample size on the methods. Figure 6 shows that a similar conclusion can be drawn when increasing the training sample size from 5k to 10k, illustrating the robustness of the obtained results.

Figure 6: Performance metrics for Instrumental Variable (IV) regression. (Left) Final test loss. (Middle) Outer loss vs training iterations. (Right) Inner loss vs training iterations. All results are averaged over 20 runs with 10000 training samples and 588 test samples.

H Additional Details about Model-Based RL Experiments

H.1 Closed-form expression for the adjoint function

For the FuncID method, we exploit the structure of the adjoint objective to obtain a closed-form expression of the adjoint function $a^\star_\omega$. In the model-based RL setting, the unregularized adjoint objective has a simple expression of the form:
$$\hat L_{adj}(\omega, a, \hat h_\omega, B) = \frac{1}{2|B_{in}|} \sum_{(x,y) \in B_{in}} \|a(x)\|^2 \quad (62)$$
$$\qquad\qquad + \frac{1}{|B_{out}|} \sum_{(x,y) \in B_{out}} a(x)^\top \nabla_v f(\hat h_\omega(x), y). \quad (63)$$
The key observation here is that the same batches of data are used for both the inner and outer problems, i.e., $B_{in} = B_{out}$. Therefore, we only need to evaluate the function $a$ on the finite set of points $x$ with $(x, y) \in B_{in}$. Without restricting the solution set of $a$ or adding regularization to $\hat L_{adj}$, the optimal solution $a^\star_\omega$ simply matches $-\nabla_v f(\hat h_\omega(x), y)$ on the set of points $x$ such that $(x, y) \in B_{in}$. Our implementation directly exploits this observation and uses the following expression for the total gradient estimation:
$$-\frac{1}{|B_{in}|} \sum_{(x,y) \in B_{in}} \nabla_{\omega,v} f(\hat h_\omega(x), r_\omega(x), s_\omega(x))\, \nabla_v f(\hat h_\omega(x), y). \quad (64)$$
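The sketch below shows how such a closed-form estimator can be assembled with standard automatic differentiation, writing `ell` for the inner point-wise loss whose cross derivative appears in Equation (64) and `f` for the outer point-wise loss; this split, along with `reward_net`, `dynamics_net`, and all shapes, is an illustrative assumption rather than the released implementation.

```python
# Sketch of the closed-form total gradient estimator of Eq. (64)
# (hypothetical losses and model components).
import jax
import jax.numpy as jnp

def ell(v, r, s):
    # Inner point-wise loss, e.g. a squared Bellman-like residual.
    return 0.5 * jnp.sum((v - (r + 0.99 * s)) ** 2)

def f(v, y):
    # Outer point-wise loss on the prediction v with target y.
    return 0.5 * jnp.sum((v - y) ** 2)

def total_gradient(omega, h_hat, reward_net, dynamics_net, xs, ys):
    """Estimate -1/|B| sum grad_{omega,v} ell . grad_v f over the batch."""
    def per_sample(x, y):
        v = h_hat(x)                           # prediction, held constant
        a = -jax.grad(f, argnums=0)(v, y)      # closed-form adjoint value at x
        # Cross derivative: differentiate grad_v ell with respect to omega,
        # then contract with the adjoint direction a via a VJP.
        grad_v_ell = lambda w: jax.grad(ell, argnums=0)(
            v, reward_net(w, x), dynamics_net(w, x))
        _, vjp_fn = jax.vjp(grad_v_ell, omega)
        return vjp_fn(a)[0]
    grads = jax.vmap(per_sample)(xs, ys)
    return jax.tree_util.tree_map(lambda g: g.mean(axis=0), grads)
```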
H.2 Experimental details

As in the experiments of Nikishin et al. [2022], we use the CartPole environment with 2 actions, a 4-dimensional continuous state space, and optimal returns of 500. For evaluation, we use a separate copy of the environment. The reported return is an average over 10 runs with different seeds.

Networks. We use the same neural network architectures as in the CartPole experiment of Nikishin et al. [2022, Appendix D]. All networks have two hidden layers and ReLU activations. Both hidden layers in all networks have dimension 32. In the misspecified setting with limited model class capacity, we set the hidden layer dimension to 3 for the dynamics and reward networks.

Hyper-parameters. We perform 200000 environment steps (outer-level steps) and set the number of inner-level iterations to $M = 1$ for both OMD and FuncID. For MLE, we perform a single update to the state-value function for each update to the model. For training, we use a replay buffer with a batch size of 256 and set the discount factor $\gamma$ to 0.99. When sampling actions, we use a temperature parameter $\alpha = 0.01$, as in Nikishin et al. [2022]. The learning rate for the outer parameters $\omega$ is set to $10^{-3}$. For the learning rate of the inner neural network and the moving average coefficient $\tau$, we perform grid searches over $\{10^{-4}, 10^{-3}, 3 \times 10^{-3}\}$ and $\{5 \times 10^{-3}, 10^{-2}\}$ respectively, as in Nikishin et al. [2022].

H.3 Additional results

Time comparison. Figure 7 shows the average reward on the evaluation environment as a function of training time in seconds. We observe that our method is the fastest to reach its best performance in both the well-specified and misspecified settings.

Figure 7: Average reward on an evaluation environment vs. time in seconds on the CartPole task. (Left) Well-specified predictive model with 32 hidden units to capture the variability in the state dynamics. (Right) Misspecified predictive model with only 3 hidden units.

MDP model comparison. Figure 8 shows the average prediction error of the different methods during training. The differences in average prediction error between the bilevel approaches (OMD, FuncID) and MLE reflect their distinct optimization objectives and trade-offs. OMD and FuncID focus on maximizing performance in the task environment, while MLE emphasizes accurate representation of all aspects of the environment, which can lead to smaller prediction errors that do not necessarily correlate with superior evaluation performance. We also observe that FuncID has a stable prediction error in both settings, whereas OMD and MLE exhibit some instability.

Figure 8: Average MDP model prediction error in the training environment vs. inner optimization steps on the CartPole task. (Left) Well-specified predictive model with 32 hidden units to capture the variability in the state dynamics. (Right) Misspecified predictive model with only 3 hidden units.

NeurIPS Paper Checklist

1. Claims

Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: All the claims made in the paper are either rigorously proven, shown experimentally, or describe well-established facts in the literature.
Guidelines:
- The answer NA means that the abstract and introduction do not include the claims made in the paper.
- The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.
- The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.
- It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2. Limitations

Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: All strong assumptions used in the theoretical results are clearly stated, discussed, and put into perspective with other published work.
Guidelines:
- The answer NA means that the paper has no limitation, while the answer No means that the paper has limitations but they are not discussed in the paper.
- The authors are encouraged to create a separate "Limitations" section in their paper.
- The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.
- The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.
- The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.
- The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.
- If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.
- While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

3. Theory Assumptions and Proofs

Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: All theoretical claims are rigorously proven. We provide a structured list of all assumptions used before the statement of a result and refer to a specific assumption whenever it is employed in the proof.
Guidelines:
- The answer NA means that the paper does not include theoretical results.
- All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.
- All assumptions should be clearly stated or referenced in the statement of any theorems.
- The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.
- Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.
- Theorems and Lemmas that the proof relies upon should be properly referenced.

4. Experimental Result Reproducibility

Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: We provide the code to reproduce our experiments, which includes a README file with instructions. We also include the information about tuning, datasets, and other experimental details in the appendix.
Guidelines:
- The answer NA means that the paper does not include experiments.
- If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: making the paper reproducible is important, regardless of whether the code and data are provided or not.
- If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.
- Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.
- While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example:
  (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.
  (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.
  (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).
  (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

5. Open access to data and code

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: We provide an anonymized repository with extensively commented code and instructions on how to reproduce our main experiment.
Guidelines:
- The answer NA means that the paper does not include experiments requiring code.
- Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).
- The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.
- The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.
- At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).
- Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: Yes, we specify and discuss all experimental details in the appendix.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.
- The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes]
Justification: We run all of our experiments multiple times and report the error bars.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
- The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
- The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.).
- The assumptions made should be given (e.g., Normally distributed errors).
- It should be clear whether the error bar is the standard deviation or the standard error of the mean.
- It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
- For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g., negative error rates).
- If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: We report which GPUs we use for larger experiments and show how our method compares to similar methods in the section about computational cost.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage.
- The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
- The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

9. Code of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
Answer: [Yes]
Justification: We reviewed the NeurIPS Code of Ethics and attest that our work conforms with it.
Guidelines:
- The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
- If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
- The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [NA]
Justification: We present a new bilevel optimization method, which does not have any immediate societal impacts.
Guidelines:
- The answer NA means that there is no societal impact of the work performed.
- If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
- Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: Our new optimization method does not have an identifiable risk for misuse.
Guidelines:
- The answer NA means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models) used in the paper properly credited, and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: Our code includes an open access license in the README file.
Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets

Question: Are new assets introduced in the paper well documented, and is the documentation provided alongside the assets?
Answer: [NA]
Justification: We do not present new assets in this work.
Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: We do not have crowdsourcing nor human subjects in our work.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.
15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: We do not have human subjects in our work.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.