Published as a conference paper at ICLR 2024

NEAR-OPTIMAL SOLUTIONS OF CONSTRAINED LEARNING PROBLEMS

Juan Elenter, University of Pennsylvania
Luiz F. O. Chamon, University of Stuttgart
Alejandro Ribeiro, University of Pennsylvania
Corresponding author: elenter@seas.upenn.edu

ABSTRACT

With the widespread adoption of machine learning systems, the need to curtail their behavior has become increasingly apparent. This is evidenced by recent advancements towards developing models that satisfy robustness, safety, and fairness requirements. These requirements can be imposed (with generalization guarantees) by formulating constrained learning problems that can then be tackled by dual ascent algorithms. Yet, though these algorithms converge in objective value, even in non-convex settings, they cannot guarantee that their outcome is feasible. Doing so requires randomizing over all iterates, which is impractical in virtually all modern applications. Still, final iterates have been observed to perform well in practice. In this work, we address this gap between theory and practice by characterizing the constraint violation of Lagrangian minimizers associated with optimal dual variables, despite lack of convexity. To do this, we leverage the fact that non-convex, finite-dimensional constrained learning problems can be seen as parametrizations of convex, functional problems. Our results show that rich parametrizations effectively mitigate the issue of feasibility in dual methods, shedding light on prior empirical successes of dual learning. We illustrate our findings in fair learning tasks.

1 INTRODUCTION

Machine learning (ML) has become a core technology of information systems, reaching critical applications from medical diagnostics (Engelhard et al., 2023) to autonomous driving (Kiran et al., 2021). Consequently, it has become paramount to develop ML systems that not only excel at their main task, but also adhere to requirements such as fairness and robustness. Since virtually all ML models are trained using empirical risk minimization (ERM) (Vapnik, 1999), a natural way to impose requirements is to explicitly add constraints to these optimization problems (Cotter et al., 2018; Chamon & Ribeiro, 2020; Velloso & Van Hentenryck, 2020; Fioretto et al., 2021; Chamon et al., 2023). Recent works (Chamon & Ribeiro, 2020; Chamon et al., 2023) have shown that, from a probably approximately correct (PAC) perspective, constrained learning is essentially as hard as classical learning and that, despite non-convexity, it can be tackled using dual algorithms that only involve a sequence of regularized, unconstrained ERM problems. This approach has been used in several domains, such as federated learning (Shen et al., 2022), fairness (Cotter et al., 2019; Tran et al., 2021), active learning (Elenter et al., 2022), adversarial robustness (Robey et al., 2021), and data augmentation (Hounie et al., 2022). These theoretical works, however, only address (i) the estimation error, arising from the empirical approximation of expectations in ERM, and (ii) the approximation error, arising from the use of finite-dimensional models with limited functional representation capability.
These are the leading challenges in unconstrained learning, since the convergence properties of unconstrained optimization algorithms are well understood in convex settings (e.g., (Bertsekas, 1997; Boyd & Vandenberghe, 2004)) as well as in many non-convex ones (e.g., for overparametrized models (Soltanolkotabi et al., 2018; Brutzkus & Globerson, 2017; Ge et al., 2017)). This is not the case in constrained learning, where (iii) the optimization error can play a crucial role.

Indeed, dual methods are severely limited when it comes to recovering feasible solutions of constrained problems. In fact, not only might their primal iterates not converge to a feasible point [e.g., Fig. 1 or (Cotter et al., 2019, Section 6.3.1)], but they might not converge at all, displaying a cyclostationary behavior instead. This problem is hard even from an algorithmic complexity point of view (Daskalakis et al., 2021). For convex problems, this issue can be overcome by simply averaging the iterates (Nedić & Ozdaglar, 2009). Non-convex problems, however, require randomization (Kearns et al., 2018; Agarwal et al., 2018; Goh et al., 2016; Chamon et al., 2023). This approach is not only impractical, given the need to store a growing sequence of primal iterates, but also raises ethical considerations, since randomization further hinders explainability. Yet, it has been observed that for typical modern ML tasks, taking the last or best iterate can perform well in practice (Cotter et al., 2018; Chamon & Ribeiro, 2020; Chamon et al., 2023; Robey et al., 2021; Elenter et al., 2022; Hounie et al., 2022; Shen et al., 2022; Gallego-Posada et al., 2022).

This work addresses this gap between theory and practice by characterizing the sub-optimality and infeasibility of primal solutions associated with optimal dual variables. To do so, we observe that, though non-convex, constrained learning problems are generally parametrized versions of benign functional optimization problems. We then show that for sufficiently rich parametrizations, solutions obtained by dual algorithms closely approximate these functional solutions, not only in terms of optimal value as per (Cotter et al., 2019; Chamon & Ribeiro, 2020; Chamon et al., 2023), but also in terms of constraint satisfaction. This implies that dual ascent methods yield near-optimal and near-feasible solutions without randomization, despite non-convexity.

2 CONSTRAINED LEARNING

2.1 STATISTICAL CONSTRAINED RISK MINIMIZATION

As in classical learning, constrained learning tasks can be formulated as a statistical optimization problem, namely,

$$P^\star_p = \min_{\theta \in \Theta}\ \ell_0(f_\theta) := \mathbb{E}_{(x,y)}\big[\ell_0(f_\theta(x), y)\big] \quad \text{s.to} \quad \ell_i(f_\theta) := \mathbb{E}_{(x,y)}\big[\ell_i(f_\theta(x), y)\big] \le 0, \ i = 1, \dots, m \qquad (P_p)$$

where $f_\theta : \mathcal{X} \to \mathcal{Y}$ is a function associated with the parameter vector $\theta \in \Theta \subseteq \mathbb{R}^p$, and the hypothesis class $\mathcal{F}_\theta = \{f_\theta : \theta \in \Theta\}$ induced by this family of functions is assumed to be a subset of some compact functional space $\mathcal{F} \subset L^2$. Throughout the paper, we use the subscript p (parametrized) to refer to quantities related to (P_p). The functionals $\ell_i : \mathcal{F} \to \mathbb{R}$, $i = 0, \dots, m$, denote expected risks for loss functions $\ell_i$. In this setting, $\ell_0$ can be interpreted as a top-line metric (e.g., accuracy), while the functionals $\ell_1, \dots, \ell_m$ encode statistical requirements that the solution must satisfy (see the example below).

Example 2.1: Learning under counterfactual fairness constraints.
Consider the problem of learning an accurate classifier that is insensitive to changes in a set of protected attributes. Due to the correlation between these attributes and other features, simply hiding them from the model is not enough to guarantee this insensitivity. To do so, this requirement must be enforced explicitly. Indeed, consider the COMPAS study (ProPublica, 2020), with the goal of predicting recidivism based on past offense data while controlling for gender and racial bias. Explicitly, let $\ell_0$ denote the cross-entropy loss $\ell_0(\hat y, y) = -\log[\hat y]_y$. By collecting the protected features into the separate vector $z$, i.e., $x = [\tilde x, z]$, we can formulate the problem of learning a predictor insensitive to transformations $\rho_i$ that encompass all possible single-variable modifications of $z$. Explicitly,

$$\min_{\theta \in \mathbb{R}^p}\ \mathbb{E}\big[\ell_0(f_\theta(x), y)\big] \quad \text{s.to} \quad \mathbb{E}\big[D_{\mathrm{KL}}\big(f_\theta(\tilde x, z)\,\|\,f_\theta(\tilde x, \rho_i(z))\big)\big] \le c, \quad i = 1, \dots, m,$$

where $c > 0$ is the desired sensitivity level. Note that this formulation corresponds to the notion of (average) counterfactual fairness from (Kusner et al., 2018, Definition 5). In this setting, each constraint represents a requirement that the output of the classifier be near-invariant to changes in the protected features (here, gender and race). For instance, the prediction should be (almost) the same whether, all else being equal, the gender of the input is changed from Male to Female (and vice versa) or the race is changed from Caucasian to African-American.

[Figure 1: Feasibility of primal iterates in a constrained learning problem with fairness requirements. Left: evolution of the constraint value for an example of a hard constraint, which oscillates between feasibility and infeasibility, and an easy constraint, which remains feasible for all iterations. Right: percentage of infeasible iterations per constraint. After training accuracy has settled (around half of the training epochs), all but the last constraint are infeasible on 30-45% of the iterations. In fact, at least one constraint is violated on 85% of the iterations shown. We therefore cannot stop the algorithm and expect to obtain a feasible solution.]

Note that even if the losses $\ell_i(\hat y, y)$ are convex in $(\hat y, y)$ (as is the case of the cross-entropy), the functionals $\ell_i$ need not be convex in $\theta$. This is the case, for instance, for typical modern ML models (e.g., if $f_\theta$ is a neural network, NN). Hence, (P_p) is usually a non-convex optimization problem for which there is no straightforward way to project onto the feasibility set (e.g., onto the set of fair NNs). In light of these challenges, we turn to Lagrangian duality.
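To make Example 2.1 concrete, the following is a minimal PyTorch sketch of its objective and constraint slacks. It is an illustration under stated assumptions, not the paper's implementation: the helper name fairness_losses, the constraint level c = 0.01, and the list flips of single-attribute transformations rho_i are hypothetical choices (the paper's experimental details are in Appendix A.16).

```python
import torch
import torch.nn.functional as F

def fairness_losses(model, x_tilde, z, y, flips):
    """Objective and constraint slacks of Example 2.1 on a batch.

    model : classifier returning logits on the input [x_tilde, z]
    flips : list of maps rho_i(z), each flipping one protected attribute
    Returns (objective, constraints); constraints[i] <= 0 means the i-th
    counterfactual-fairness constraint E[KL] - c <= 0 is satisfied.
    """
    c = 0.01  # desired sensitivity level (hypothetical value)
    logits = model(torch.cat([x_tilde, z], dim=-1))
    obj = F.cross_entropy(logits, y)          # expected risk ell_0
    log_p = F.log_softmax(logits, dim=-1)
    cons = []
    for rho in flips:
        logits_i = model(torch.cat([x_tilde, rho(z)], dim=-1))
        log_q = F.log_softmax(logits_i, dim=-1)
        # KL(f(x_tilde, z) || f(x_tilde, rho_i(z))), averaged over the batch
        kl = F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")
        cons.append(kl - c)
    return obj, torch.stack(cons)
```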
2.2 LEARNING IN THE DUAL DOMAIN

Let the Lagrangian $L : \mathcal{F} \times \mathbb{R}^m_+ \to \mathbb{R}$ be defined as

$$L(\phi, \lambda) = \ell_0(\phi) + \lambda^T \ell(\phi), \qquad (1)$$

where $\ell = [\ell_1, \dots, \ell_m]^T$ is a vector-valued functional collecting the constraints of (P_p). For reasons that will become apparent later, we define $L$ over $\mathcal{F} \supseteq \mathcal{F}_\theta$. For a fixed dual variable $\lambda_p$, the Lagrangian $L(f_\theta, \lambda_p)$ is a regularized version of (P_p), where $\ell$ acts as the regularizing functional. This leads to the dual function

$$g_p(\lambda_p) = \min_{\theta \in \Theta} L(f_\theta, \lambda_p), \qquad (2)$$

based on which we can in turn define the dual problem of (P_p) as

$$D^\star_p = \max_{\lambda_p \ge 0}\ g_p(\lambda_p). \qquad (D_p)$$

This saddle-point problem can be viewed as a two-player game or as a regularized learning problem in which the regularization parameter is also an optimization variable. As such, (D_p) is a relaxation of (P_p), implying that $D^\star_p \le P^\star_p$. This is known as weak duality (Bertsekas, 1997).

The dual function $g_p$ in (2) is concave irrespective of whether (P_p) is convex, since it is the pointwise minimum of a family of functions affine in $\lambda$ (Boyd & Vandenberghe, 2004). As such, though $g_p$ may not be differentiable, it can be equipped with supergradients that provide potential ascent directions. Explicitly, a vector $s \in \mathbb{R}^m$ is a supergradient of the concave function $h : \mathbb{R}^m \to \mathbb{R}$ at a point $x$ if $h(z) \le h(x) + s^T(z - x)$ for all $z$. The set of all supergradients of $h$ at $x$ is called the superdifferential and is denoted $\partial h(x)$. When the losses $\ell_i$ are continuous, the superdifferential of $g_p$ admits a simple description (Nedić & Ozdaglar, 2009), namely, $\partial g_p(\lambda_p) = \operatorname{conv}\big\{\ell(f_\theta(\lambda_p)) : f_\theta(\lambda_p) \in \mathcal{F}^\star_\theta(\lambda_p)\big\}$, where $\operatorname{conv}(S)$ denotes the convex hull of the set $S$ and $\mathcal{F}^\star_\theta(\lambda_p)$ denotes the set of Lagrangian minimizers $f_\theta(\lambda_p)$ associated with the multiplier $\lambda_p$, i.e.,

$$\mathcal{F}^\star_\theta(\lambda_p) = \operatorname*{arg\,min}_{\theta \in \Theta}\ L(f_\theta, \lambda_p). \qquad (3)$$

Algorithm 1 Dual Constrained Learning
1: Inputs: number of iterations $T \in \mathbb{N}$, step size $\eta > 0$.
2: Initialize: $\lambda(1) = 0$.
3: for $t = 1, \dots, T$ do
4:   Obtain $f_\theta(t)$ such that $f_\theta(t) \in \operatorname*{arg\,min}_{\theta \in \Theta}\ \ell_0(f_\theta) + \lambda(t)^T \ell(f_\theta) = \operatorname*{arg\,min}_{\theta \in \Theta}\ L(f_\theta, \lambda(t))$
5:   Update the dual variables $\lambda_i(t+1) = \max\big[0,\ \lambda_i(t) + \eta\,\ell_i(f_\theta(t))\big]$

In particular, this leads to an algorithm for solving (D_p) known as projected supergradient ascent (Polyak, 1987; Shor, 2013), which we summarize in Algorithm 1. When executing Algorithm 1, the dual iterates $\lambda_p(t)$ move along ascent directions of the concave function $g_p$ (Shor, 2013, Section 2.4). Yet, the sequence of primal iterates $\{f_\theta(t)\}_{t=1}^T$ obtained as a byproduct need not approach the set of solutions of (P_p). The experiment in Figure 1 showcases this behavior and illustrates that, in general, one cannot simply stop the dual ascent algorithm at an arbitrary iteration $t$ and expect the primal iterate $f_\theta(t)$ to be feasible.

Additionally, Lagrangian minimizers are not unique. In particular, for an optimal dual variable $\lambda^\star_p \in \Lambda^\star_p$, the set $\mathcal{F}^\star_\theta(\lambda^\star_p)$ is typically not a singleton and could contain infeasible elements (i.e., $\ell_i(f_\theta(\lambda^\star_p)) > 0$ for some $i \ge 1$). Even more so, as $\lambda_p(t)$ approaches $\Lambda^\star_p$, the constraint satisfaction of the primal iterates can exhibit pathological cyclostationary behavior, where one or more constraints oscillate between feasibility and infeasibility; see, e.g., (Cotter et al., 2019, Section 6.3.1). For these reasons, convergence guarantees for non-convex optimization problems typically require randomization over (a subset of) the sequence $\{f_\theta(t)\}_{t=1}^T$, which is far from practical [see, e.g., (Agarwal et al., 2018, Theorem 2), (Kearns et al., 2018, Theorem 4.1), (Cotter et al., 2019, Theorem 2), (Chamon et al., 2023, Theorem 3)]. In the sequel, we show conditions under which this is not necessary.
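The following is a minimal sketch of Algorithm 1. The oracle lagrangian_min for Step 4 (in practice, e.g., several epochs of SGD on the lambda-regularized ERM objective) and the function signatures are illustrative assumptions, not part of the paper.

```python
import numpy as np

def dual_constrained_learning(lagrangian_min, ell, T, eta, m):
    """Projected supergradient ascent on the dual problem (Algorithm 1).

    lagrangian_min(lam) : oracle returning an f_theta minimizing
                          L(f_theta, lam) = ell_0(f_theta) + lam @ ell(f_theta)
    ell(f)              : constraint slacks [ell_1(f), ..., ell_m(f)]
    Returns the final dual variables and the last primal iterate.
    """
    lam = np.zeros(m)                              # step 2: lambda(1) = 0
    f = None
    for _ in range(T):                             # step 3
        f = lagrangian_min(lam)                    # step 4: Lagrangian minimization
        slack = ell(f)                             # a supergradient of g_p at lam
        lam = np.maximum(0.0, lam + eta * slack)   # step 5: projected ascent
    return lam, f
```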
3 NEAR-OPTIMAL SOLUTIONS OF CONSTRAINED LEARNING PROBLEMS

Primal iterates obtained as a by-product of the dual ascent method in Algorithm 1 may fail to be solutions of (P_p). However, it has been observed that taking the last or best iterate can perform well in practice. This can be understood by viewing (P_p) as the parametrized version of a benign functional program that is amenable to a Lagrangian formulation. This unparametrized problem does not suffer from the same limitations as (P_p) in terms of primal recovery, and we can thus use its solution as a reference point to measure the sub-optimality of the primal iterates obtained with Algorithm 1.

The unparametrized constrained learning problem is defined as

$$P^\star_u = \min_{\phi \in \mathcal{F}}\ \ell_0(\phi) \quad \text{s.to} \quad \ell_i(\phi) \le 0, \ i = 1, \dots, m \qquad (P_u)$$

where $\mathcal{F}$ is a convex, compact subset of an $L^2$ space. For instance, $\mathcal{F}$ can be a subset of the space of continuous functions or of a reproducing kernel Hilbert space (RKHS), and $\mathcal{F}_\theta$ can be induced by a neural network architecture with smooth activations or by finite linear combinations of kernels. In both cases, we know that $\mathcal{F}_\theta$ can uniformly approximate $\mathcal{F}$ arbitrarily well as the dimension of $\theta$ grows (Hornik, 1991; Berlinet & Thomas-Agnan, 2011). The smallest choice of $\mathcal{F}$ is in fact $\overline{\operatorname{conv}}(\mathcal{F}_\theta)$, the closed convex hull of $\mathcal{F}_\theta$. Analogous to the definitions from Section 2.1,

$$g_u(\lambda_u) := \min_{\phi \in \mathcal{F}}\ L(\phi, \lambda_u)$$

denotes the unparametrized dual function, $\mathcal{F}^\star(\lambda_u) = \operatorname{arg\,min}_{\phi \in \mathcal{F}} L(\phi, \lambda_u)$ is the set of Lagrangian minimizers $\phi(\lambda_u)$ associated with $\lambda_u$, and

$$D^\star_u = \max_{\lambda_u \ge 0}\ g_u(\lambda_u) \qquad (D_u)$$

is the unparametrized dual problem. The subscript u is used to denote quantities related to the unparametrized problem (P_u). We now present two assumptions that allow us to characterize the relation between the dual and primal solutions of problem (D_u).

Assumption 3.1. The functionals $\ell_i$, $i = 0, \dots, m$, are convex and $M$-Lipschitz continuous on $\mathcal{F}$. Additionally, $\ell_0$ is $\mu_0$-strongly convex.

Note that we require convexity of the losses with respect to their functional arguments and not the model parameters $\theta$, which holds for most typical losses, e.g., the mean squared error and the cross-entropy loss.

Assumption 3.2. There exists $\phi \in \mathcal{F}$ such that $\ell(\phi) < \min\big[0,\ \ell(\phi(\lambda^\star_p)),\ \ell(f_\theta(\lambda^\star_p))\big]$ for all $\phi(\lambda^\star_p) \in \mathcal{F}^\star(\lambda^\star_p)$, $f_\theta(\lambda^\star_p) \in \mathcal{F}^\star_\theta(\lambda^\star_p)$, and $\lambda^\star_p \in \Lambda^\star_p$.

Assumption 3.2 is a stronger version of Slater's constraint qualification, which requires only $\ell(\phi) < 0$. Here, we require the existence of a (suboptimal) candidate $\phi$ that is strictly feasible even for perturbed versions of (P_u). Under these assumptions, the Lagrangian minimizer is unique. This makes the superdifferential of the dual function a singleton at every $\lambda_u$, $\partial g_u(\lambda_u) = \{\ell(\phi(\lambda_u))\}$, which means that the dual function $g_u(\lambda_u)$ is differentiable (Shor, 2013). Let $\phi^\star$ be a solution of problem (P_u). Assumptions 3.1 and 3.2 imply that strong duality (i.e., $P^\star_u = D^\star_u$) holds in this problem, and that at $\lambda^\star_u$ there is a unique Lagrangian minimizer $\phi(\lambda^\star_u) = \phi^\star$, which is, by definition, feasible (Bertsekas, 1997).

The only difference between problems (P_p) and (P_u) is the set over which the optimization is carried out. Thus, if the parametrization $\Theta$ is rich enough (e.g., deep neural networks), the set $\mathcal{F}_\theta$ is essentially the same as $\mathcal{F}$, and we should expect the properties of the solutions of problems (D_p) and (D_u) to be similar. This insight leads us to the $\nu$-near-universality of the parametrization assumption.

Assumption 3.3. For all $\phi \in \mathcal{F}$, there exists $\theta \in \Theta$ such that $\|\phi - f_\theta\|_{L^2} \le \nu$.

The constant $\nu$ in Assumption 3.3 is a measure of how well $\mathcal{F}_\theta$ covers $\mathcal{F}$. Consider, for instance, that $\mathcal{F}$ is the set of continuous functions and $\mathcal{F}_\theta$ the set of functions implementable with a two-layer neural network with sigmoid activations and $K$ hidden neurons. If the parametrization has 10 neurons in the hidden layer, it is considerably worse at representing elements of $\mathcal{F}$ than one with 1000 neurons. While determining the exact value of $\nu$ is in general not straightforward, any $\nu > 0$ can be achieved for a large enough number of neurons (Hornik, 1991). The same holds for the number of kernels in an RKHS (Berlinet & Thomas-Agnan, 2011).
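As a toy numerical probe of Assumption 3.3 (not the paper's construction), one can upper-bound the best $L^2$ approximation error of a two-layer sigmoid network with $K$ hidden neurons by freezing a random first layer and solving for the output layer by least squares. The target function and grid below are arbitrary illustrative choices; the random-feature fit only upper-bounds the achievable $\nu$ for each $K$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 2000)[:, None]            # grid discretizing L2([-3, 3])
target = np.sin(2 * x[:, 0]) + 0.5 * x[:, 0]     # a continuous target "phi" in F

def best_l2_error(K):
    """Empirical L2 distance from `target` to the span of a two-layer
    sigmoid network with K random hidden neurons (output layer solved
    exactly by least squares)."""
    W, b = rng.normal(size=(1, K)), rng.normal(size=K)
    H = 1.0 / (1.0 + np.exp(-(x @ W + b)))       # hidden-layer activations
    coef, *_ = np.linalg.lstsq(H, target, rcond=None)
    resid = target - H @ coef
    return np.sqrt(np.mean(resid ** 2))

for K in (10, 100, 1000):                        # richer models -> smaller nu
    print(f"K = {K:4d}  ->  nu estimate ~ {best_l2_error(K):.4f}")
```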
Given these facts, it is legitimate to ask: how close are the elements of $\mathcal{F}^\star_\theta(\lambda^\star_p)$ to $\phi^\star$ in terms of their optimality and constraint satisfaction? Bounding these errors would theoretically justify the use of last primal iterates, doing away with the need for randomization.

3.1 NEAR-OPTIMALITY AND NEAR-FEASIBILITY OF DUAL LEARNING

A key challenge of using duality to undertake (P_p) is that the value $D^\star_p$ of the dual problem (D_p) need not be a good approximation of the value $P^\star_p$ of (P_p) (i.e., lack of strong duality). This was tackled in (Chamon et al., 2023, Prop. 3.3). Explicitly, under Assumptions 3.1-3.3, the duality gap of problem (P_p) is bounded as in

$$P^\star_p - D^\star_p \le M\nu\big(1 + \|\bar\lambda\|_1\big) := \Gamma_1, \qquad (4)$$

where $\bar\lambda$ maximizes $\bar g_p(\lambda) = g_p(\lambda) + M\nu\|\lambda\|_1$. This result, however, only shows that the dual problem can be used to approximate the value of the constrained problem (P_p). It says nothing about whether it can provide a (near-)feasible solution, which is the main issue addressed in this paper.

We next characterize the sub-optimality and constraint violation of the Lagrangian minimizers $f_\theta(\lambda^\star_p) \in \mathcal{F}^\star_\theta(\lambda^\star_p)$ by comparing these primal variables with the solution $\phi^\star$ of the unparametrized problem. The curvature of the unparametrized dual function $g_u(\lambda_u)$ around its optimum is central to this analysis. We will first provide a result under the following assumption on this curvature and then describe its connection to the properties of (P_p). Let $H_\lambda := \{\gamma\lambda^\star_u + (1-\gamma)\lambda^\star_p : \gamma \in [0,1]\}$ denote the segment connecting $\lambda^\star_u$ and $\lambda^\star_p$.

Assumption 3.4. The dual function $g_u$ is $\mu_g$-strongly concave and $\beta_g$-smooth along $H_\lambda$.

The following proposition characterizes the constraint violation of all $f_\theta(\lambda^\star_p) \in \mathcal{F}^\star_\theta(\lambda^\star_p)$ with respect to $\phi^\star$, the optimal, feasible solution of the unparametrized problem.

Proposition 3.1. Under Assumptions 3.1-3.4, any $f_\theta(\lambda^\star_p) \in \mathcal{F}^\star_\theta(\lambda^\star_p)$ approximates the constraint value of the solution $\phi^\star$ of (P_u) as in

$$\big\|\ell(f_\theta(\lambda^\star_p)) - \ell(\phi^\star)\big\|_2^2 \le 2\beta_g\Big(1 + \sqrt{\beta_g/\mu_g}\Big)^2 M\nu\big(1 + \|\lambda^\star_p\|_1\big).$$

Since (P_u) is feasible, $\ell(\phi^\star)$ is non-positive. Hence, the approximation bound in Proposition 3.1 is stronger than an infeasibility bound on $f_\theta(\lambda^\star_p)$. Indeed, it says not only that $\ell(f_\theta(\lambda^\star_p))$ is approximately non-positive, but that it approximates the constraint values of the optimal solution $\phi^\star$. The ratio $\beta_g/\mu_g$ (i.e., the condition number of $g_u$), which determines optimal step sizes in dual ascent methods (Polyak, 1987), also plays a key role here, representing the tension between two fundamental forces driving this bound. On the one hand, the sensitivity of the dual problems, controlled by $\mu_g$, which determines how different $\lambda^\star_u$ and $\lambda^\star_p$ are. On the other hand, the sensitivity of the primal problems, linked to the smoothness constant $\beta_g$, which determines the effect of this difference on feasibility.

Nevertheless, Proposition 3.1 remains abstract. To connect it to the properties of (P_p), we rely on the following assumptions to obtain bounds on $\mu_g$ and $\beta_g$.

Assumption 3.5. The functionals $\ell_i$, $i = 0, \dots, m$, are $\beta$-smooth on $\mathcal{F}$.

Assumption 3.6. The Jacobian $D_\phi\ell(\phi^\star)$ is full-row rank at the optimum, i.e., there exists $\sigma > 0$ such that

$$\inf_{\|\lambda\|_2 = 1}\ \big\|\lambda^T D_\phi\ell(\phi^\star)\big\|_{L^2} \ge \sigma,$$

where $D_\phi\ell(\phi^\star)$ denotes the Fréchet derivative of the functional $\ell$ at $\phi^\star$ (see the definition in Appendix A.1).

Assumption 3.6, unlike the previous regularity assumptions over which a practitioner has full control, is not straightforward to satisfy at first sight. It is, however, a typical assumption used to derive duality results in convex optimization, known as the linear independence constraint qualification, or LICQ (Bertsekas, 1997).
As such, it could be replaced by a different constraint qualification, such as a stricter version of Assumption 3.2. This is, however, left for future work. Under these assumptions, we can describe the curvature of $g_u$ in terms of the problem parameters as follows.

Lemma 3.1. Under Assumptions 3.1, 3.2, 3.5, and 3.6, $g_u(\lambda_u)$ is $\mu_g$-strongly concave and $\beta_g$-smooth on $H_\lambda$ for

$$\mu_g = \frac{\mu_0\,\sigma^2}{\beta^2(1+\Delta)^2} \quad \text{and} \quad \beta_g = \frac{\sqrt{m}\,M^2}{\mu_0},$$

where $\Delta = \max\big(\|\lambda^\star_u\|_1, \|\lambda^\star_p\|_1\big)$.

Having characterized the curvature of the unparametrized dual function $g_u$, we can now state the main result of this section, which puts together Proposition 3.1, Lemma 3.1, and the near-optimality result from (Chamon et al., 2023) in (4) to bound the near-optimality and near-feasibility of Lagrangian minimizers associated with optimal dual variables.

Theorem 3.1. Under Assumptions 3.1, 3.2, 3.3, 3.5, and 3.6, the sub-optimality and infeasibility of any $f_\theta(\lambda^\star_p) \in \mathcal{F}^\star_\theta(\lambda^\star_p)$ are bounded by

$$\big\|\ell(f_\theta(\lambda^\star_p)) - \ell(\phi^\star)\big\|_\infty \le \Gamma_2 := M\sqrt{m}\,\big[1 + \kappa_1\kappa_0(1+\Delta)\big]\sqrt{\frac{2M\nu}{\mu_0}\big(1 + \|\lambda^\star_p\|_1\big)} \qquad (5)$$

$$\big|P^\star_p - \ell_0(f_\theta(\lambda^\star_p))\big| \le \big(1 + \|\lambda^\star_p\|_1\big)M\nu + \Gamma_1 + \|\lambda^\star_p\|_1\,\Gamma_2 \qquad (6)$$

with $\kappa_1 = M/\sigma$, $\kappa_0 = \beta/\mu_0$, $\Delta = \max\{\|\lambda^\star_u\|_1, \|\lambda^\star_p\|_1\}$, and $\Gamma_1$ as in (4).

Theorem 3.1 shows that the dual problem (D_p) not only approximates the value $P^\star_p$ of (P_p), but also provides approximate solutions for it. The quality of these approximations depends on three factors. First, the sensitivity of the learning problem, as captured by the Lipschitz constant $M$ and the constants $\kappa_1$ and $\kappa_0$, which correspond to the condition numbers of the constraint Jacobian and the objective function, respectively. Overall, these quantities measure how well-conditioned the problem is. Second, the difficulty of the requirements. Indeed, the optimal dual variables can be seen as measures of the sensitivity of the objective value with respect to constraint perturbations (see, e.g., (Boyd & Vandenberghe, 2004)). Hence, the more stringent the constraints, the larger $\|\lambda^\star_u\|_1$ and/or $\|\lambda^\star_p\|_1$. Finally, the approximation error depends on the factor $\nu$, which denotes the richness of the parametrization, i.e., how good it is at approximating functions in $\mathcal{F}$ (Assumption 3.3). In fact, Theorem 3.1 shows that as the model capacity increases ($\nu$ decreases), the approximation bounds (5)-(6) improve. This behavior is not trivial. Indeed, while we expect richer parametrizations to lead to lower approximation errors, Theorem 3.1 states that they also make solving the optimization problem (P_p) easier, since dual solutions then provide better approximations of primal solutions. Observe that the effect of these factors on feasibility in (5) is similar to their effect on optimality in (6) and, e.g., in (Chamon et al., 2023). Next, we leverage these results to provide convergence guarantees for Algorithm 1. But first, we outline the main ideas behind the proof of Proposition 3.1.
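To get a feel for how the bounds in Theorem 3.1 scale, the following sketch evaluates $\Gamma_1$, $\Gamma_2$, and the sub-optimality bound (6) for hypothetical problem constants. All numerical values below are made up for illustration, and $\|\bar\lambda\|_1$ in $\Gamma_1$ is approximated by $\|\lambda^\star_p\|_1$.

```python
import math

# Hypothetical problem constants (none of these values come from the paper).
M, m, nu = 1.0, 6, 1e-3           # Lipschitz constant, # constraints, richness
mu0, beta, sigma = 0.5, 2.0, 0.5  # strong convexity, smoothness, LICQ constant
lam_p1, lam_u1 = 1.0, 1.0         # ||lambda_p*||_1 and ||lambda_u*||_1

kappa1, kappa0 = M / sigma, beta / mu0
Delta = max(lam_u1, lam_p1)

Gamma1 = M * nu * (1 + lam_p1)    # duality-gap bound (4), ||bar-lambda||_1 proxy
Gamma2 = (M * math.sqrt(m) * (1 + kappa1 * kappa0 * (1 + Delta))
          * math.sqrt(2 * M * nu * (1 + lam_p1) / mu0))   # eq. (5)

print(f"infeasibility bound  Gamma2 = {Gamma2:.4f}")
print(f"sub-optimality bound (6)    = "
      f"{(1 + lam_p1) * M * nu + Gamma1 + lam_p1 * Gamma2:.4f}")
# Halving nu shrinks Gamma2 by sqrt(2): richer parametrizations help feasibility.
```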
3.2 PROOF SKETCH

In this section, we provide a brief outline of the proof of Theorem 3.1. We begin by decomposing the distance between constraint violations as

$$\big\|\ell(f_\theta(\lambda^\star_p)) - \ell(\phi^\star)\big\|_2 = \big\|\ell(f_\theta(\lambda^\star_p)) - \ell(\phi(\lambda^\star_p)) + \ell(\phi(\lambda^\star_p)) - \ell(\phi^\star)\big\|_2 \le \big\|\ell(f_\theta(\lambda^\star_p)) - \ell(\phi(\lambda^\star_p))\big\|_2 + \big\|\ell(\phi(\lambda^\star_p)) - \ell(\phi(\lambda^\star_u))\big\|_2. \qquad (7)$$

The first term captures the effect of parametrizing the hypothesis class for a fixed dual variable. In contrast, the second term characterizes the effect of changing the dual variables on the unparametrized Lagrangian minimizer. This is made clear in (7) by using the fact that $\phi^\star = \phi(\lambda^\star_u)$ (see the discussion in Section 3). In the sequel, we analyze each of these terms separately. For conciseness, all technical definitions from this section are deferred to Appendix A.1.

3.2.1 DUAL VARIABLE PERTURBATION

We begin by analyzing the second term in (7). Recall from the beginning of Section 3 that under Assumptions 3.1-3.2 it holds that $\nabla_\lambda g_u(\lambda) = \ell(\phi(\lambda))$. Hence,

$$\big\|\ell(\phi(\lambda^\star_p)) - \ell(\phi(\lambda^\star_u))\big\|_2 = \big\|\nabla_\lambda g_u(\lambda^\star_p) - \nabla_\lambda g_u(\lambda^\star_u)\big\|_2.$$

Using the $\beta_g$-smoothness of $g_u$, this gradient difference can be bounded in terms of $\|\lambda^\star_p - \lambda^\star_u\|_2$. The latter can in turn be bounded by combining the $\nu$-universality of the parametrization (Assumption 3.3) with convex optimization perturbation results, to obtain:

Proposition 3.2. Under Assumptions 3.1-3.4, the distance between the constraint violations of $\phi(\lambda^\star_p)$ and $\phi(\lambda^\star_u)$ is bounded by

$$\big\|\ell(\phi(\lambda^\star_p)) - \ell(\phi(\lambda^\star_u))\big\|_2^2 \le \frac{2\beta_g^2}{\mu_g}\,M\nu\big(1 + \|\lambda^\star_p\|_1\big). \qquad (8)$$

3.2.2 HYPOTHESIS CLASS PERTURBATION

Bounding the first term in (7) is less straightforward. To do so, we rely on the perturbation function of the unparametrized problem (P_u), defined as

$$P(\epsilon) = \min_{\phi \in \mathcal{F}}\ \ell_0(\phi) \quad \text{s.to} \quad \ell(\phi) + \epsilon \le 0, \qquad (P_\epsilon)$$

for some perturbation $\epsilon \in \mathbb{R}^m$. Intuitively, $P(\epsilon)$ quantifies the impact on the objective value of modifying the constraint specifications by $\epsilon$. Note that the unparametrized problem (P_u) is recovered for $\epsilon = 0$. Motivated by the fact that we can get a strong handle on the sensitivity of the perturbation function in (P_ε), we seek to bound $\|\ell(f_\theta(\lambda^\star_p)) - \ell(\phi(\lambda^\star_p))\|_2$ by instead analyzing $|P(\epsilon_p) - P(\epsilon_u)|$ for $\epsilon_p = -\ell(f_\theta(\lambda^\star_p))$ and $\epsilon_u = -\ell(\phi(\lambda^\star_p))$. Indeed, it holds for every $\lambda \in \mathbb{R}^m_+$ that $P^*(\lambda) = -g_u(\lambda)$, where $^*$ denotes the Fenchel conjugate (Definition A.4). We can therefore relate the curvature of $g_u$ to that of $P(\epsilon)$ (Kakade et al., 2009) to obtain:

Proposition 3.3. Under Assumptions 3.1-3.4, the distance between the constraint violations associated with the parametrization of the hypothesis class is bounded by

$$\big\|\ell(\phi(\lambda^\star_p)) - \ell(f_\theta(\lambda^\star_p))\big\|_2^2 \le 2\beta_g\,M\nu\big(1 + \|\lambda^\star_p\|_1\big).$$

Using Propositions 3.2-3.3 in (7) yields Proposition 3.1. Theorem 3.1 is then obtained by further leveraging Lemma 3.1 and the bound on the duality gap $P^\star_p - D^\star_p$ in (4) (see Appendix A.13.2).

4 BEST ITERATE CONVERGENCE

In this section, we leverage the connection between the parametrized [cf. (P_p) and (D_p)] and unparametrized [cf. (P_u) and (D_u)] problems to analyze the convergence of Algorithm 1. Seeking a more general result, we relax Steps 4 and 5 to allow for approximate Lagrangian minimization and the use of stochastic supergradients of the dual function, respectively. Explicitly, we assume that for all $t$, the oracle in Step 4 returns a function $f^o_\theta(t)$ such that

$$L(f^o_\theta(t), \lambda(t)) \le \min_{\theta \in \Theta}\ L(f_\theta, \lambda(t)) + \rho, \qquad (9)$$

for an approximation error $\rho \ge 0$. In contrast to Step 4, equation 9 accounts for potential numerical and approximation errors in the computation of the Lagrangian minimizer. The existence of such a $\rho$-approximate oracle is a typical assumption in the analysis of dual algorithms (Cotter et al., 2019; Chamon et al., 2023; Kearns et al., 2018) and is often justified by substantial theoretical and empirical evidence that many ML optimization problems can be solved efficiently despite non-convexity. That is the case, e.g., for deep neural networks (Zhang et al., 2021; Brutzkus & Globerson, 2017; Soltanolkotabi et al., 2018; Ge et al., 2017). Additionally, we consider that the dual variable update in Step 5 is replaced by

$$\lambda_i(t+1) = \max\big[0,\ \lambda_i(t) + \eta\,\hat\ell_i(f^o_\theta(t))\big], \qquad (10)$$

where $\hat\ell_i(f^o_\theta(t))$ are conditionally unbiased estimates of the statistical risks $\ell_i(f^o_\theta(t))$, i.e., $\mathbb{E}[\hat\ell_i(f^o_\theta(t)) \mid \lambda(t)] = \ell_i(f^o_\theta(t))$. This stochastic update accounts for, e.g., the use of independent sample batches to estimate the constraint slacks $\ell_i(f^o_\theta)$.
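The following is a minimal sketch of the relaxed scheme (9)-(10), tracking the best dual iterate analyzed next. The oracle interface, and the use of the returned Lagrangian value as a ($\rho$-accurate) estimate of $g_p(\lambda)$, are illustrative assumptions.

```python
import numpy as np

def stochastic_dual_ascent(approx_lagrangian_min, ell_hat, T, eta, m):
    """Empirical version of Algorithm 1 with the relaxations (9)-(10).

    approx_lagrangian_min(lam) : rho-approximate oracle (e.g., a few epochs of
                                 SGD on the lam-regularized ERM objective);
                                 returns (f, L(f, lam)), where L(f, lam)
                                 estimates g_p(lam) up to the oracle error rho
    ell_hat(f)                 : unbiased minibatch estimates of the slacks
    Returns the best dual iterate, in the sense of Lemma 4.1.
    """
    lam = np.zeros(m)
    best = (-np.inf, lam)                              # (dual value, iterate)
    for _ in range(T):
        f, lagrangian_value = approx_lagrangian_min(lam)          # eq. (9)
        if lagrangian_value > best[0]:
            best = (lagrangian_value, lam.copy())
        lam = np.maximum(0.0, lam + eta * ell_hat(f))             # eq. (10)
    return best
```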
The following lemma establishes the convergence of the best iterate of (9)-(10), i.e., of the dual variables $\lambda(t)$ that yield the largest dual function value up to time $t$.

Lemma 4.1. Let $g^{best}_p(t \mid \lambda(t_0)) = \max_{s \in [t_0, t]} g_p(\lambda(s))$ be the maximum value of the parametrized dual function up to time $t$. Then, for all $t_0 > 1$, it holds that

$$\lim_{t \to \infty}\ g^{best}_p(t \mid \lambda(t_0)) \ge D^\star_p - \Big(\frac{\eta S^2}{2} + \rho\Big) \quad \text{a.s.},$$

where $S^2 \ge \sum_{i=1}^m \mathbb{E}\big[|\hat\ell_i(f^o_\theta(t))|^2 \mid \lambda(t)\big]$.

The existence of a finite $S^2$ is implied by the assumption that $\mathcal{F}_\theta \subseteq \mathcal{F} \subset L^2$. Lemma 4.1 implies that for any $\delta > 0$, there exists a finite $t^\dagger$ such that $\lambda(t^\dagger)$ achieves a dual value of at least $D^\star_p - (\eta S^2/2 + \rho + \delta)$. We denote this iterate $\lambda_{best}$. Note that the step size $\eta$ can be reduced so as to make $g^{best}_p$ arbitrarily close to $D^\star_p - \rho$ (asymptotically). In view of the bound on the duality gap $P^\star_p - D^\star_p$ in (4), Lemma 4.1 implies that $\lambda_{best}$ is near-optimal. Combined with the near-feasibility results from Section 3, we can also bound the constraint violations of the Lagrangian minimizer associated with $\lambda_{best}$.

Proposition 4.1. Let $\lambda_{best}$ be any dual iterate that achieves $g_p(\lambda_{best}) \ge D^\star_p - (\eta S^2/2 + \rho)$. Suppose there exists $\phi \in \mathcal{F}$ such that $\ell(\phi) < \min\{0,\ \ell(\phi(\lambda_{best})),\ \ell(f_\theta(\lambda_{best}))\}$ and that Assumptions 3.1, 3.3, 3.5, and 3.6 hold. Then,

$$\big\|\ell(\phi^\star) - \ell(f_\theta(\lambda_{best}))\big\|_2^2 \le 2\beta_g\Big(1 + \sqrt{\beta_g/\tilde\mu_g}\Big)^2\Big[M\nu\big(1 + \|\lambda_{best}\|_1\big) + \frac{\eta S^2}{2} + \rho\Big],$$

where $\tilde\mu_g = \mu_0\sigma^2\big/\big(\beta^2(1 + \max\{\|\lambda^\star_u\|_1, \|\lambda_{best}\|_1\})^2\big)$.

Reasonably, the bound in Proposition 4.1 is governed by the same terms as Theorem 3.1. Here, however, the bound is loosened by the sub-optimality of $\lambda_{best}$ with respect to $\lambda^\star_p$.

[Figure 2: Left: test accuracy of the Unconstrained, Last, and Randomized models; the unconstrained model performs better in terms of average test accuracy than both constrained ones. Middle: counterfactual fairness across the protected-attribute pairs (African-American/Caucasian, African-American/Hispanic, African-American/Other, Caucasian/Hispanic, Caucasian/Other, Hispanic/Other); both constrained models do better in terms of counterfactual fairness, and the key point is that the Last iterate is never far from the Randomized one in terms of constraint violation. Right: maximum constraint violation for increasing model capacity (2 to 23 features); as the richness of the parametrization increases, the maximum constraint violation (i.e., the size of the oscillations) decreases.]

5 EXPERIMENTAL VALIDATION

To illustrate the theoretical results from Sections 3 and 4, we return to the counterfactually fair learning problem from Example 2.1. We work with the COMPAS dataset, where the task is to predict recidivism while remaining insensitive to the protected variables gender and race, which can take the values [Male, Female] and [African-American, Hispanic, Caucasian, Other], respectively. We take the parametrized model $f_\theta$ to be a 2-layer NN with sigmoid activations, so that the resulting constrained learning problem is non-convex. Further experimental details are provided in Appendix A.16.

We compare the accuracy and constraint satisfaction of three models: an unconstrained predictor, trained without any additional constraints; a last iterate predictor, corresponding to the final iterate $f_\theta(T)$ of an empirical version of Algorithm 1; and a randomized predictor that samples a model uniformly at random from the sequence of primal iterates $\{f_\theta(t)\}_{t=t_0}^T$ for each prediction. As shown in Fig. 2
(Left), the unconstrained model is slightly better than the two constrained ones in terms of predictive accuracy. This advantage comes at the cost of less counterfactually fair predictions, i.e., a model more sensitive to the protected features (Fig. 2, Middle). The key point of this experiment, however, is that the last iterate and randomized predictors provide similar accuracy and constraint satisfaction, as predicted by Theorem 3.1.

Additionally, Fig. 2 (Right) showcases the impact of the parametrization richness on the constraint violation of last primal iterates. We control this richness by projecting the data onto a lower-dimensional space using a fixed, random linear map. Note that, as Theorem 3.1 indicates, the constraint violation decreases by up to an order of magnitude as we increase the capacity of the model. As we have observed before, this behavior is not straightforward: though richer parametrizations are expected to lead to lower approximation errors, it is not immediate that they should make the optimization problem (P_p) easier to solve.

6 CONCLUSION

We analyzed the primal iterates obtained from a dual ascent method when solving the Lagrangian dual of a primal non-convex constrained learning problem. The primal problem in question is the parametrized version of a convex functional program, which is amenable to a Lagrangian formulation. Specifically, we characterized how far these predictors are from the solution of the unparametrized problem in terms of their optimality and constraint violation. This result led to a characterization of the infeasibility of best primal iterates and elucidated the role of the capacity of the model and the curvature of the objective. These guarantees bridge a gap between theory and practice in constrained learning, shedding light on when and why randomization is unnecessary. The findings presented in this work can be extended in several ways. For instance, a study of the estimation error incurred by minimizing the empirical Lagrangian in Algorithm 1 could be added. It might also be possible to characterize the curvature of the dual function by alternative means, which could potentially lift assumptions on the unparametrized problem.

6.1 ACKNOWLEDGMENTS

The work of A. Ribeiro and J. Elenter is supported by NSF-Simons MoDL, Award 2031985, NSF AI Institutes program. The work of L. F. O. Chamon is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy (EXC 2075-390740016).

REFERENCES

Alekh Agarwal, Alina Beygelzimer, Miroslav Dudík, John Langford, and Hanna Wallach. A reductions approach to fair classification. In International Conference on Machine Learning, pp. 60-69. PMLR, 2018.

Tamer Başar and Pierre Bernhard. H-infinity Optimal Control and Related Minimax Design Problems: A Dynamic Game Approach. Springer Science & Business Media, 2008.

Alain Berlinet and Christine Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer Science & Business Media, 2011.

Dimitri P. Bertsekas. Nonlinear programming. Journal of the Operational Research Society, 48(3):334-334, 1997.

J. Frédéric Bonnans and Alexander Shapiro. Optimization problems with perturbations: A guided tour. SIAM Review, 40(2):228-264, 1998. ISSN 00361445. URL http://www.jstor.org/stable/2653333.

Stephen P. Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

Alon Brutzkus and Amir Globerson.
Globally optimal gradient descent for a ConvNet with Gaussian inputs. In International Conference on Machine Learning, pp. 605-614. PMLR, 2017.

Luiz Chamon and Alejandro Ribeiro. Probably approximately correct constrained learning. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 16722-16735. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/c291b01517f3e6797c774c306591cc32-Paper.pdf.

Luiz F. O. Chamon, Santiago Paternain, Miguel Calvo-Fullana, and Alejandro Ribeiro. Constrained learning with non-convex losses. IEEE Transactions on Information Theory, 69(3):1739-1760, 2023. doi: 10.1109/TIT.2022.3187948.

Andrew Cotter, Maya R. Gupta, Heinrich Jiang, Nathan Srebro, Karthik Sridharan, Serena Lutong Wang, Blake E. Woodworth, and Seungil You. Training well-generalizing classifiers for fairness metrics and other data-dependent constraints. In International Conference on Machine Learning, 2018. URL https://api.semanticscholar.org/CorpusID:49556538.

Andrew Cotter, Heinrich Jiang, Taman Narayan, Seungil You, and Karthik Sridharan. Optimization with non-differentiable constraints with applications to fairness, recall, churn, and other goals. J. Mach. Learn. Res., 20(172):1-59, 2019.

Constantinos Daskalakis, Stratis Skoulakis, and Manolis Zampetakis. The complexity of constrained min-max optimization. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pp. 1466-1478, 2021.

Juan Elenter, Navid NaderiAlizadeh, and Alejandro Ribeiro. A Lagrangian duality approach to active learning. Advances in Neural Information Processing Systems, 35:37575-37589, 2022.

Matthew M. Engelhard, Ricardo Henao, Samuel I. Berchuck, Junya Chen, Brian Eichner, Darby Herkert, Scott H. Kollins, Andrew Olson, Eliana M. Perrin, Ursula Rogers, Connor Sullivan, YiQin Zhu, Guillermo Sapiro, and Geraldine Dawson. Predictive value of early autism detection models based on electronic health record data collected before age 1 year. JAMA Network Open, 6(2):e2254303-e2254303, 02 2023. ISSN 2574-3805. doi: 10.1001/jamanetworkopen.2022.54303. URL https://doi.org/10.1001/jamanetworkopen.2022.54303.

Ferdinando Fioretto, Pascal Van Hentenryck, Terrence W. K. Mak, Cuong Tran, Federico Baldo, and Michele Lombardi. Lagrangian duality for constrained deep learning. In Yuxiao Dong, Georgiana Ifrim, Dunja Mladenić, Craig Saunders, and Sofie Van Hoecke (eds.), Machine Learning and Knowledge Discovery in Databases. Applied Data Science and Demo Track, pp. 118-135, Cham, 2021. Springer International Publishing. ISBN 978-3-030-67670-4.

Jose Gallego-Posada, Juan Ramirez, Akram Erraqabi, Yoshua Bengio, and Simon Lacoste-Julien. Controlled sparsity via constrained optimization or: How I learned to stop tuning penalties and love constraints. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 1253-1266. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/089b592cccfafdca8e0178e85b609f19-Paper-Conference.pdf.

Rong Ge, Jason D. Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with landscape design. arXiv preprint arXiv:1711.00501, 2017.

Rafal Goebel and R. Tyrrell Rockafellar. Local strong convexity and local Lipschitz continuity of the gradient of convex functions.
Journal of Convex Analysis, 15(2):263, 2008.

Gabriel Goh, Andrew Cotter, Maya Gupta, and Michael P. Friedlander. Satisfying real-world goals with dataset constraints. Advances in Neural Information Processing Systems, 29, 2016.

Vincent Guigues. Inexact stochastic mirror descent for two-stage nonlinear stochastic programs, 2020.

Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251-257, 1991.

Ignacio Hounie, Luiz F. O. Chamon, and Alejandro Ribeiro. Automatic data augmentation via invariance-constrained learning, 2022.

Sham Kakade, Shai Shalev-Shwartz, Ambuj Tewari, et al. On the duality of strong convexity and strong smoothness: Learning applications and matrix regularization. Unpublished manuscript, http://ttic.uchicago.edu/shai/papers/KakadeShalevTewari09.pdf, 2(1):35, 2009.

Michael Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In International Conference on Machine Learning, pp. 2564-2572. PMLR, 2018.

B. Ravi Kiran, Ibrahim Sobh, Victor Talpaert, Patrick Mannion, Ahmad A. Al Sallab, Senthil Yogamani, and Patrick Pérez. Deep reinforcement learning for autonomous driving: A survey. IEEE Transactions on Intelligent Transportation Systems, 23(6):4909-4926, 2021.

Andrew J. Kurdila and Michael Zabarankin. Convex Functional Analysis. Springer Science & Business Media, 2006.

Matt J. Kusner, Joshua R. Loftus, Chris Russell, and Ricardo Silva. Counterfactual fairness, 2018.

Angelia Nedić and Asuman Ozdaglar. Approximate primal solutions and rate analysis for dual subgradient methods. SIAM Journal on Optimization, 19(4):1757-1780, 2009. doi: 10.1137/070708111. URL https://doi.org/10.1137/070708111.

Boris T. Polyak. Introduction to Optimization. 1987.

ProPublica. COMPAS recidivism risk score data and analysis, 2020. URL https://www.propublica.org/datastore/dataset/compas-recidivism-risk-score-data-and-analysis.

Alejandro Ribeiro. Ergodic stochastic optimization algorithms for wireless communication and networking. IEEE Transactions on Signal Processing, 58(12):6369-6386, 2010. doi: 10.1109/TSP.2010.2057247.

Alexander Robey, Luiz Chamon, George J. Pappas, Hamed Hassani, and Alejandro Ribeiro. Adversarial robustness with semi-infinite constrained learning. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 6198-6215. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper/2021/file/312ecfdfa8b239e076b114498ce21905-Paper.pdf.

R. Tyrrell Rockafellar. Conjugate Duality and Optimization. SIAM, 1974.

R. Tyrrell Rockafellar. Convex Analysis, volume 11. Princeton University Press, 1997.

Zebang Shen, Juan Cervino, Hamed Hassani, and Alejandro Ribeiro. An agnostic approach to federated learning with class imbalance. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=Xo0lbDt975.

N. Z. Shor. Nondifferentiable Optimization and Polynomial Problems. Nonconvex Optimization and Its Applications. Springer US, 2013. ISBN 9781475760156. URL https://books.google.com/books?id=_L_VBwAAQBAJ.

Victor Solo and Xuan Kong. Adaptive Signal Processing Algorithms: Stability and Performance. 1994. URL https://api.semanticscholar.org/CorpusID:61115048.

Mahdi Soltanolkotabi, Adel Javanmard, and Jason D. Lee.
Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Transactions on Information Theory, 65(2):742-769, 2018.

Cuong Tran, Ferdinando Fioretto, and Pascal Van Hentenryck. Differentially private and fair deep learning: A Lagrangian dual approach. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 9932-9939, 2021.

V. N. Vapnik. An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5):988-999, 1999. doi: 10.1109/72.788640.

Alexandre Velloso and Pascal Van Hentenryck. Combining deep learning and optimization for security-constrained optimal power flow. arXiv preprint arXiv:2007.07002, 2020.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Commun. ACM, 64(3):107-115, feb 2021. ISSN 0001-0782. doi: 10.1145/3446776. URL https://doi.org/10.1145/3446776.

A.1 ADDITIONAL DEFINITIONS

Definition A.1. We say that a functional $\ell_i : \mathcal{F} \to \mathbb{R}$ is Fréchet differentiable at $\phi_0 \in \mathcal{F}$ if there exists an operator $D_\phi\ell_i(\phi_0) \in B(\mathcal{F}, \mathbb{R})$ such that

$$\lim_{\|h\|_{L^2} \to 0}\ \frac{\big|\ell_i(\phi_0 + h) - \ell_i(\phi_0) - \langle D_\phi\ell_i(\phi_0), h\rangle\big|}{\|h\|_{L^2}} = 0,$$

where $B(\mathcal{F}, \mathbb{R})$ denotes the space of bounded linear operators from $\mathcal{F}$ to $\mathbb{R}$. The space $B(\mathcal{F}, \mathbb{R})$, the algebraic dual of $\mathcal{F}$, is equipped with the corresponding dual norm

$$\|B\|_{L^2} = \sup\left\{\frac{|\langle B, \phi\rangle|}{\|\phi\|_{L^2}} : \phi \in \mathcal{F},\ \|\phi\|_{L^2} \ne 0\right\},$$

which coincides with the $L^2$ norm through Riesz's representation theorem: there exists a unique $g \in \mathcal{F}$ such that $B(\phi) = \langle\phi, g\rangle$ for all $\phi$, and $\|B\|_{L^2} = \|g\|_{L^2}$.

Definition A.2. A function $h : X \to \mathbb{R}$ is said to be closed if for each $\alpha \in \mathbb{R}$, the sublevel set $\{x \in X : h(x) \le \alpha\}$ is a closed set.

Definition A.3. A convex function $h : X \to \mathbb{R}$ is proper if $h(x) > -\infty$ for all $x \in X$ and there exists $x_0 \in X$ such that $h(x_0) < +\infty$.

Definition A.4. Let $X$ be a Euclidean vector space. Given a convex function $h : X \to \mathbb{R} \cup \{\infty\}$, its Fenchel conjugate $h^* : X \to \mathbb{R} \cup \{\infty\}$ is defined as

$$h^*(y) = \sup_{x \in X}\ \langle x, y\rangle - h(x).$$

A.2 PROOF OF LEMMA A.1: DISTANCE BETWEEN DUAL FUNCTIONS

Lemma A.1. The pointwise distance between the parametrized and unparametrized dual functions is bounded by

$$0 \le g_p(\lambda) - g_u(\lambda) \le M\nu\big(1 + \|\lambda\|_1\big) \quad \forall \lambda \ge 0. \qquad (11)$$

As defined in Section 2.1, $\phi(\lambda)$ denotes the Lagrangian minimizer associated with the multiplier $\lambda$ in the unparametrized problem. By the near-universality assumption, there exists $\theta^\dagger \in \Theta$ such that $\|\phi(\lambda) - f_{\theta^\dagger}\|_{L^2} \le \nu$. Note that

$$L(f_{\theta^\dagger}, \lambda) - L(\phi(\lambda), \lambda) = \ell_0(f_{\theta^\dagger}) - \ell_0(\phi(\lambda)) + \lambda^T\big(\ell(f_{\theta^\dagger}) - \ell(\phi(\lambda))\big) \le \big|\ell_0(f_{\theta^\dagger}) - \ell_0(\phi(\lambda))\big| + \sum_{i=1}^m [\lambda]_i\,\big|\ell_i(f_{\theta^\dagger}) - \ell_i(\phi(\lambda))\big|,$$

where we used the triangle inequality twice. Then, using the $M$-Lipschitz continuity of the functionals $\ell_i$ and the fact that $\|\phi(\lambda) - f_{\theta^\dagger}\|_{L^2} \le \nu$, we obtain

$$L(f_{\theta^\dagger}, \lambda) - L(\phi(\lambda), \lambda) \le M\|f_{\theta^\dagger} - \phi(\lambda)\|_{L^2} + M\sum_{i=1}^m [\lambda]_i\,\|f_{\theta^\dagger} - \phi(\lambda)\|_{L^2} \le M\nu\Big(1 + \sum_{i=1}^m [\lambda]_i\Big) = M\nu\big(1 + \|\lambda\|_1\big).$$

Since $f_\theta(\lambda) \in \mathcal{F}^\star_\theta(\lambda)$ is a Lagrangian minimizer, we know that $L(f_\theta(\lambda), \lambda) \le L(f_{\theta^\dagger}, \lambda)$. Thus,

$$0 \le L(f_\theta(\lambda), \lambda) - L(\phi(\lambda), \lambda) \le L(f_{\theta^\dagger}, \lambda) - L(\phi(\lambda), \lambda),$$

where the non-negativity comes from the fact that $\mathcal{F}_\theta \subseteq \mathcal{F}$. This implies

$$0 \le g_p(\lambda) - g_u(\lambda) \le M\nu\big(1 + \|\lambda\|_1\big) \quad \forall \lambda \ge 0,$$

which concludes the proof.

A.3 PROOF OF LEMMA A.2: DIFFERENTIABILITY OF $g_u(\lambda_u)$

Lemma A.2. Under Assumption 3.1, the unparametrized dual function $g_u(\lambda)$ is everywhere differentiable with gradient $\nabla_\lambda g_u(\lambda) = \ell(\phi(\lambda))$.

From Assumption 3.1, $\ell_0(\phi)$ is strongly convex and $\lambda^T\ell(\phi)$ is a non-negative combination of convex functions. Thus, the Lagrangian $L(\phi, \lambda)$ is strongly convex in $\phi$ for any fixed dual variable $\lambda \in \mathbb{R}^m_+$. The convexity and compactness of $\mathcal{F}$ imply that, in the unparametrized problem, the Lagrangian functional attains its minimizer $\phi(\lambda)$ for each $\lambda$
(see, e.g., (Kurdila & Zabarankin, 2006), Theorem 7.3.1). Then, by the strong convexity of $L(\phi, \lambda)$, this minimizer is unique. Since $L(\phi, \lambda)$ is affine in $\lambda$, it is differentiable in $\lambda$. Then, by application of the generalized Danskin's theorem (see, e.g., (Başar & Bernhard, 2008), Corollary 10.1) to $g_u(\lambda)$ and using that the set of minimizers of $L(\phi, \lambda)$ is the singleton $\{\phi(\lambda)\}$, we obtain $\nabla_\lambda g_u(\lambda) = \ell(\phi(\lambda))$, which completes the proof.

A.4 PROOF OF LEMMA A.3: DISTANCE BETWEEN OPTIMAL DUAL VARIABLES

Lemma A.3. Under Assumptions 3.1, 3.2, 3.3, and 3.4, the proximity between the unparametrized and parametrized optimal dual variables is characterized by

$$\|\lambda^\star_p - \lambda^\star_u\|_2^2 \le \frac{2M\nu}{\mu_g}\big(1 + \|\lambda^\star_p\|_1\big). \qquad (12)$$

Since $g_u(\lambda)$ is differentiable (see Lemma A.2) and $\mu_g$-strongly concave for $\lambda \in H_\lambda$,

$$g_u(\lambda) \le g_u(\lambda^\star_u) + \nabla g_u(\lambda^\star_u)^T(\lambda - \lambda^\star_u) - \frac{\mu_g}{2}\|\lambda - \lambda^\star_u\|_2^2 \quad \forall \lambda \in H_\lambda.$$

From Lemma A.2 we have that $\nabla g_u(\lambda^\star_u) = \ell(\phi(\lambda^\star_u))$; evaluating at $\lambda^\star_p$ we obtain

$$g_u(\lambda^\star_p) \le g_u(\lambda^\star_u) + \ell(\phi(\lambda^\star_u))^T(\lambda^\star_p - \lambda^\star_u) - \frac{\mu_g}{2}\|\lambda^\star_p - \lambda^\star_u\|_2^2.$$

By complementary slackness, $\ell(\phi(\lambda^\star_u))^T\lambda^\star_u = 0$. Then, since $\phi(\lambda^\star_u)$ is feasible and $\lambda^\star_p \ge 0$, $\ell(\phi(\lambda^\star_u))^T\lambda^\star_p \le 0$. Thus,

$$g_u(\lambda^\star_p) \le g_u(\lambda^\star_u) - \frac{\mu_g}{2}\|\lambda^\star_p - \lambda^\star_u\|_2^2.$$

By Lemma A.1, $g_p(\lambda^\star_p) - M\nu(1 + \|\lambda^\star_p\|_1) \le g_u(\lambda^\star_p)$, which implies

$$g_p(\lambda^\star_p) - M\nu\big(1 + \|\lambda^\star_p\|_1\big) \le g_u(\lambda^\star_u) - \frac{\mu_g}{2}\|\lambda^\star_p - \lambda^\star_u\|_2^2,$$

$$\|\lambda^\star_p - \lambda^\star_u\|_2^2 \le \frac{2}{\mu_g}\big(g_u(\lambda^\star_u) - g_p(\lambda^\star_p)\big) + \frac{2}{\mu_g}M\nu\big(1 + \|\lambda^\star_p\|_1\big). \qquad (13)$$

Finally, since $\mathcal{F}_\theta \subseteq \mathcal{F}$, we have $g_u(\lambda) \le g_p(\lambda)$ for all $\lambda$. Evaluating at $\lambda^\star_u$ and using that $\lambda^\star_p$ maximizes $g_p$, we obtain

$$g_u(\lambda^\star_u) \le g_p(\lambda^\star_u) \le g_p(\lambda^\star_p).$$

Using this in equation 13 we obtain

$$\|\lambda^\star_p - \lambda^\star_u\|_2^2 \le \frac{2}{\mu_g}M\nu\big(1 + \|\lambda^\star_p\|_1\big).$$

A.5 PROOF OF PROPOSITION 3.2: PERTURBATION OF DUAL VARIABLES

Proposition 3.2. Under Assumptions 3.1-3.4, the distance between the constraint violations of $\phi(\lambda^\star_p)$ and $\phi(\lambda^\star_u)$ is bounded by

$$\big\|\ell(\phi(\lambda^\star_p)) - \ell(\phi(\lambda^\star_u))\big\|_2^2 \le \frac{2\beta_g^2}{\mu_g}M\nu\big(1 + \|\lambda^\star_p\|_1\big). \qquad (14)$$

The proof follows from straightforward applications of Lemmas A.2 and A.3. Since the dual function $g_u(\lambda)$ is $\beta_g$-smooth, we have

$$\big\|\ell(\phi(\lambda^\star_p)) - \ell(\phi(\lambda^\star_u))\big\|_2^2 = \big\|\nabla_\lambda g_u(\lambda^\star_p) - \nabla_\lambda g_u(\lambda^\star_u)\big\|_2^2 \le \beta_g^2\,\|\lambda^\star_p - \lambda^\star_u\|_2^2.$$

Then, the bound between optimal dual variables given in Lemma A.3 yields

$$\big\|\ell(\phi(\lambda^\star_p)) - \ell(\phi(\lambda^\star_u))\big\|_2^2 \le \frac{2\beta_g^2}{\mu_g}M\nu\big(1 + \|\lambda^\star_p\|_1\big),$$

which concludes the proof.

A.6 PROOF OF LEMMA 3.1: CURVATURE OF THE DUAL FUNCTION

Lemma 3.1. Under Assumptions 3.1, 3.2, 3.5, and 3.6, the unparametrized dual function $g_u(\lambda_u)$ is $\mu_g$-strongly concave and $\beta_g$-smooth on $H_\lambda$ with

$$\mu_g = \frac{\mu_0\,\sigma^2}{\beta^2(1+\Delta)^2} \quad \text{and} \quad \beta_g = \frac{\sqrt{m}\,M^2}{\mu_0},$$

where $\Delta = \max\big(\|\lambda^\star_u\|_1, \|\lambda^\star_p\|_1\big)$.

A.6.1 STRONG CONCAVITY CONSTANT $\mu_g$

As shown in Lemma A.2, the unparametrized Lagrangian has a unique minimizer $\phi(\lambda)$ for each $\lambda \in \mathbb{R}^m_+$. Let $\lambda_1, \lambda_2 \in H_\lambda$ and $\phi_1 = \phi(\lambda_1)$, $\phi_2 = \phi(\lambda_2)$. By convexity of the functionals $\ell_i : \mathcal{F} \to \mathbb{R}$ for $i = 1, \dots, m$, we have

$$\ell_i(\phi_2) \ge \ell_i(\phi_1) + \langle D_\phi\ell_i(\phi_1), \phi_2 - \phi_1\rangle, \qquad \ell_i(\phi_1) \ge \ell_i(\phi_2) + \langle D_\phi\ell_i(\phi_2), \phi_1 - \phi_2\rangle.$$

Multiplying the above inequalities by $[\lambda_1]_i \ge 0$ and $[\lambda_2]_i \ge 0$, respectively, and adding them, we obtain

$$\big\langle \ell(\phi_2) - \ell(\phi_1), \lambda_2 - \lambda_1\big\rangle \le \big\langle \lambda_2^T D_\phi\ell(\phi_2) - \lambda_1^T D_\phi\ell(\phi_1), \phi_2 - \phi_1\big\rangle. \qquad (16)$$

Since $\nabla_\lambda g_u(\lambda) = \ell(\phi(\lambda))$, we have that

$$\big\langle \nabla g_u(\lambda_2) - \nabla g_u(\lambda_1), \lambda_2 - \lambda_1\big\rangle \le \big\langle \lambda_2^T D_\phi\ell(\phi_2) - \lambda_1^T D_\phi\ell(\phi_1), \phi_2 - \phi_1\big\rangle. \qquad (17)$$

Moreover, the first-order optimality conditions yield

$$D_\phi\ell_0(\phi_1) + \lambda_1^T D_\phi\ell(\phi_1) = 0, \qquad D_\phi\ell_0(\phi_2) + \lambda_2^T D_\phi\ell(\phi_2) = 0, \qquad (18)$$

where $0$ denotes the null operator from $\mathcal{F}$ to $\mathbb{R}$ (see, e.g., (Kurdila & Zabarankin, 2006), Theorem 5.3.1). Combining equations 17 and 18 we obtain

$$\big\langle \nabla g_u(\lambda_2) - \nabla g_u(\lambda_1), \lambda_2 - \lambda_1\big\rangle \le -\big\langle D_\phi\ell_0(\phi_2) - D_\phi\ell_0(\phi_1), \phi_2 - \phi_1\big\rangle \le -\mu_0\|\phi_2 - \phi_1\|_{L^2}^2, \qquad (19)$$

where we used the $\mu_0$-strong convexity of $\ell_0$.
We will now obtain a lower bound on $\|\phi_2 - \phi_1\|_{L^2}$, starting from the $\beta$-smoothness of $\ell_0$:

$$\beta\|\phi_2 - \phi_1\|_{L^2} \ge \big\|D_\phi\ell_0(\phi_2) - D_\phi\ell_0(\phi_1)\big\|_{L^2} = \big\|\lambda_2^T D_\phi\ell(\phi_2) - \lambda_1^T D_\phi\ell(\phi_1)\big\|_{L^2} = \big\|(\lambda_2 - \lambda_1)^T D_\phi\ell(\phi_2) + \lambda_1^T\big(D_\phi\ell(\phi_2) - D_\phi\ell(\phi_1)\big)\big\|_{L^2}. \qquad (20)$$

The first term inside the norm can be characterized using Assumption 3.6:

$$\big\|(\lambda_2 - \lambda_1)^T D_\phi\ell(\phi_2)\big\|_{L^2} \ge \sigma\|\lambda_2 - \lambda_1\|_2. \qquad (21)$$

For the second term, using the $\beta$-smoothness of the $\ell_i$ we can derive

$$\big\|\lambda_1^T\big(D_\phi\ell(\phi_1) - D_\phi\ell(\phi_2)\big)\big\|_{L^2} = \Big\|\sum_{i=1}^m [\lambda_1]_i\big(D_\phi\ell_i(\phi_1) - D_\phi\ell_i(\phi_2)\big)\Big\|_{L^2} \le \sum_{i=1}^m [\lambda_1]_i\,\big\|D_\phi\ell_i(\phi_1) - D_\phi\ell_i(\phi_2)\big\|_{L^2} \le \sum_{i=1}^m [\lambda_1]_i\,\beta\,\|\phi_1 - \phi_2\|_{L^2} = \beta\|\lambda_1\|_1\|\phi_1 - \phi_2\|_{L^2}.$$

Then, using the reverse triangle inequality,

$$\big\|(\lambda_2 - \lambda_1)^T D_\phi\ell(\phi_2) + \lambda_1^T\big(D_\phi\ell(\phi_2) - D_\phi\ell(\phi_1)\big)\big\|_{L^2} \ge \big\|(\lambda_2 - \lambda_1)^T D_\phi\ell(\phi_2)\big\|_{L^2} - \big\|\lambda_1^T\big(D_\phi\ell(\phi_1) - D_\phi\ell(\phi_2)\big)\big\|_{L^2} \ge \sigma\|\lambda_2 - \lambda_1\|_2 - \beta\|\lambda_1\|_1\|\phi_2 - \phi_1\|_{L^2}.$$

Combining this with equation 20 we obtain

$$\beta\|\phi_2 - \phi_1\|_{L^2} \ge \sigma\|\lambda_2 - \lambda_1\|_2 - \beta\|\lambda_1\|_1\|\phi_2 - \phi_1\|_{L^2} \implies \|\phi_2 - \phi_1\|_{L^2} \ge \frac{\sigma}{\beta\big(1 + \|\lambda_1\|_1\big)}\|\lambda_2 - \lambda_1\|_2. \qquad (24)$$

This means that we can write equation 19 as

$$\big\langle \nabla g_u(\lambda_2) - \nabla g_u(\lambda_1), \lambda_2 - \lambda_1\big\rangle \le -\frac{\mu_0\,\sigma^2}{\beta^2\big(1 + \|\lambda_1\|_1\big)^2}\|\lambda_2 - \lambda_1\|_2^2.$$

Since $\|\lambda_1\|_1 \le \max\{\|\lambda^\star_u\|_1, \|\lambda^\star_p\|_1\}$ for $\lambda_1 \in H_\lambda$, we obtain that the strong concavity constant of $g_u$ on $H_\lambda$ is $\mu_g = \mu_0\sigma^2\big/\big(\beta^2(1 + \max\{\|\lambda^\star_u\|_1, \|\lambda^\star_p\|_1\})^2\big)$. A similar proof in the finite-dimensional case can be found in (Guigues, 2020).

A.6.2 SMOOTHNESS CONSTANT $\beta_g$

Set $\lambda_1, \lambda_2 \in \mathbb{R}^m_+$, and let $\phi_1 = \phi(\lambda_1)$ and $\phi_2 = \phi(\lambda_2)$ denote the Lagrangian minimizers associated with these multipliers. Since the unparametrized Lagrangian is differentiable and $\mu_0$-strongly convex, we have

$$L(f, \lambda) \ge L(\phi(\lambda), \lambda) + \big\langle D_\phi L(\phi(\lambda), \lambda), f - \phi(\lambda)\big\rangle + \frac{\mu_0}{2}\|f - \phi(\lambda)\|_{L^2}^2.$$

Using that $\phi(\lambda)$ is a minimizer, we obtain (see, e.g., (Kurdila & Zabarankin, 2006), Theorem 5.3.1)

$$L(\phi(\lambda), \lambda) \le L(f, \lambda) - \frac{\mu_0}{2}\|f - \phi(\lambda)\|_{L^2}^2 \quad \forall f \in \mathcal{F}.$$

Applying this to $\phi_2$ and $\phi_1$ we obtain

$$\ell_0(\phi_2) + \lambda_2^T\ell(\phi_2) \le \ell_0(\phi_1) + \lambda_2^T\ell(\phi_1) - \frac{\mu_0}{2}\|\phi_2 - \phi_1\|_{L^2}^2,$$

$$\ell_0(\phi_1) + \lambda_1^T\ell(\phi_1) \le \ell_0(\phi_2) + \lambda_1^T\ell(\phi_2) - \frac{\mu_0}{2}\|\phi_2 - \phi_1\|_{L^2}^2.$$

Summing the above inequalities and applying Cauchy-Schwarz,

$$\mu_0\|\phi_2 - \phi_1\|_{L^2}^2 \le (\lambda_2 - \lambda_1)^T\big(\ell(\phi_1) - \ell(\phi_2)\big) \le \|\lambda_2 - \lambda_1\|_2\,\big\|\ell(\phi_1) - \ell(\phi_2)\big\|_2 \le \sqrt{m}\,M\,\|\lambda_2 - \lambda_1\|_2\,\|\phi_1 - \phi_2\|_{L^2},$$

where the last inequality follows from Assumption 3.1, so that $\|\phi_2 - \phi_1\|_{L^2} \le (\sqrt{m}M/\mu_0)\|\lambda_2 - \lambda_1\|_2$. Then, applying Lemma A.2 we obtain

$$\big\|\nabla_\lambda g_u(\lambda_2) - \nabla_\lambda g_u(\lambda_1)\big\|_2 = \big\|\ell(\phi_2) - \ell(\phi_1)\big\|_2 \le M\|\phi_2 - \phi_1\|_{L^2} \le \frac{\sqrt{m}\,M^2}{\mu_0}\|\lambda_2 - \lambda_1\|_2,$$

which means that $g_u$ has smoothness constant $\beta_g = \sqrt{m}\,M^2/\mu_0$.

A.7 PROOF OF LEMMA A.4

Lemma A.4. Let $P^*$ denote the Fenchel conjugate of the perturbation function $P(\epsilon)$. For every $\lambda \in \mathbb{R}^m_+$ we have that $P^*(\lambda) = -g_u(\lambda)$.

By definition of the Fenchel conjugate,

$$P^*(\lambda) = \sup_\epsilon\ \lambda^T\epsilon - P(\epsilon). \qquad (25)$$

Using the definition of the perturbation function $P(\epsilon)$ we obtain

$$P^*(\lambda) = \sup_{\phi \in \mathcal{F},\,\epsilon}\ \lambda^T\epsilon - \ell_0(\phi) \quad \text{s.to} \quad \ell(\phi) + \epsilon \le 0. \qquad (26)$$

Applying the change of variable $z = \ell(\phi) + \epsilon$, $P^*(\lambda)$ can be written as

$$P^*(\lambda) = \sup_{\phi \in \mathcal{F},\,z}\ \lambda^T z - \lambda^T\ell(\phi) - \ell_0(\phi) \quad \text{s.to} \quad z \le 0. \qquad (27)$$

Since $z \le 0$, the term $\lambda^T z$ is unbounded above whenever $\lambda$ has a negative component. Thus, we restrict the domain of $P^*(\lambda)$ to $\lambda \ge 0$. In this region, maximizing over $z \le 0$ yields $z = 0$. We can thus write $P^*(\lambda)$ as

$$P^*(\lambda) = \sup_{\phi \in \mathcal{F}}\ -\lambda^T\ell(\phi) - \ell_0(\phi) = -\inf_{\phi \in \mathcal{F}}\ \lambda^T\ell(\phi) + \ell_0(\phi), \quad \lambda \ge 0. \qquad (28)$$

Therefore, $P^*(\lambda) = -g_u(\lambda)$ for all $\lambda \ge 0$. Similar versions of this result can be found in (Rockafellar, 1997), Section 28, (Guigues, 2020), Lemma 2.9, and (Rockafellar, 1974), Theorem 7.

A.8 PROOF OF COROLLARY A.1: CURVATURE OF THE PERTURBATION FUNCTION

Corollary A.1. Let $B_\epsilon = \{\gamma\epsilon_u + (1-\gamma)\epsilon_p : \gamma \in [0,1]\}$ denote the segment connecting $\epsilon_u$ and $\epsilon_p$. The perturbation function $P(\epsilon)$ is $\mu_\epsilon$-strongly convex on $B_\epsilon$ with constant $\mu_\epsilon = 1/\beta_g$.

We begin by stating a well-known lemma on the duality between smoothness and strong convexity.

Lemma A.5. Let $h$ be a closed convex function defined on a subset of the vector space $X$; $h$ is $\mu$-strongly convex if and only if its Fenchel conjugate $h^*$ has $(1/\mu)$-Lipschitz continuous gradients (see, e.g., (Kakade et al., 2009) or (Goebel & Rockafellar, 2008)).
In order to apply Lemma A.5, we need to show that the perturbation function $P(\epsilon)$ is convex and closed in the region of interest. Convexity of $P(\epsilon)$ for convex functional programs is shown in (Bonnans & Shapiro, 1998) and in (Rockafellar, 1997), Theorem 29.1. We now show that $P(\epsilon)$ is proper and lower semicontinuous in the region of interest, which implies that it is closed.

The functional $\ell_0$, defined on the compact set $\mathcal{F}$, is smooth; thus, it is bounded on $\mathcal{F}$. From Assumption 3.2 we have that the problem is feasible for $\epsilon = 0$; therefore, $P(0) < +\infty$. Moreover, by the boundedness of $\ell_0$, $P(\epsilon) > -\infty$ for all $\epsilon$, implying that $P(\epsilon)$ is proper. Now, fix $\epsilon_0 \in B_\epsilon$. Assumption 3.2 implies that the perturbed problem with constraint $\ell(\phi) + \epsilon_0 \le 0$ is strictly feasible. Since this perturbed problem is convex and strictly feasible, its perturbation function $\tilde P(\epsilon)$ is lower semicontinuous at $0$ (see (Bonnans & Shapiro, 1998), Theorem 4.2). Note that $\tilde P(\epsilon) = P(\epsilon + \epsilon_0)$; thus, $P(\epsilon)$ is lower semicontinuous at $\epsilon_0$. We conclude that $P(\epsilon)$ is proper and lower semicontinuous for all $\epsilon \in B_\epsilon$. On the other hand, from Lemmas A.4 and 3.1, $P^*(\lambda) = -g_u(\lambda)$ is $\beta_g$-smooth on $\mathbb{R}^m_+$. Thus, we are in the hypothesis of Lemma A.5, which implies that $P(\epsilon)$ is strongly convex on $B_\epsilon$ with constant $1/\beta_g$.

A.9 PROOF OF PROPOSITION A.1: SUBGRADIENTS OF $P$

Proposition A.1. Under Assumptions 3.1 and 3.2, $\lambda^\star_p$ is a subgradient of the perturbation function at $\epsilon_u$, that is, $\lambda^\star_p \in \partial P(\epsilon_u)$.

The conjugate nature of the dual function $g_u(\lambda)$ and the perturbation function $P(\epsilon)$ also establishes a dependence between their first-order variations. This dependence is captured in the following lemma.

Lemma A.6. If $h$ is a closed convex function, the subdifferential $\partial h^*$ is the inverse of $\partial h$ in the sense of multivalued mappings (see (Rockafellar, 1997), Corollary 23.5.1):

$$x \in \partial h^*(y) \iff y \in \partial h(x).$$

On one hand, from Lemma A.2, we have that $\nabla_\lambda g_u(\lambda^\star_p) = \ell(\phi(\lambda^\star_p)) = -\epsilon_u$. On the other hand, from Lemma A.4, $P^*(\lambda) = -g_u(\lambda)$ for all $\lambda \in \mathbb{R}^m_+$. Taking the gradient with respect to $\lambda$ and evaluating at $\lambda^\star_p$ we obtain $\nabla_\lambda P^*(\lambda^\star_p) = \epsilon_u$. Then, Lemma A.6 yields the sensitivity result $\lambda^\star_p \in \partial P(\epsilon_u)$.

A.10 PROOF OF PROPOSITION A.2: DISTANCE BETWEEN OPTIMAL VALUES

Proposition A.2. Under Assumptions 3.1 and 3.3, the difference between the optimal values of the problems perturbed by $\epsilon_p$ and $\epsilon_u$ is bounded:

$$P(\epsilon_p) - P(\epsilon_u) \le M\nu\big(1 + \|\lambda^\star_p\|_1\big) + \lambda^{\star T}_p(\epsilon_p - \epsilon_u).$$

Recall that $\epsilon_u = -\ell(\phi(\lambda^\star_p))$ and $\epsilon_p = -\ell(f_\theta(\lambda^\star_p))$. We start by showing that $P(\epsilon_p) \le \ell_0(f_\theta(\lambda^\star_p))$. Note that $f_\theta(\lambda^\star_p)$ is feasible in the problem perturbed by $\epsilon_p$, since its constraint value is $-\epsilon_p$. Then,

$$P(\epsilon_p) = \min_{\phi \in \mathcal{F}}\ \big\{\ell_0(\phi) : \ell(\phi) + \epsilon_p \le 0\big\} \le \ell_0(f_\theta(\lambda^\star_p)).$$

Therefore,

$$P(\epsilon_p) - P(\epsilon_u) \le \ell_0(f_\theta(\lambda^\star_p)) - P(\epsilon_u). \qquad (29)$$

Note that the dual function of the problem perturbed by $\epsilon_u$ is $g_u(\lambda, \epsilon_u) := \min_{\phi \in \mathcal{F}}\{\ell_0(\phi) + \lambda^T(\ell(\phi) + \epsilon_u)\}$. Then, weak duality implies that $P(\epsilon_u) \ge g_u(\lambda, \epsilon_u)$ for all $\lambda \ge 0$. Evaluating at $\lambda^\star_p$ we obtain

$$P(\epsilon_u) \ge \min_{\phi \in \mathcal{F}}\ \big\{\ell_0(\phi) + \lambda^{\star T}_p\big(\ell(\phi) + \epsilon_u\big)\big\} = \min_{\phi \in \mathcal{F}}\ \big\{\ell_0(\phi) + \lambda^{\star T}_p\ell(\phi)\big\} + \lambda^{\star T}_p\epsilon_u = g_u(\lambda^\star_p) + \lambda^{\star T}_p\epsilon_u. \qquad (30)$$

Combining equations 30 and 29 we obtain

$$P(\epsilon_p) - P(\epsilon_u) \le \ell_0(f_\theta(\lambda^\star_p)) - g_u(\lambda^\star_p) - \lambda^{\star T}_p\epsilon_u = \ell_0(f_\theta(\lambda^\star_p)) - \lambda^{\star T}_p\epsilon_p - g_u(\lambda^\star_p) + \lambda^{\star T}_p(\epsilon_p - \epsilon_u). \qquad (31)$$

Recall that $\epsilon_p = -\ell(f_\theta(\lambda^\star_p))$. Then, we can identify the parametrized dual function $g_p(\lambda^\star_p) = \ell_0(f_\theta(\lambda^\star_p)) - \lambda^{\star T}_p\epsilon_p$ and write equation 31 as

$$P(\epsilon_p) - P(\epsilon_u) \le g_p(\lambda^\star_p) - g_u(\lambda^\star_p) + \lambda^{\star T}_p(\epsilon_p - \epsilon_u).$$

Finally, leveraging the bound between dual functions from Lemma A.1 we obtain

$$P(\epsilon_p) - P(\epsilon_u) \le M\nu\big(1 + \|\lambda^\star_p\|_1\big) + \lambda^{\star T}_p(\epsilon_p - \epsilon_u),$$

which concludes the proof.
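Since Lemmas A.4-A.6 rest on conjugate duality, it may help to recall a standard worked instance of Definition A.4 (a textbook example, not taken from the paper). For the $\mu$-strongly convex function $h(x) = \frac{\mu}{2}\|x\|_2^2$, the supremum in the conjugate is attained at $x = y/\mu$ (setting the gradient $y - \mu x$ to zero), so

$$h^*(y) = \sup_x\ \langle x, y\rangle - \frac{\mu}{2}\|x\|_2^2 = \frac{1}{2\mu}\|y\|_2^2.$$

Here $h$ is $\mu$-strongly convex while $h^*$ has a $(1/\mu)$-Lipschitz gradient, which is exactly the smoothness/strong-convexity duality invoked in Lemma A.5 and Corollary A.1.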
A.11 PROOF OF PROPOSITION 3.3: FUNCTION CLASS PERTURBATION

Proposition 3.3. Under Assumptions 3.1-3.4, the distance between the constraint violations associated with the parametrization of the hypothesis class is bounded by

$$\big\|\ell(\phi(\lambda^\star_p)) - \ell(f_\theta(\lambda^\star_p))\big\|_2^2 \le 2\beta_g\,M\nu\big(1 + \|\lambda^\star_p\|_1\big).$$

Let $\Delta\epsilon = \epsilon_p - \epsilon_u$. Using the strong convexity constant obtained in Corollary A.1, we have that

$$P(\epsilon_p) \ge P(\epsilon_u) + s^T\Delta\epsilon + \frac{1}{2\beta_g}\|\Delta\epsilon\|_2^2,$$

where $s \in \partial P(\epsilon_u)$ is a subgradient of $P(\epsilon)$ at $\epsilon_u$. From Proposition A.1 we know that $\lambda^\star_p \in \partial P(\epsilon_u)$. Thus,

$$P(\epsilon_p) \ge P(\epsilon_u) + \lambda^{\star T}_p\Delta\epsilon + \frac{1}{2\beta_g}\|\Delta\epsilon\|_2^2.$$

Using the bound on $P(\epsilon_p) - P(\epsilon_u)$ obtained in Proposition A.2, we can write

$$M\nu\big(1 + \|\lambda^\star_p\|_1\big) + \lambda^{\star T}_p\Delta\epsilon \ge \lambda^{\star T}_p\Delta\epsilon + \frac{1}{2\beta_g}\|\Delta\epsilon\|_2^2 \implies M\nu\big(1 + \|\lambda^\star_p\|_1\big) \ge \frac{1}{2\beta_g}\|\Delta\epsilon\|_2^2.$$

This implies

$$\|\Delta\epsilon\|_2^2 = \big\|\ell(\phi(\lambda^\star_p)) - \ell(f_\theta(\lambda^\star_p))\big\|_2^2 \le 2\beta_g\,M\nu\big(1 + \|\lambda^\star_p\|_1\big),$$

which concludes the proof.

A.12 PROOF OF PROPOSITION 3.1: 2-NORM NEAR-FEASIBILITY

Proposition 3.1. Under Assumptions 3.1-3.4, for all $f_\theta(\lambda^\star_p) \in \mathcal{F}^\star_\theta(\lambda^\star_p)$, the distance between the unparametrized and parametrized constraint violations is bounded by

$$\big\|\ell(f_\theta(\lambda^\star_p)) - \ell(\phi^\star)\big\|_2^2 \le 2\beta_g\Big(1 + \sqrt{\beta_g/\mu_g}\Big)^2 M\nu\big(1 + \|\lambda^\star_p\|_1\big).$$

Proposition 3.1 stems from combining the feasibility bounds in Propositions 3.2 and 3.3:

$$\big\|\ell(\phi(\lambda^\star_p)) - \ell(\phi^\star)\big\|_2 \le \beta_g\sqrt{\frac{2}{\mu_g}M\nu\big(1 + \|\lambda^\star_p\|_1\big)}, \qquad (32)$$

$$\big\|\ell(\phi(\lambda^\star_p)) - \ell(f_\theta(\lambda^\star_p))\big\|_2 \le \sqrt{2\beta_g\,M\nu\big(1 + \|\lambda^\star_p\|_1\big)}. \qquad (33)$$

Combining the above equations through the triangle inequality we obtain

$$\big\|\ell(f_\theta(\lambda^\star_p)) - \ell(\phi(\lambda^\star_u))\big\|_2 \le \Big(1 + \sqrt{\beta_g/\mu_g}\Big)\sqrt{2\beta_g\,M\nu\big(1 + \|\lambda^\star_p\|_1\big)}. \qquad (34)$$

Taking squares on both sides yields the desired result.

A.13 PROOF OF THEOREM 3.1: NEAR-FEASIBILITY AND NEAR-OPTIMALITY

Theorem 3.1. Under Assumptions 3.1, 3.2, 3.3, 3.5, and 3.6, the sub-optimality and near-feasibility of all $f_\theta(\lambda^\star_p) \in \mathcal{F}^\star_\theta(\lambda^\star_p)$ are bounded by

$$\big\|\ell(f_\theta(\lambda^\star_p)) - \ell(\phi^\star)\big\|_\infty \le M\sqrt{m}\,\big[1 + \kappa_1\kappa_0(1+\Delta)\big]\sqrt{\frac{2M\nu}{\mu_0}\big(1 + \|\lambda^\star_p\|_1\big)} := \Gamma_2 \qquad (36)$$

$$\big|P^\star_p - \ell_0(f_\theta(\lambda^\star_p))\big| \le \|\lambda^\star_p\|_1\Gamma_2 + \big(1 + \|\lambda^\star_p\|_1\big)M\nu + \Gamma_1 \qquad (37)$$

with $\kappa_1 = M/\sigma$, $\kappa_0 = \beta/\mu_0$, $\Delta = \max\{\|\lambda^\star_u\|_1, \|\lambda^\star_p\|_1\}$, and $\Gamma_1$ as in equation 4.

A.13.1 NEAR-FEASIBILITY

Recall that Lemma 3.1 characterizes the strong concavity constant $\mu_g$ and smoothness constant $\beta_g$ of the dual function in terms of the properties of the losses $\ell_i$ and the functional space $\mathcal{F}$. The proof of this theorem stems from applying Lemma 3.1 to the 2-norm bound in Proposition 3.1. We start by observing that

$$\big\|\ell(f_\theta(\lambda^\star_p)) - \ell(\phi(\lambda^\star_u))\big\|_\infty \le \big\|\ell(f_\theta(\lambda^\star_p)) - \ell(\phi(\lambda^\star_u))\big\|_2 \qquad (38)$$

$$\le \Big(1 + \sqrt{\beta_g/\mu_g}\Big)\sqrt{2\beta_g\,M\nu\big(1 + \|\lambda^\star_p\|_1\big)}. \qquad (39)$$

From Lemma 3.1, we have that $\mu_g = \mu_0\sigma^2/\big(\beta^2(1+\Delta)^2\big)$ and $\beta_g = \sqrt{m}M^2/\mu_0$, which implies

$$\sqrt{\frac{\beta_g}{\mu_g}} = m^{1/4}\,\frac{M}{\sigma}\,\frac{\beta}{\mu_0}\,(1+\Delta),$$

where $\Delta = \max\{\|\lambda^\star_u\|_1, \|\lambda^\star_p\|_1\}$. Plugging this into equation 39, we obtain

$$\big\|\ell(f_\theta(\lambda^\star_p)) - \ell(\phi(\lambda^\star_u))\big\|_\infty \le M m^{1/4}\sqrt{\frac{2M\nu}{\mu_0}\big(1 + \|\lambda^\star_p\|_1\big)}\left(1 + m^{1/4}\,\frac{M}{\sigma}\,\frac{\beta}{\mu_0}\,(1+\Delta)\right) \le M\sqrt{m}\left[1 + \frac{M}{\sigma}\,\frac{\beta}{\mu_0}\,(1+\Delta)\right]\sqrt{\frac{2M\nu}{\mu_0}\big(1 + \|\lambda^\star_p\|_1\big)}.$$

Finally, using the definitions of the condition numbers $\kappa_1 = M/\sigma$ and $\kappa_0 = \beta/\mu_0$ we obtain

$$\big\|\ell(f_\theta(\lambda^\star_p)) - \ell(\phi(\lambda^\star_u))\big\|_\infty \le M\sqrt{m}\,\big[1 + \kappa_1\kappa_0(1+\Delta)\big]\sqrt{\frac{2M\nu}{\mu_0}\big(1 + \|\lambda^\star_p\|_1\big)}, \qquad (40)$$

which is the desired near-feasibility bound.

A.13.2 NEAR-OPTIMALITY

To derive the near-optimality bound, we combine equation 40 with the duality gap bound from (Chamon et al., 2023, Prop. 3.3),

$$P^\star_p - D^\star_p \le M\nu\big(1 + \|\bar\lambda\|_1\big) := \Gamma_1, \qquad (41)$$

where $\bar\lambda$ maximizes $\bar g_p(\lambda) = g_p(\lambda) + M\nu\|\lambda\|_1$. Since $P^\star_p \ge P^\star_u$ and $D^\star_p = g_p(\lambda^\star_p) = \ell_0(f_\theta(\lambda^\star_p)) + \lambda^{\star T}_p\ell(f_\theta(\lambda^\star_p))$, we have

$$P^\star_u - D^\star_p \le \Gamma_1 \implies \ell_0(\phi^\star) \le \ell_0(f_\theta(\lambda^\star_p)) + \lambda^{\star T}_p\ell(f_\theta(\lambda^\star_p)) + \Gamma_1.$$

Then, using that the solution $\phi^\star$ of the unparametrized problem is feasible (i.e., $\ell(\phi^\star) \le 0$) and $\lambda^\star_p \ge 0$, we obtain

$$\ell_0(\phi^\star) - \ell_0(f_\theta(\lambda^\star_p)) \le \lambda^{\star T}_p\big(\ell(f_\theta(\lambda^\star_p)) - \ell(\phi^\star)\big) + \Gamma_1 \le \|\lambda^\star_p\|_1\,\big\|\ell(\phi^\star) - \ell(f_\theta(\lambda^\star_p))\big\|_\infty + \Gamma_1.$$

To conclude the derivation, note that the $\nu$-universality and $M$-Lipschitz continuity (Assumptions 3.3 and 3.1) imply that there exists $\theta^\dagger$ such that $|\ell_i(\phi^\star) - \ell_i(f_{\theta^\dagger})| \le M\nu$ for all $i = 0, \dots, m$.
A.13.2 NEAR-OPTIMALITY

To derive the near-optimality bound, we combine equation (40) with the duality gap bound from (Chamon et al., 2023, Prop. 3.3):
$$P^\star_p - D^\star_p \leq M\nu(1 + \|\tilde{\lambda}^\star\|_1) := \Gamma_1, \quad (41)$$
where $\tilde{\lambda}^\star$ maximizes $\tilde{g}_p(\lambda) = g_p(\lambda) + M\nu\|\lambda\|_1$. Since $P^\star_p \geq P^\star_u$ we have:
$$P^\star_u - D^\star_p \leq \Gamma_1 \implies \ell_0(\phi^\star) - \ell_0(f_\theta(\lambda^\star_p)) - \lambda^{\star T}_p\ell(f_\theta(\lambda^\star_p)) \leq \Gamma_1.$$
Then, using that the solution of the unparametrized problem $\phi^\star$ is feasible (i.e., $\ell(\phi^\star) \leq 0$) and $\lambda^\star_p \geq 0$, we obtain
$$\ell_0(\phi^\star) - \ell_0(f_\theta(\lambda^\star_p)) \leq \lambda^{\star T}_p\left(\ell(f_\theta(\lambda^\star_p)) - \ell(\phi^\star)\right) + \Gamma_1 \leq \|\lambda^\star_p\|_1\|\ell(\phi^\star) - \ell(f_\theta(\lambda^\star_p))\|_\infty + \Gamma_1.$$
To conclude the derivation, note that the $\nu$-universality and $M$-Lipschitz continuity (Assumptions 3.3 and 3.1) imply that there exists $\theta^\dagger$ such that $|\ell_i(\phi^\star) - \ell_i(f_{\theta^\dagger})| \leq M\nu$ for all $i = 0, \ldots, m$. Thus,
$$\ell_0(\phi^\star) - \ell_0(f_\theta(\lambda^\star_p)) = \left[\ell_0(\phi^\star) - \ell_0(f_{\theta^\dagger})\right] + \left[\ell_0(f_{\theta^\dagger}) - \ell_0(f_\theta(\lambda^\star_p))\right] \geq -M\nu + \ell_0(f_{\theta^\dagger}) - \ell_0(f_\theta(\lambda^\star_p)). \quad (42)$$
Note that the duality gap bound implies the approximate saddle-point relation:
$$\mathcal{L}(f_\theta(\lambda^\star_p), \lambda^\star_u) - \Gamma_1 \leq \mathcal{L}(f_\theta(\lambda^\star_p), \lambda^\star_p) \leq \mathcal{L}(f_{\theta^\dagger}, \lambda^\star_p) + \Gamma_1.$$
Applying the right-hand side of this inequality to equation (42), together with $\ell(f_{\theta^\dagger}) \leq \ell(\phi^\star) + M\nu\mathbb{1}$, we obtain:
$$\ell_0(\phi^\star) - \ell_0(f_\theta(\lambda^\star_p)) \geq \lambda^{\star T}_p\left(\ell(f_\theta(\lambda^\star_p)) - \ell(\phi^\star)\right) - \Gamma_1 - (1 + \|\lambda^\star_p\|_1)M\nu$$
$$\geq -\|\lambda^\star_p\|_1\|\ell(\phi^\star) - \ell(f_\theta(\lambda^\star_p))\|_\infty - (1 + \|\lambda^\star_p\|_1)M\nu - \Gamma_1.$$
Combining both directions with the near-feasibility bound (40) completes the proof.

A.14 PROOF OF LEMMA 4.1: BEST ITERATE CONVERGENCE

Lemma 4.1. Let $g^{best}_p(t|\lambda(t_0)) = \max_{s \in [t_0, t]} g_p(\lambda(s))$ be the maximum value of the parametrized dual function up to time $t$. Then,
$$\lim_{t \to \infty} g^{best}_p(t|\lambda(t_0)) \geq D^\star_p - \frac{\eta S^2}{2} - \rho \quad \text{a.s.},$$
where $S^2 \geq E[\|\hat{s}(t)\|^2 \mid \lambda(t)]$ is an upper bound on the second-order moment of the stochastic supergradients.

A similar proof in the context of resource allocation for wireless communications can be found in (Ribeiro, 2010, Theorem 2). To ease notation, we denote the value of the parametrized dual function at iteration $t$ by $g(t) := g_p(\lambda(t))$. Similarly, $g^{best}(t)$ denotes the largest value of $g(t)$ encountered up to time $t$.
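For concreteness, the scheme this lemma analyzes can be sketched on the same toy program used earlier. This is a minimal, runnable rendition of the dual ascent loop of Section 4, not the implementation used in the experiments; the Gaussian noise model for the supergradients is our own assumption.

```python
# Stochastic supergradient dual ascent with best-iterate tracking, on the
# toy program min (x-2)^2 s.t. x <= 1 (so D* = 1 at lam* = 2). The noise
# emulates the stochastic supergradients; its scale is arbitrary.
import numpy as np

rng = np.random.default_rng(0)
eta, T = 0.05, 5000
lam, lam_best, g_best = 0.0, 0.0, -np.inf

for t in range(T):
    x = 2 - lam / 2                       # Lagrangian minimizer of (x-2)^2 + lam*(x-1)
    s_hat = (x - 1) + rng.normal(0, 0.1)  # noisy constraint slack = supergradient
    lam = max(0.0, lam + eta * s_hat)     # projected supergradient ascent step
    g = lam - lam ** 2 / 4                # dual value g(lam), closed form here
    if g > g_best:
        g_best, lam_best = g, lam         # track the best iterate

print(f"best dual value {g_best:.4f} vs D* = 1.0, lam_best = {lam_best:.3f}")
```

The best dual value settles within the $\eta S^2/2 + \rho$ floor of $D^\star_p$ predicted by the lemma (here $\rho = 0$, since the toy supergradient is unbiased).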
We start by deriving a recursive inequality between the distances of the iterates $\lambda(t)$ and an optimal dual variable $\lambda^\star_p \in \arg\max_{\lambda \geq 0} g_p(\lambda)$.

Proposition A.3. Consider the dual ascent algorithm described in Section 4 using a constant step size $\eta > 0$. Then,
$$E\{\|\lambda(t+1) - \lambda^\star_p\|^2 \mid \lambda(t)\} \leq \|\lambda(t) - \lambda^\star_p\|^2 + \eta^2 S^2 - 2\eta(D^\star_p - g(t) - \rho). \quad (43)$$

We delay the proof of Proposition A.3 to Section A.14.1. Observe that as the optimality gap $D^\star_p - g(t)$ decreases, the fixed term $\eta^2 S^2$ dominates the right-hand side of equation (43), suggesting convergence of $\lambda(t)$ only to a neighborhood of $\lambda^\star_p$. The main obstacle in making this precise is that Proposition A.3 bounds the expected value of $\|\lambda(t+1) - \lambda^\star_p\|^2$, whereas we wish to establish almost sure convergence. This can be addressed by leveraging the Supermartingale Convergence Theorem (see, e.g., (Solo & Kong, 1994, Theorem E7.4)), which we state here for completeness.

Theorem A.1. Consider nonnegative stochastic processes $A(N)$ and $B(N)$ with realizations $\alpha(N)$ and $\beta(N)$ having values $\alpha(t) \geq 0$ and $\beta(t) \geq 0$, and a sequence of nested $\sigma$-algebras $\mathcal{A}(0:t)$ measuring at least $\alpha(0:t)$ and $\beta(0:t)$. If
$$E[\alpha(t+1) \mid \mathcal{A}(0:t)] \leq \alpha(t) - \beta(t), \quad (44)$$
the sequence $\alpha(t)$ converges almost surely and $\beta(t)$ is almost surely summable, i.e., $\sum_{u=1}^\infty \beta(u) < \infty$ a.s.

We define $\alpha(t)$ and $\beta(t)$ as follows:
$$\alpha(t) := \|\lambda(t) - \lambda^\star_p\|^2 \, \mathbb{I}\left\{D^\star_p - g^{best}(t) > \frac{\eta S^2}{2} + \rho\right\}$$
$$\beta(t) := \left[2\eta(D^\star_p - g(t) - \rho) - \eta^2 S^2\right] \mathbb{I}\left\{D^\star_p - g^{best}(t) > \frac{\eta S^2}{2} + \rho\right\}$$
Note that $\alpha(t)$ tracks $\|\lambda(t) - \lambda^\star_p\|^2$ until the optimality gap $D^\star_p - g^{best}(t)$ falls below the threshold $\eta S^2/2 + \rho$, after which it is set to 0. Similarly, $\beta(t)$ tracks $2\eta(D^\star_p - g(t) - \rho) - \eta^2 S^2$ until the optimality gap falls below the same threshold and is then set to 0. Clearly $\alpha(t) \geq 0$, since it is the product of a norm and an indicator function. The same holds for $\beta(t)$: whenever the indicator evaluates to 1 we have $D^\star_p - g(t) \geq D^\star_p - g^{best}(t) > \eta S^2/2 + \rho$, so that $2\eta(D^\star_p - g(t) - \rho) - \eta^2 S^2 > 0$. We thus have $\alpha(t), \beta(t) \geq 0$ for all $t$.

We will leverage Theorem A.1 to show that $\beta(t)$ is almost surely summable. Let $\mathcal{A}(0:t)$ be a sequence of $\sigma$-algebras measuring $\alpha(0:t)$, $\beta(0:t)$ and $\lambda(0:t)$. We will show that $\alpha(t)$ and $\beta(t)$ satisfy the hypotheses of Theorem A.1 with respect to $\mathcal{A}(0:t)$. Note that at each iteration, $\alpha(t)$ and $\beta(t)$ are fully determined by the trajectory up to $\lambda(t)$, so conditioning on $\mathcal{A}(0:t)$ is equivalent to conditioning on $\lambda(t)$, i.e., $E\{\alpha(t+1) \mid \mathcal{A}(0:t)\} = E\{\alpha(t+1) \mid \lambda(t)\}$. Then we can write:
$$E\{\alpha(t+1) \mid \mathcal{A}(0:t)\} = E\{\alpha(t+1) \mid \lambda(t), \alpha(t) = 0\}P\{\alpha(t) = 0\} + E\{\alpha(t+1) \mid \lambda(t), \alpha(t) > 0\}P\{\alpha(t) > 0\}. \quad (45)$$
From equation (45), we will derive that $E\{\alpha(t+1) \mid \mathcal{A}(0:t)\} \leq \alpha(t) - \beta(t)$, which is the remaining hypothesis of Theorem A.1.

On one hand, observe that if $\alpha(t) = 0$ then $\mathbb{I}\{D^\star_p - g^{best}(t) > \eta S^2/2 + \rho\} = 0$: either the indicator itself vanishes, or $\|\lambda(t) - \lambda^\star_p\|^2 = 0$, in which case $g(t) = D^\star_p$ and the optimality gap is below the threshold, so the indicator also evaluates to 0. Since $g^{best}$ is non-decreasing, the indicator then remains 0 at time $t+1$, so $\alpha(t+1) = 0$. Moreover, if $\alpha(t) = 0$, it must be that $\beta(t) = 0$. Then, trivially, $E\{\alpha(t+1) \mid \lambda(t), \alpha(t) = 0\} = 0 = \alpha(t) - \beta(t)$.

On the other hand, when $\alpha(t) > 0$:
$$E[\alpha(t+1) \mid \lambda(t), \alpha(t) > 0] = E\left\{\|\lambda(t+1) - \lambda^\star_p\|^2 \, \mathbb{I}\left\{D^\star_p - g^{best}(t+1) > \frac{\eta S^2}{2} + \rho\right\} \,\Big|\, \lambda(t)\right\} \quad (46)$$
$$\leq E\left\{\|\lambda(t+1) - \lambda^\star_p\|^2 \mid \lambda(t)\right\}, \quad (48)$$
where we used the definition of $\alpha(t+1)$ and the fact that the indicator function is not larger than 1. Then, from Proposition A.3 we have:
$$E[\alpha(t+1) \mid \lambda(t), \alpha(t) > 0] \leq \|\lambda(t) - \lambda^\star_p\|^2 + \eta^2 S^2 - 2\eta(D^\star_p - g(t) - \rho) \quad (49)$$
$$= \alpha(t) - \beta(t), \quad (50)$$
where the last equality comes from the fact that $\alpha(t) > 0$ implies $\mathbb{I}\{D^\star_p - g^{best}(t) > \eta S^2/2 + \rho\} = 1$. This means that we can write equation (45) as:
$$E\{\alpha(t+1) \mid \mathcal{A}(0:t)\} \leq [\alpha(t) - \beta(t)](P\{\alpha(t) = 0\} + P\{\alpha(t) > 0\}) = \alpha(t) - \beta(t), \quad (51)$$
which shows that $\alpha(t)$ and $\beta(t)$ satisfy the hypotheses of Theorem A.1. Then, $\beta(t)$ is almost surely summable, which implies:
$$\liminf_{t \to \infty} \left[2\eta(D^\star_p - g(t) - \rho) - \eta^2 S^2\right] \mathbb{I}\left\{D^\star_p - g^{best}(t) > \frac{\eta S^2}{2} + \rho\right\} = 0 \quad \text{a.s.}$$
This holds if either $D^\star_p - g^{best}(t) \leq \eta S^2/2 + \rho$ for some $t$, or if $\liminf_{t\to\infty} 2\eta(D^\star_p - g(t) - \rho) - \eta^2 S^2 = 0$; in both cases $g^{best}_p(t|\lambda(t_0))$ eventually exceeds $D^\star_p - \eta S^2/2 - \rho$, which concludes the proof.

A.14.1 PROOF OF PROPOSITION A.3

Proposition A.3. Consider the stochastic supergradient ascent algorithm from Section 4 using a constant step size $\eta > 0$. Then,
$$E\{\|\lambda(t+1) - \lambda^\star_p\|^2 \mid \lambda(t)\} \leq \|\lambda(t) - \lambda^\star_p\|^2 + \eta^2 S^2 - 2\eta(D^\star_p - g(t) - \rho). \quad (52)$$

Let $\hat{s}(t)$ denote the approximate stochastic supergradient, whose $i$-th component is $\hat{\ell}_i(f_{\theta(t)})$. From the definition of $\lambda(t+1)$:
$$\|\lambda(t+1) - \lambda^\star_p\|^2 = \|[\lambda(t) + \eta\hat{s}(t)]_+ - \lambda^\star_p\|^2 \leq \|\lambda(t) - \lambda^\star_p + \eta\hat{s}(t)\|^2 = \|\lambda(t) - \lambda^\star_p\|^2 + \eta^2\|\hat{s}(t)\|^2 + 2\eta\hat{s}(t)^T(\lambda(t) - \lambda^\star_p), \quad (53)$$
where we used the fact that setting the negative components of $\lambda(t) + \eta\hat{s}(t)$ to 0 can only decrease the distance to the nonnegative vector $\lambda^\star_p$, and then expanded the square. Note that for a given $\lambda(t)$, the relations in (53) hold for all realizations of $\hat{s}(t)$. Thus, the expectation of $\|\lambda(t+1) - \lambda^\star_p\|^2$, conditioned on $\lambda(t)$, satisfies:
$$E\{\|\lambda(t+1) - \lambda^\star_p\|^2 \mid \lambda(t)\} \leq \|\lambda(t) - \lambda^\star_p\|^2 + \eta^2 E\{\|\hat{s}(t)\|^2 \mid \lambda(t)\} + 2\eta E\{\hat{s}(t) \mid \lambda(t)\}^T(\lambda(t) - \lambda^\star_p). \quad (54)$$
Furthermore, the stochastic supergradient $\hat{s}(t)$ yields, on average, an approximate ascent direction of the dual function $g_p$:
$$E\{\hat{s}(t) \mid \lambda(t)\}^T(\lambda(t) - \lambda) \leq g(t) - g_p(\lambda) + \rho. \quad (55)$$
Evaluating the previous inequality at $\lambda^\star_p$ and combining it with equation (54) we obtain:
$$E\{\|\lambda(t+1) - \lambda^\star_p\|^2 \mid \lambda(t)\} \leq \|\lambda(t) - \lambda^\star_p\|^2 + \eta^2 S^2 + 2\eta(g(t) - D^\star_p + \rho), \quad (56)$$
which concludes the proof.
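As a quick numerical sanity check of the recursion just proved, the following Monte Carlo sketch verifies inequality (52) at a fixed dual iterate of the same toy program used above. The noise scale and the evaluation point are arbitrary choices; here $\rho = 0$ since the toy supergradient is unbiased.

```python
# Monte Carlo check of the recursion (52) for min (x-2)^2 s.t. x <= 1,
# at the fixed iterate lam_t = 0.5 (lam* = 2, D* = 1). S^2 estimates the
# second-order moment of the noisy supergradients; rho = 0 (unbiased).
import numpy as np

rng = np.random.default_rng(1)
eta, lam_t, lam_star, D_star = 0.05, 0.5, 2.0, 1.0
g_t = lam_t - lam_t ** 2 / 4                  # dual value g(lam_t)
x_t = 2 - lam_t / 2                           # Lagrangian minimizer at lam_t

s_hat = (x_t - 1) + rng.normal(0, 0.1, size=1_000_000)   # noisy supergradients
lam_next = np.maximum(0.0, lam_t + eta * s_hat)          # projected ascent step
lhs = np.mean((lam_next - lam_star) ** 2)                # E|lam(t+1) - lam*|^2
S2 = np.mean(s_hat ** 2)                                 # second-order moment
rhs = (lam_t - lam_star) ** 2 + eta ** 2 * S2 - 2 * eta * (D_star - g_t)
assert lhs <= rhs + 1e-6
print(f"lhs = {lhs:.5f} <= rhs = {rhs:.5f}")
```

The slack between the two sides reflects the looseness of the supergradient inequality (55) away from the optimum.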
A.15 PROOF OF PROPOSITION 4.1

We bound the distance between $\ell(\phi(\lambda^\star_u))$ and $\ell(f_\theta(\lambda_{best}))$ by partitioning it into terms that we have previously analyzed in Corollary 3.2 and Proposition 3.3:
$$\|\ell(\phi(\lambda^\star_u)) - \ell(f_\theta(\lambda_{best}))\|_2 \leq \|\ell(\phi(\lambda^\star_u)) - \ell(\phi(\lambda_{best}))\|_2 + \|\ell(\phi(\lambda_{best})) - \ell(f_\theta(\lambda_{best}))\|_2.$$

The first term is of the same nature as the one analyzed in Corollary 3.2, since it characterizes a perturbation in the dual variables of the unparametrized problem. Thus, it can be bounded using the curvature of the dual function (Lemma 3.1) together with the sub-optimality of $\lambda_{best}$ with respect to $\lambda^\star_p$. We denote by $B_{\lambda_{best}}$ the segment connecting $\lambda_{best}$ and $\lambda^\star_u$, and by $\mu_g$ the strong concavity constant of $g_u$ on $B_{\lambda_{best}}$.

Using Lemma A.1 and the fact that $g_p(\lambda^\star_p) \geq g_u(\lambda^\star_u)$, we obtain:
$$\|\lambda_{best} - \lambda^\star_u\|_2^2 \leq \frac{2}{\mu_g}\left(g_u(\lambda^\star_u) - g_u(\lambda_{best})\right) \leq \frac{2}{\mu_g}\left(g_p(\lambda^\star_p) - g_p(\lambda_{best}) + M\nu(1 + \|\lambda_{best}\|_1)\right).$$
Then, leveraging the almost sure convergence shown in Lemma 4.1, namely $g_p(\lambda^\star_p) - g_p(\lambda_{best}) \leq \eta S^2/2 + \rho$, we have:
$$\|\lambda_{best} - \lambda^\star_u\|_2^2 \leq \frac{2}{\mu_g}\left(M\nu(1 + \|\lambda_{best}\|_1) + \frac{\eta S^2}{2} + \rho\right). \quad (57)$$
Note that equation (57) corresponds to the bound used in the proof of Corollary 3.2, amplified by the sub-optimality of $\lambda_{best}$ with respect to $\lambda^\star_p$. Then, since the gradient $\nabla_\lambda g_u(\lambda) = \ell(\phi(\lambda))$ is $\beta_g$-Lipschitz continuous:
$$\|\ell(\phi(\lambda_{best})) - \ell(\phi(\lambda^\star_u))\|_2^2 = \|\nabla_\lambda g_u(\lambda_{best}) - \nabla_\lambda g_u(\lambda^\star_u)\|_2^2 \quad (58)$$
$$\leq \beta_g^2 \|\lambda_{best} - \lambda^\star_u\|_2^2 \quad (59)$$
$$\leq \frac{2\beta_g^2}{\mu_g}\left(M\nu(1 + \|\lambda_{best}\|_1) + \frac{\eta S^2}{2} + \rho\right), \quad (60)$$
which completes the first part of the proof.

The term $\|\ell(\phi(\lambda_{best})) - \ell(f_\theta(\lambda_{best}))\|_2$ captures a perturbation of the function class for a fixed dual variable, and can be bounded by leveraging the perturbation analysis of Proposition 3.3. Let $\tilde{\epsilon}_u = -\ell(\phi(\lambda_{best}))$ and $\tilde{\epsilon}_p = -\ell(f_\theta(\lambda_{best}))$. First note that the duality between smoothness and strong convexity detailed in Corollary A.1 implies that $P(\epsilon)$ is strongly convex with constant $1/\beta_g$ on the region of interest. Then, as in Proposition A.2, we can bound the distance between the optimal values associated with the perturbations $\tilde{\epsilon}_p$ and $\tilde{\epsilon}_u$:
$$P(\tilde{\epsilon}_p) - P(\tilde{\epsilon}_u) \leq M\nu(1 + \|\lambda_{best}\|_1) + \lambda_{best}^T(\tilde{\epsilon}_p - \tilde{\epsilon}_u). \quad (61)$$
The strong convexity of $P$ combined with equation (61) yields, with $\tilde{\epsilon} = \tilde{\epsilon}_p - \tilde{\epsilon}_u$:
$$M\nu(1 + \|\lambda_{best}\|_1) + \lambda_{best}^T\tilde{\epsilon} \geq \lambda_{best}^T\tilde{\epsilon} + \frac{1}{2\beta_g}\|\tilde{\epsilon}\|_2^2, \quad (62)$$
which implies:
$$\|\tilde{\epsilon}\|_2^2 \leq 2\beta_g M\nu(1 + \|\lambda_{best}\|_1). \quad (63)$$
We conclude the proof by summing the bounds in equations (60) and (63):
$$\|\ell(\phi(\lambda^\star_u)) - \ell(f_\theta(\lambda_{best}))\|_2 \leq \sqrt{2\beta_g M\nu(1 + \|\lambda_{best}\|_1)} + \beta_g\sqrt{\frac{2}{\mu_g}\left(M\nu(1 + \|\lambda_{best}\|_1) + \frac{\eta S^2}{2} + \rho\right)} \quad (64)$$
$$\leq \left(\sqrt{2\beta_g} + \beta_g\sqrt{2/\mu_g}\right)\sqrt{M\nu(1 + \|\lambda_{best}\|_1) + \frac{\eta S^2}{2} + \rho},$$
which concludes the proof.

A.16 ADDITIONAL EXPERIMENTAL DETAILS

We adopt the same data pre-processing steps as in (Chamon & Ribeiro, 2020) and use a two-layer neural network with 64 hidden nodes and sigmoid activations. The upper bound on the counterfactual fairness constraint is set to 0.001. We train this model for T = 400 iterations using ADAM with a batch size of 256, a primal learning rate of 0.1, and a weight decay of $10^{-4}$. The dual learning rate is set to 2.
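For reference, a minimal sketch of this primal-dual training loop is given below, assuming synthetic stand-in data and a simple mean-gap counterfactual fairness slack; the actual constraint and pre-processing follow (Chamon & Ribeiro, 2020) and are not reproduced here. Only the hyperparameters (T = 400, batch size 256, learning rates, weight decay, constraint level) come from the text above.

```python
# Sketch of the primal-dual (dual ascent) training loop of A.16.
# The data and the counterfactual construction are fabricated placeholders.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(2048, 20)                 # synthetic features (placeholder)
y = (X[:, 0] > 0).float()                 # synthetic labels (placeholder)
Xc = X.clone(); Xc[:, 0] *= -1            # hypothetical counterfactual inputs

model = nn.Sequential(nn.Linear(20, 64), nn.Sigmoid(), nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=0.1, weight_decay=1e-4)
lam = torch.zeros(1)                      # dual variable for the constraint
eps, eta_dual = 0.001, 2.0                # constraint level and dual step size

for t in range(400):                      # T = 400 primal-dual iterations
    idx = torch.randint(0, len(X), (256,))            # batch size 256
    out = model(X[idx]).squeeze(1)
    out_c = model(Xc[idx]).squeeze(1)
    loss = nn.functional.binary_cross_entropy_with_logits(out, y[idx])
    # Counterfactual fairness slack: mean prediction gap minus the bound.
    slack = (torch.sigmoid(out) - torch.sigmoid(out_c)).abs().mean() - eps
    opt.zero_grad()
    (loss + lam.detach() * slack).backward()          # primal ADAM step
    opt.step()
    with torch.no_grad():                             # dual ascent step
        lam = torch.clamp(lam + eta_dual * slack, min=0.0)
```

Note that the dual variable is updated by projected ascent on the (empirical) constraint slack, exactly the scheme whose last and best iterates are analyzed in Section 4.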