# Resilient Constrained Learning

Ignacio Hounie (University of Pennsylvania), Alejandro Ribeiro (University of Pennsylvania), Luiz F. O. Chamon (University of Stuttgart)

Machine learning solutions deployed in practice must satisfy multiple requirements beyond accuracy, such as fairness, robustness, or safety. These requirements are imposed during training either implicitly, using penalties, or explicitly, using constrained optimization methods based on Lagrangian duality. Either way, specifying requirements is hindered by the presence of compromises and limited prior knowledge about the data. Furthermore, their impact on performance can often only be evaluated by actually solving the learning problem. This paper presents a constrained learning approach that adapts the requirements while simultaneously solving the learning task. To do so, it relaxes the learning constraints in a way that contemplates how much they affect the task at hand by balancing the performance gains obtained from the relaxation against a user-defined cost of that relaxation. We call this approach resilient constrained learning after the term used to describe ecological systems that adapt to disruptions by modifying their operation. We show conditions under which this balance can be achieved and introduce a practical algorithm to compute it, for which we derive approximation and generalization guarantees. We showcase the advantages of this resilient learning method in image classification tasks involving multiple potential invariances and in heterogeneous federated learning.

1 Introduction

Requirements are integral to engineering and of growing interest in machine learning (ML) [1]. This growing interest is evident in, e.g., the advancement towards designing ML systems that are fair [2], robust [3], and safe [4], as well as numerous applications in which we want to attain good performance with respect to more than one metric [5–24]. Two concrete applications that we will use as examples (Section 5) are heterogeneous federated learning, where each client realizes a different loss due to distribution shifts (as in, e.g., [18–20]), and invariant learning, where we seek to achieve good performance even after the data has undergone a variety of transformations (as in, e.g., [15, 21–23]). The goal in these settings is to strike a compromise between some top-line objective metric and the requirements. To this end, an established approach is to combine the top-line and requirement violation metrics in a single training loss. This leads to penalty methods that are ubiquitous in ML, as attested by, e.g., fairness [25–28] and robustness [29, 30] applications. Another approach to balance objective and requirements is formulating and solving constrained learning problems. Though less typical in ML practice, they are not uncommon [5–18, 21, 24]. In particular, they have also been used in fair [31–34] and robust [12, 35] learning, to continue with the same set of applications. It is worth noting that penalty and constrained methods are not unrelated. They are in fact equivalent in convex optimization, in the sense that every constrained problem has an equivalent penalty-based formulation. A similar result holds in non-convex ML settings for sufficiently expressive parametrizations [7, 8]. In either case, the compromise between the objective and the different requirements is specified by (hyper)parameters, be they penalty coefficients or constraint levels (Section 2).
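To make the distinction concrete, the short sketch below contrasts the two ways a requirement typically enters training: added to the objective with a fixed penalty coefficient, or enforced through a constraint level whose multiplier is adapted by dual ascent. This is a minimal illustration under our own naming and step size, not the algorithm developed later in the paper.

```python
import numpy as np

def penalized_objective(obj, cons, gammas):
    """Penalty method: requirement losses enter the objective with fixed weights gamma_i."""
    return obj + float(np.dot(gammas, cons))

def dual_ascent_step(lambdas, cons, levels, eta=0.05):
    """Constrained (Lagrangian) method: multipliers adapt to the violation of the
    constraint levels c_i instead of being fixed a priori."""
    return np.maximum(np.asarray(lambdas) + eta * (np.asarray(cons) - np.asarray(levels)), 0.0)
```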
Finding penalty or constraint specifications that yield reasonable trade-offs is particularly difficult in ML, which often involves statistical requirements, such as fairness and robustness, that have intricate dependencies with the model and unknown data distributions. Case in point, consider invariant learning for image classification [23]. While we know that invariance to rotations and translations is desirable, we do not know how much invariance to these transformations is beneficial. This depends on the level of invariance of the data distribution, its prevalence in the dataset, and the capability of the model to represent invariant functions. The standard solution to this problem involves time-consuming and computationally expensive hyperparameter searches. This paper addresses this issue by automating the specification of constraint levels during training. To do so, it begins by interpreting constraints as nominal specifications that can be relaxed to find a better compromise between objective and requirements (Section 3). We call this approach resilient constrained learning after the term used to describe ecological systems that adapt to disruptions by modifying their operation [36, 37]. Our first contribution is the following insight: (C1) We relax constraints according to their relative difficulty, which we define as the sensitivity of the objective to perturbations of the constraint (Section 3.1). That difficult constraints should be relaxed more is a natural choice. The value of (C1) is in defining what is a difficult constraint. We then seek constraint levels such that the objective loss is relatively insensitive to changes in those levels. This relative insensitivity incorporates a user-defined cost that establishes a price for relaxing nominal specifications. The learning problem implied by (C1) seems challenging. Our next contribution is to show that it is not: (C2) We use duality and perturbation theory to present reformulations of the resilient learning problem from (C1) (Section 3.2) that lead to a practical resilient learning algorithm (Section 4) for which we derive statistical approximation bounds (Thm. 1). Our final contribution is the experimental evaluation of the resilient learning algorithm: (C3) We evaluate resilient formulations of federated learning and invariant learning (Section 5). Our experiments show that (C1)–(C2) effectively relax constraints according to their difficulty, leading to solutions that are less sensitive to the requirement specifications. They illustrate how resilient learning constitutes an interpretable and flexible approach to designing requirements while contemplating performance trade-offs.

2 Learning with Constraints

Let D0 be a distribution over data pairs (x, y) composed of the feature vector x ∈ X ⊆ R^d and the corresponding output y ∈ Y ⊆ R. Let fθ : X → R^k be the function associated with parameters θ ∈ Θ ⊆ R^p and ℓ0 : R^k × Y → [−B, B] be the loss that evaluates the fitness of the estimate fθ(x) relative to y. Let Fθ = {fθ | θ ∈ Θ} be the hypothesis class induced by these functions. Different from traditional (unconstrained) learning, we do not seek an fθ that simply minimizes E[ℓ0(fθ(x), y)]; we also account for its expected value with respect to additional losses ℓi : R^k × Y → [−B, B] and distributions Di, i = 1, …, m.
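As a concrete instance of this setup, the Di are often conditional distributions of protected groups and the ℓi per-group losses. The sketch below shows how the objective and constraint values of such a problem would be estimated from a labeled, group-annotated sample; the linear model, logistic loss, and function names are illustrative choices, not part of the paper's formulation.

```python
import numpy as np

def logistic_loss(scores, y):
    # Per-sample logistic loss for labels y in {-1, +1}.
    return np.log1p(np.exp(-y * scores))

def csl_statistics(theta, X, y, group):
    """Empirical estimates of E_{D0}[l0] and of E_{Di}[l_i] when each D_i is the
    conditional distribution of one protected group (illustrative fairness setup)."""
    scores = X @ theta
    objective = logistic_loss(scores, y).mean()                     # estimate of E_{D0}[l0]
    constraints = np.array([logistic_loss(scores[group == g], y[group == g]).mean()
                            for g in np.unique(group)])             # one estimate per D_i
    return objective, constraints
```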
These losses/distributions typically encode statistical requirements, such as robustness (where the Di denote distribution shifts or adversarial perturbations) and fairness (where the Di are conditional distributions of protected subgroups). Explicitly, the constrained statistical learning (CSL) problem is defined as

P* = min_{θ∈Θ} E_{(x,y)∼D0}[ℓ0(fθ(x), y)]   subject to   E_{(x,y)∼Di}[ℓi(fθ(x), y)] ≤ 0,  i = 1, …, m.    (P)

Without loss of generality, we stipulate the nominal constraint specification to be zero. Other values can be achieved by offsetting ℓi, i.e., E[ℓ(fθ(x), y)] ≤ c is obtained using ℓi(·) = ℓ(·) − c in (P). We also let P* take values on the extended real line R ∪ {∞} by defining P* = ∞ whenever (P) is infeasible, i.e., whenever for all θ ∈ Θ there exists i such that E_{Di}[ℓi(fθ(x), y)] > 0. A challenge in formulating meaningful CSL problems lies in specifying the constraints, i.e., the ℓi. Indeed, while a solution fθ always exists for unconstrained learning [m = 0 in (P)], there may be no θ that satisfies the constraints in (P). This issue is exacerbated when solving the problem using data: even arbitrarily good approximations of the expectations in (P) may introduce errors that hinder the estimation of P* (see Appendix A for a concrete example). In practice, landing on feasible requirements may require some constraints to be relaxed relative to their initial specification. Then, in lieu of (P), we would use the relaxed problem

P*(u) = min_{θ∈Θ} E_{(x,y)∼D0}[ℓ0(fθ(x), y)]   subject to   E_{(x,y)∼Di}[ℓi(fθ(x), y)] ≤ ui,  i = 1, …, m,    (Pu)

where u ∈ R^m_+ collects the relaxations ui ≥ 0. The value P*(u) is known as the perturbation function of (P) since it describes the effect of the relaxation u on the optimal value. Since P*(0) = P* and (P0) is equivalent to (P), this abuse of notation should not lead to confusion. It is immediate that P*(u) is a componentwise non-increasing function, i.e., that for comparable arguments vi ≤ wi for all i (denoted v ⪯ w), it holds that P*(v) ≥ P*(w). However, large relaxations u drive (Pu) away from (P), the learning problem of interest. Thus, relaxing (P) too much can be as detrimental as relaxing it too little (see example in Appendix A). The goal of this paper is to exploit properties of P*(u) together with a relaxation cost to strike a balance between these conflicting objectives. We call this balance a resilient version of (P).

3 Resilient Constrained Learning

We formulate resilient constrained learning using a functional form of (Pu). Explicitly, consider a convex function class F̄ ⊇ Fθ and define the relaxed functional problem

P̃*(u) = min_{ϕ∈F̄} E_{(x,y)∼D0}[ℓ0(ϕ(x), y)]   subject to   E_{(x,y)∼Di}[ℓi(ϕ(x), y)] ≤ ui,  i = 1, …, m.    (P̃u)

The difference between (Pu) and (P̃u) is that the latter does not rely on a parametric model. Instead, its solutions take values on the convex space of functions F̄. Still, if Fθ is a sufficiently rich parameterization of F̄ (see Section 4 for details), then P*(u) and P̃*(u) are close [38, Sec. 3]. Throughout the rest of the paper, we use the tilde (as in P̃) to identify these functional problems. The advantage of working with (P̃u) is that under mild conditions the perturbation function P̃* is convex. This holds readily if the losses ℓi are convex (e.g., quadratic or cross-entropy) [39, Chap. 5]. However, the perturbation function is also convex for a variety of non-convex programs, e.g., (P̃u) with non-convex losses, non-atomic Di, and decomposable F̄ (see Appendix B). For conciseness, we encapsulate this hypothesis in an assumption.

Assumption 1.
The perturbation function P (u) is a convex function of the relaxation u Rm +. A consequence of Assumption 1 is that the perturbation function P has a non-empty subdifferential at every point. Explicitly, let its subdifferential P (uo) at uo Rm + be defined as the set of vectors describing supporting hyperplanes at uo of the epigraph of P , i.e., P (uo) = n p Rm + P (v) P (uo) + p T (v uo), for all v Rm + o . (1) The elements of P (uo) are called subgradients of P at uo. If P is differentiable at uo, then it has a single subgradient that is equal to its gradient, i.e., P (uo) = { P (uo)}. In general, however, the subdifferential is a non-singleton set [40]. Further notice that since P (u) is componentwise nonpositive the subgradients are componentwise negative, pu 0. Next, we use this property to formalize resilient constrained learning as a compromise between reducing P (u) by increasing u and staying close to the original problem ( P0). Figure 1: Resilient equilibrium from Def. 1 for h(u) = u2/2. The shaded area indicates infeasible specifications: (a) nominal specification (u = 0) is feasible and easy to satisfy; (b) nominal specification is feasible but difficult to satisfy (close to infeasible); (c) nominal specification is infeasible. 3.1 Resilient Equilibrium Consider the effect of increasing the value of a specific relaxation ui in ( Pu) while keeping the rest unchanged. The solution of ( Pu ) is allowed to suffer higher losses on the constraints (ℓi for i 1), which may lead to a smaller objective loss (ℓ0). Hence, while larger relaxations are detrimental because they violate more the requirements, they are also beneficial because they reduce the objective loss. To balance these conflicting outcomes of constraint relaxations, we introduce a function h to capture their costs. Then, since relaxing both increases the costs and decreases the objective value, we conceptualize resilience as an equilibrium between these variations. Definition 1 (Resilient Equilibrium). Let h : Rm + R+ be a convex, differentiable, normalized (i.e., h(0) = 0), and componentwise increasing (i.e., h(v) < h(w) for v w) function. A resilient equilibrium of ( Pu) is a relaxation u satisfying h(u ) P (u ). (2) The resilient constrained learning problem amounts to solving ( Pu ), i.e., solving ( Pu) for a relaxation u that satisfies (2). We call this equilibrium resilient because it describes how far ( P0) can be relaxed before we start seeing diminishing returns. Indeed, u from (2) is such that relaxing by an additional ϵ 0 would incur in a relaxation cost at least h(u )T ϵ larger, whereas tightening to u ϵ would incur in an optimal value increase of at least the same h(u )T ϵ. Notice that resilient constrained learning is defined in terms of sensitivity. Indeed, the resilient equilibrium in Def. (1) specifies a learning task that is as sensitive to changes in its requirements as it is sensitive to changes in the relaxation cost. This has the marked advantage of being invariant to constant translations of ℓ0, as is also the case for solutions of ( Pu). Sensitivity also measures the difficulty of satisfying a constraint, since P (u) quantifies the impact of each constraint specification on the objective loss. Hence, the equilibrium in (2) has the desirable characteristic of affecting stringent requirements more. Two important properties of the equilibrium in Def. 1 are summarized next (proofs are provided in appendices D.1 and D.2). Proposition 1. Under Ass. 1, the resilient equilibrium (2) exists. 
If h is strictly convex, it is unique. Proposition 2. Let v, w Rm + be such that [v]i = [w]i, for i = j, and [v]j < [w]j. Under Ass. 1, (i) [ h(v)]j [ h(w)]j and (ii) [ pv]j [ pw]j for all pv P (v) and pw P (w). Prop. 1 shows that the equilibrium in (2) is well-posed. Prop. 2 states that, all things being equal, relaxing the j-th constraint increases the sensitivity of the cost h to it, while simultaneously decreasing its effect on the objective value P . To illustrate these points better, Fig. 1 considers prototypical learning problems with a single constraint [m = 1 in ( Pu)], differentiable P (u), and relaxation cost h(u) = u2/2. According to (2), the resilient relaxations are obtained at h (u ) = u = P (u ), where we let g (u) = dg(u)/du denote the derivative of the function g. As per Prop. 2, h is increasing and P is decreasing. Further observe that the sensitivity P (u) diverges as u approaches the value that makes the problem infeasible and vanishes as the constraint is relaxed. These two curves must therefore intersect, claimed by Prop. 1. The illustrations in Fig. 1 represent progressively more sensitive/difficult problems. In Fig. 1a, the nominal problem (u = 0) is easy to solve (small P (0)), making the resilient equilibrium u a 0. The original problem and the relaxed problem are essentially equivalent. In Fig. 1b, the nominal problem is difficult to solve (large P (0)), inducing a significant change in the constraint and objective losses. In Fig. 1c, the nominal problem is unsolvable, but the resilient relaxation recovers a feasible problem. Having motivated the resilient compromise in Def. 1 and proven that it is well-posed, we proceed to obtain equivalent formulations that show it is also computationally tractable. These formulations are used to show traditional learning tasks that can be seen as resilient learning problems (Sec. 3.3), before deriving a practical algorithm to tackle the resilient learning problem ( Pu ) (Sec. 4). 3.2 Equivalent Formulations While we have shown that the resilient relaxation from Def. 1 exists (Prop. 1), the equilibrium in (2) does not provide a straightforward way to compute it. In this section, we show two more computationally amenable reformulations of Def. 1 by relating u to the Lagrange multipliers of ( Pu) and to the solution of a related optimization problem. Let λ Rm + collect multipliers λi associated to the i-th constraint of ( Pu) and define the Lagrangian L(ϕ, λ; u) = ED0 ℓ0 ϕ(x), y + i=1 λi h EDi ℓi ϕ(x), y ui i . (3) Based on (3), define dual functions g(λ; u) and dual problems D (u) for given constraint level u, D (u) = max λ 0 g(λ; u) = max λ 0 min ϕ F L(ϕ, λ; u). ( Du) While D P (weak duality) in general, there are cases in which D = P (strong duality), e.g., in convex optimization. The constrained ( Pu) is then essentially equivalent to ( Du) that can be tackled by solving an unconstrained, penalized problem (minimizing (3) with respect to ϕ and u), while adapting the weights λi of the penalties (maximizing (3) with respect to λ). This is, in fact, the basis of operation of primal-dual constrained optimization algorithms [39, Chapter 5]. The penalties λ (u) that achieve this equivalence, known as Lagrange multipliers, are solutions of ( Du) and subgradients of the perturbation function. Strong duality also holds under the milder Assumption 1 and a constraint qualification requirement stated next. Assumption 2 . 
There exist a finite relaxation u and a function ϕ F such that all constraints are met with margin c > 0, i.e., EDi ℓi(ϕ(x), y) ui c, for all i = 1, . . . , m. Note that since u can be large, this requirement is mild. We can thus leverage strong duality to provide an alternative definition of the resilient relaxation in (2). Proposition 3. Let λ (u) be a solution of ( Du) for the relaxation u. Under Assumptions 1 and 2, P (u ) = D (u ) and any u Rm + such that h(u ) = λ (u ) is a resilient relaxation of ( Pu). See appendix D.3 for proof. Whereas Def. 1 specifies the resilience relaxation in terms of the marginal performance gains P , Prop. 3 formulates it in terms of the penalties λ. Indeed, it establishes that the penalty λ (u ) that achieves the resilient relaxation of ( Pu) is encoded in the relaxation cost as h(u ). That is not to say that h directly weighs the requirements as in fixed penalty formulations, since the equilibrium in Prop. 3 also depends on the relative difficulty of satisfying those requirements. Though Prop. 3 does not claim that all resilient relaxations satisfy the Lagrange multiplier relation, this holds under additional continuity conditions on ( Pu) [40]. Prop. 3 already provides a more computationally amenable definition of resilient relaxation, given that Lagrange multipliers are often computed while solving ( Pu), e.g., when using primal-dual methods. Yet, an even more straightforward formulation of Def. 1 is obtained by rearranging (2) as 0 ( P + h)(u ), as we did in the proof of Prop. 1, which suggests that u is related to the minimizers of P + h. This is formalized in the following proposition (see proof in appendix D.4). Proposition 4. A relaxation u satisfies the resilient equilibrium (2) if and only if it is a solution of P R = min ϕ F, u Rm + E(x,y) D0 h ℓ0 ϕ(x), y i + h(u) subject to E(x,y) Di h ℓi ϕ(x), y i ui, i = 1, . . . , m. ( P-RES) The corresponding minimizer ϕ is a resilient solution of the functional learning problem ( Pu). Prop. 4 shows that it is possible to simultaneously find a resilient relaxation u and solve the corresponding resilient learning problem. Indeed, a resilient solution of a constrained learning problem can be obtained by incorporating the relaxation cost in its objective. This is reminiscent of first-phase solvers found in interior-point methods used to tackle convex optimization problems [39, Chap. 11]. Note, once again, that this is not the same as directly adding the constraints in the objective as penalties or regularizations. Indeed, recall from Def. 1 that the resilient relaxation balances the marginal effects of h on P and not ℓ0. Before using Prop. 4 to introduce a practical resilient learning algorithm and its approximation and generalization properties (Sec. 4), we use these equivalent formulations to relate resilient learning to classical learning tasks. 3.3 Relation to Classical Learning Tasks (Un)constrained learning: Both traditional unconstrained and constrained learning can be seen as limiting cases of resilient learning. Indeed, if h 0, u has no effect on the objective of ( P-RES). We can then take ui = B for i = 1, . . . , m, which reduces ( P-RES) to an unconstrained learning problem (recall that all losses are [ B, B]-valued). On the other hand, if h is the indicator function of the non-negative orthant (i.e., h(u) = 0 for u 0 and h(u) = , otherwise), then it must be that u = 0 in ( P-RES) as long as this specification is feasible. Neither of these relaxation costs satisfy the conditions from Def. 
1, since they are not componentwise increasing or differentiable, respectively. Still, there exists valid relaxation costs that approximate these problems arbitrarily well (see Appendix D.5). Penalty-based methods: Rather than the constrained formulations in (P) or ( Pu), requirements are often incorporated directly into the objective of learning tasks using fixed penalties as in minimize ϕ F ED0 ℓ0 ϕ(x), y + i=1 γi EDi ℓi ϕ(x), y , (4) where the fixed γi > 0 represent the relative importance of the requirements. It is immediate from Prop. 3 that resilient learning with a linear relaxation cost h(u) = P i γiui is equivalent to (4) as long as EDi ℓi ϕ (x), y 0. From Def. 1, this is the same as fixing the marginal effect of relaxations on the perturbation function. Soft-margin SVM: Linear relaxation costs are also found in soft-margin SVM formulations, namely minimize θ Θ, u Rm + 1 2 θ 2 + γ subject to 1 yiθT xi ui, i = 1, . . . , m. Though written here in its parametrized form, the hypothesis class underlying (PI) (namely, linear classifiers) is convex. It can therefore be seen as an instance of ( P-RES) where each loss ℓi represent a classification requirement on an individual sample. Soft-margin SVM is therefore a resilient learning problem as opposed to its hard-margin version, where ui = 0 for i = 1, . . . , m. 4 Resilient Constrained Learning Algorithm We defined the resilient equilibrium u in (2) and the equivalent resilient learning problems ( Pu ) and ( P-RES) in the context of the convex functional space F. Contrary to (Pu), that are defined on the finite dimensional Fθ, these problems are not amenable to numerical solutions. Nevertheless, we have argued that as long as Fθ is a good approximation of F, the values of (Pu) and ( Pu) are close. We use this idea to obtain a practical primal-dual algorithm (Alg. 1) to approximate solutions Algorithm 1 Resilient Constrained Learning (η, ηu, ηλ > 0; θ(0) Θ; λ(0), u(0) Rm +) for t = 1, . . . , T: θ1 = θ(t 1) θn+1 = θn η θ h ℓ0 fθn(xn,0), yn,0 + Pm i=1 λiℓi fθn(xn,i), yn,i i , n = 1, . . . , N θ(t) = θN+1 u(t) = h u(t 1) ηu h u(t 1) λ(t 1) i + λ(t) i = h λ(t 1) i + ηλ 1 N PN n=1 ℓi fθ(t 1)(xn,i), yn,i u(t 1) i i + , i = 1, . . . , m of ( P-RES) (and thus ( Pu )). The main result of this section (Thm. 1) establishes how good this approximation can be when using only samples from the Di. Explicitly, consider a set of N i.i.d. sample pairs (xn,i, yn,i) drawn from Di and define the parametrized, empirical Lagrangian of the resilient learning problem ( P-RES) as ˆLθ(θ, λ; u) = h(u) + 1 n=1 ℓ0 fθ(xn,0), yn,0 + n=1 ℓi fθ(xn,i), yn,i ui The parametrized, empirical dual problem of ( P-RES) is then given by ˆD R = max λ Rm + min θ Θ,u Rm + ˆLθ(θ, λ; u). (ˆD-RES) The gap between ˆD R and the optimal value P R of the original problem (Pu) can be bounded under the following assumptions. Assumption 3 . The loss functions ℓi, i = 0 . . . m, are M-Lipschitz continuous. Assumption 4 . For every ϕ F, there exists θ Θ such that EDi |ϕ(x) fθ (x)| ν, for all i = 0, . . . , m. Assumption 5 . There exists ξ(N, δ) 0 such that for all i = 0, . . . , m and all θ Θ, EDi ℓi(fθ(x), y) 1 n=1 ℓi fθ(xn,i), yn,i ξ(N, δ) with probability 1 δ over draws of {(xn,i, yn,i)}. Although the parameterized functional space Fθ can be non-convex, as is the case for neural networks, Ass. 4 requires that it is rich in the sense that the distance to its convex hull is bounded. When ξ exists for all δ > 0 and is a decreasing function of N, Ass. 
(5) describes the familiar uniform convergence from learning theory. Such generalization bounds can be derived based on bounded VC dimension, Rademacher complexity, or algorithmic stability, to name a few [41]. We can now state the main result of this section, whose proof is provided in Appendix C. Theorem 1. Consider P R and u from ( P-RES) and ˆD R from (ˆD-RES). Under Ass. 1 5, it holds with probability of 1 (3m + 2)δ that P R ˆD R h(u + 1 Mν) h(u ) + Mν + (1 + )ξ(N, δ). (6) Thm. 1 suggests that resilient learning problems can be tackled using (ˆD-RES), which can be solved using saddle point methods, as in Alg. 1. Even if the inner minimization problem is non-convex, dual ascent methods can be shown to converge as long as its solution can be well approximated using stochastic gradient descent [38, Thm. 2], as is often the case for overparametrized NNs [42 44]. A more detailed discussion on the algorithm can be found in Appendix E. Next, we showcase its performance and our theoretical results in two learning tasks. 5 Numerical Experiments We now showcase the numerical properties of resilient constrained learning. As illustrative case studies, we consider federated learning under class imbalance and invariance constrained learning. Additional experimental results for both setups are included in Appendix F and H. We also include ablations on the hyperparameters of the method, including dual and perturbation learning rates, and the choice of the perturbation cost function h. 5.1 Heterogeneous Federated Learning Federated learning [45] entails learning a common model in a distributed manner by leveraging data samples from different clients. Usually, average performance across all clients is optimized under the assumption that data from different clients is identically distributed. In practice, heterogeneity in local data distributions can lead to uneven performance across clients [46,47]. Since this may be undesirable, a sensible requirement in this setting is that the loss of the model is similar for all clients. Let Dc be the distribution of data pairs for Client c, and Rc(fθ) = E(x,y) Dc [ℓ(fθ(x), y)] its statistical risk. We denote the average performance as R(fθ) := (1/C) PC i=1 Rc(fθ), where C is the number of clients. As proposed in [18] heterogeneity issues can be tackled by imposing a proximity constraint between the performance of each client Rc, and the loss averaged over all clients R. This leads to the constrained learning problem min θ Θ . R(fθ) (P-FL) s. to Rc(fθ) R(fθ) ϵ 0, c = 1, . . . , C, where ϵ > 0 is a small (fixed) positive scalar. It is ready to see that this problem is of the form in (P); see Appendix F. Note that the constraint in (P-FL) is an asymmetric (and relaxed) version of the equality of error rates common in fairness literature. As shown by [18] this problem can be solved via a primal-dual approach in a privacy-preserving manner and with a negligible communication overhead, that involves sharing dual variables. However, because the heterogeneity of the data distribution across clients is unknown a priori, it can be challenging to specify a single constraint level ϵ for all clients that results in a reasonable trade-off between overall and individual client performance. In addition, the performance of all clients may be significantly reduced by a few clients whose poor performance make the constraint hard to satisfy. The heuristic adopted in [18] is to clip dual variables that exceed a fixed value. 
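The resilient alternative described next replaces this dual-clipping heuristic with the relaxation update of Alg. 1. For reference, here is a compact sketch of one outer iteration of Alg. 1 (primal step on θ, relaxation step on u, dual step on λ), instantiated for a quadratic relaxation cost h(u) = α‖u‖²; the gradient/loss function handles, step sizes, and interface are placeholders rather than the experimental configuration.

```python
import numpy as np

def resilient_round(theta, u, lam, data, cons_data, grad_l0, grad_li, li,
                    alpha=1.0, eta=0.01, eta_u=0.01, eta_lam=0.01, n_primal=10):
    """One outer iteration of Alg. 1 with h(u) = alpha * ||u||^2 (illustrative sketch).

    grad_l0 / grad_li return (stochastic) gradients of the objective / i-th constraint
    loss w.r.t. theta, and li returns the empirical i-th constraint loss."""
    m = len(lam)
    # (i) primal descent on the empirical Lagrangian w.r.t. theta
    for _ in range(n_primal):
        g = grad_l0(theta, data) + sum(lam[i] * grad_li(i, theta, cons_data[i]) for i in range(m))
        theta = theta - eta * g
    # (ii) relaxation update: u <- [u - eta_u * (grad h(u) - lambda)]_+  with grad h(u) = 2*alpha*u
    u = np.maximum(u - eta_u * (2.0 * alpha * u - lam), 0.0)
    # (iii) dual ascent on the constraint slacks: lambda_i <- [lambda_i + eta_lam * (l_i - u_i)]_+
    slack = np.array([li(i, theta, cons_data[i]) for i in range(m)]) - u
    lam = np.maximum(lam + eta_lam * slack, 0.0)
    return theta, u, lam
```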
In contrast, we propose to solve a resilient version of (P-FL) by including a relaxation uc, c = 1, …, C, for each constraint and a quadratic relaxation cost h(u) = α‖u‖₂². The resilient version of problem (P-FL) can also be solved in a privacy-preserving manner as long as h(u) is separable in each of the constraint perturbations uc. In that case, uc can be updated locally by client c. Following the setup of [18], heterogeneity across clients is generated through class imbalance. More specifically, samples from different classes are distributed among clients using a Dirichlet distribution. Experimental and algorithmic details along with privacy considerations are presented in Appendix F.

Constraint relaxation and relative difficulty: Our approach relaxes constraints according to their relative difficulty. In the context of class imbalance, the majority classes exert a stronger influence on the overall objective (average performance). Therefore, the overall performance R tends to favor smaller losses in the majority classes rather than in the minority ones. As a result, meeting the proximity constraint in (P-FL) for clients whose datasets contain a higher fraction of training samples from the minority classes can deteriorate performance. In Figure 2(left), we demonstrate that the constraint is effectively relaxed more for these clients.

Controlling the performance vs. relaxation trade-off: Through the choice of the relaxation cost function, we can control the trade-off between relaxing the equalization requirements from (P-FL) and performance. To illustrate this, we perform an ablation on the coefficient α in the quadratic relaxation cost h(u) = α‖u‖₂². As shown in Figure 2(middle), smaller values of α enable better performance at the cost of larger relaxations. As α increases, the relaxations vanish and the problem approaches the original constrained problem. In this manner, the resilient approach enables navigating this trade-off by changing a single hyperparameter. Still, the optimal relaxation for each client is determined by their local data distribution, i.e., the relative difficulty of the constraint. In Appendix F.4 we also perform an ablation on the cost function by changing the perturbation norm.

Figure 2: (Left) Constraint relaxation and relative difficulty for federated learning under heterogeneous class imbalance across clients (crosses). We plot the perturbation uc against the fraction of the dataset of client c from the minority class, which is associated with how difficult it is to satisfy the constraint, since minority classes typically have higher loss. (Middle) Relaxation cost parameter (h(u) = α‖u‖₂²) vs. final training loss and perturbation norm. (Right) Constraint violations on train and test sets.

Constraint violation and generalization: Relaxing stringent requirements not only makes the empirical problem easier to solve, but it can also lead to a better empirical approximation of the underlying statistical problem. As shown in Figure 2(right), the resilient approach has a smaller fraction of clients that are infeasible at the end of training. In addition, when constraints are evaluated on the test set, larger constraint violations are observed for some (outlier) clients in the hard-constrained approach (uc = 0).
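The class-imbalanced clients referenced above are obtained by splitting each class across clients with Dirichlet-distributed proportions. A minimal sketch of such a partition is given below; the concentration parameter beta and the helper name are ours, and the values actually used are those in Appendix F.

```python
import numpy as np

def dirichlet_partition(labels, n_clients, beta=0.5, seed=0):
    """Split sample indices into n_clients shards with Dirichlet class proportions.
    Smaller beta -> more heterogeneous (imbalanced) clients. Illustrative only."""
    rng = np.random.default_rng(seed)
    client_idx = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        props = rng.dirichlet(beta * np.ones(n_clients))        # share of class c per client
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)   # split points within the class
        for client, shard in enumerate(np.split(idx, cuts)):
            client_idx[client].extend(shard.tolist())
    return client_idx
```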
In Appendix F.4 we show that this holds across different problem settings. In addition, we also illustrate the fact that this is partly due to generalization issues, i.e., that overly stringent requirements can harm generalization. This observation is in line with the constrained learning theory developed in [7, 8].

Comparing client performances: Our method should sacrifice the performance of outliers, which represent hard-to-satisfy constraints, in order to benefit the performance of the majority of the clients. In order to showcase this, we sort clients according to their test accuracy, Acc[1], Acc[2], …, Acc[C]. We report the fraction of clients in the resilient method that outperform equally ranked clients for the constrained baseline [18] (Accres[c] ≥ Accconst[c]), as well as the average and maximum increases and decreases in performance. As shown in Table 1, the majority of clients achieve a small increase in performance, while a small fraction experiences a larger decrease. In Appendix F.4 we include plots of ordered accuracies to further illustrate this.

| Dataset | Imbalance Ratio | Improved over constrained (%) | Max Improvement | Max Deterioration |
|---|---|---|---|---|
| CIFAR10 | 10 | 77 | 1.5 | 10.0 |
| CIFAR10 | 20 | 79 | 2.1 | 9.1 |
| F-MNIST | 10 | 92 | 4.8 | 23.3 |
| F-MNIST | 20 | 94 | 2.6 | 19.4 |

Table 1: Changes in accuracy for equally ranked clients for the resilient method compared to the constrained baseline [18]. We report the fraction of equally ranked clients which improved their accuracy, along with the mean and maximum change across all clients that improve and decrease their accuracy, respectively. Performance improves for most clients, although at the cost of a decrease in performance for a few outliers.

5.2 Invariance Constrained Learning

As in [21], we impose a constraint on the loss on transformed inputs as a means of achieving robustness to transformations which may describe invariances or symmetries of the data. Explicitly,

min_{θ∈Θ} E_{(x,y)∼D}[ℓ(fθ(x), y)]   subject to   E_{(x,y)∼D}[max_{g∈Gi} ℓ(fθ(gx), y)] − ϵ ≤ 0,  i = 1, …, m,    (P-IC)

where Gi is a set of transformations g : X → X such as image rotations, scalings, or translations.

Figure 3: (Left) Constraint relaxation and relative difficulty for synthetically invariant datasets. Bars denote the perturbation ui associated with different transformations, and the x-axis shows the synthetic invariance of the dataset. (Right) Sensitivity (dual variables) for synthetically invariant datasets. Bars correspond to the values of dual variables (for all constraints) at the end of training. We compare the resilient and constrained approaches across different synthetic datasets.

The constraint can be approximated by sampling augmentations using MCMC methods (see [12, 21] for details; a simplified sampling-based estimate is sketched below). However, it is challenging to specify the transformation sets Gi and constraint levels ϵi adequately, since they depend on the invariances of the unknown data distribution and whether the function class is sufficiently expressive to capture these invariances. Therefore, we propose to solve a resilient version of problem (P-IC) using h(u) = α‖u‖₂² as a perturbation cost. We showcase our approach on datasets with artificial invariances, following the setup of [23].
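Before turning to the experimental details, here is a simplified sketch of how the constraint in (P-IC) can be estimated by sampling transformations. The paper approximates the expectation of the max via MCMC-sampled augmentations [12, 21]; the plain uniform sampling, transform callables, and loss handle below are illustrative stand-ins.

```python
import numpy as np

def invariance_constraint_estimate(model_loss, X, y, transforms, n_samples=8, rng=None):
    """Monte-Carlo estimate of E[ max_g l(f(g x), y) ] for one transformation set G_i.

    `model_loss(Xt, y)` returns per-sample losses and `transforms` is a list of
    callables g(X), e.g., random rotations; both are placeholders."""
    if rng is None:
        rng = np.random.default_rng()
    sampled = rng.choice(len(transforms), size=n_samples, replace=True)
    per_aug = np.stack([model_loss(transforms[g](X), y) for g in sampled])  # (n_samples, n_points)
    return per_aug.max(axis=0).mean()   # max over sampled g, then average over the data
```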
Explicitly, we generate the synthetic datasets by applying either rotations, translations, or scalings to each sample in the MNIST [48] and Fashion-MNIST [49] datasets. The transformations are sampled from uniform distributions over the ranges detailed in Appendix H.

Constraint relaxation and relative difficulty: As shown in Figure 3(left), the relaxations (u) associated with synthetic invariances are considerably smaller than for other transformations. This indicates that when the transformations in the constraint correspond to a true invariance of the dataset, the constraint is easier to satisfy. On the other hand, when the transformations are not associated with synthetic invariances of the dataset, they are harder to satisfy and thus result in larger relaxations. This results in smaller dual variables, as shown in Figure 3(right).

Resilience improves performance: The resilient approach is able to handle the misspecification of invariance requirements, outperforming both the constrained and unconstrained approaches in terms of test accuracy. We include results for MNIST in Table 2 and F-MNIST in Appendix H. In addition, though it was not designed for that purpose, our approach shows similar performance to the invariance learning method Augerino [22].

| Dataset | Method | Rotated (180) | Rotated (90) | Translated | Scaled | Original |
|---|---|---|---|---|---|---|
| MNIST | Augerino | 97.78 ± 0.03 | 96.38 ± 0.00 | 94.65 ± 0.01 | 97.53 ± 0.00 | 98.44 ± 0.00 |
| MNIST | Unconstrained | 94.49 ± 0.12 | 96.25 ± 0.13 | 94.64 ± 0.20 | 97.47 ± 0.03 | 98.45 ± 0.06 |
| MNIST | Constrained | 94.55 ± 0.18 | 96.90 ± 0.07 | 93.74 ± 0.07 | 97.92 ± 0.15 | 98.74 ± 0.08 |
| MNIST | Resilient | 95.38 ± 0.18 | 97.19 ± 0.09 | 95.21 ± 0.15 | 98.20 ± 0.04 | 98.86 ± 0.02 |

Table 2: Classification accuracy for synthetically invariant MNIST. We use the same invariance constraint level ϵi = 0.1 for all transformations. We include the invariance learning method Augerino [22] as a baseline.

6 Conclusion

This paper introduced a method to specify learning constraints by balancing the marginal decrease in the objective value obtained from relaxation with the marginal increase in a relaxation cost. This resilient equilibrium has the effect of prioritizing the relaxation of constraints that are harder to satisfy. The paper also determined conditions under which this equilibrium exists and provided an algorithm to automatically find it during training, for which approximation and statistical guarantees were derived. Experimental validations showcased the advantages of resilient constrained learning for classification with invariance requirements and federated learning. Future work includes exploring different relaxation costs and applications to robustness against disturbances and outliers.

Acknowledgments and Disclosure of Funding

The work of A. Ribeiro and I. Hounie is supported by NSF-Simons MoDL, Award 2031985, NSF AI Institutes program, Award 2112665, and NSF HDR TRIPODS Award 1934960. The work of Dr. Chamon is supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy (EXC 2075-390740016).

[1] Z. Wan, X. Xia, D. Lo, and G. C. Murphy, How does machine learning change software development practices? IEEE Transactions on Software Engineering, vol. 47, no. 9, pp. 1857–1871, 2021. [2] N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, and A. Galstyan, A survey on bias and fairness in machine learning, ACM Computing Surveys (CSUR), vol. 54, no. 6, pp. 1–35, 2021. [3] S. H. Silva and P. Najafirad, Opportunities and challenges in deep learning adversarial robustness: A survey, arXiv preprint arXiv:2007.00753, 2020. [4] Q.
Yang, T. D. Simão, S. H. Tindemans, and M. T. Spaan, Safety-constrained reinforcement learning with a distributional safety critic, Machine Learning, vol. 112, no. 3, pp. 859 887, 2023. [5] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel, Fairness through awareness, in Proceedings of the 3rd innovations in theoretical computer science conference, 2012, pp. 214 226. [6] M. Donini, L. Oneto, S. Ben-David, J. S. Shawe-Taylor, and M. Pontil, Empirical risk minimization under fairness constraints, Advances in neural information processing systems, vol. 31, 2018. [7] L. Chamon and A. Ribeiro, Probably approximately correct constrained learning, Advances in Neural Information Processing Systems, vol. 33, 2020. [8] L. F. Chamon, S. Paternain, M. Calvo-Fullana, and A. Ribeiro, Constrained learning with non-convex losses, IEEE Transactions on Information Theory, 2022. [9] S. Paternain, L. Chamon, M. Calvo-Fullana, and A. Ribeiro, Constrained reinforcement learning has zero duality gap, Advances in Neural Information Processing Systems, vol. 32, 2019. [10] C. Tran, F. Fioretto, and P. Van Hentenryck, Differentially private and fair deep learning: A lagrangian dual approach, in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 11, 2021, pp. 9932 9939. [11] F. Fioretto, P. Van Hentenryck, T. W. Mak, C. Tran, F. Baldo, and M. Lombardi, Lagrangian duality for constrained deep learning, in Machine Learning and Knowledge Discovery in Databases. Applied Data Science and Demo Track: European Conference, ECML PKDD 2020, Ghent, Belgium, September 14 18, 2020, Proceedings, Part V. Springer, 2021, pp. 118 135. [12] A. Robey, L. Chamon, G. J. Pappas, H. Hassani, and A. Ribeiro, Adversarial robustness with semi-infinite constrained learning, Advances in Neural Information Processing Systems, vol. 34, pp. 6198 6215, 2021. [13] A. Robey, G. J. Pappas, and H. Hassani, Model-based domain generalization, Advances in Neural Information Processing Systems, vol. 34, pp. 20 210 20 229, 2021. [14] E. Diana, W. Gill, M. Kearns, K. Kenthapadi, and A. Roth, Minimax group fairness: Algorithms and experiments, in Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, 2021, pp. 66 76. [15] Y. Xu, A. Noy, M. Lin, Q. Qian, H. Li, and R. Jin, Wemix: How to better utilize data augmentation, ar Xiv preprint ar Xiv:2010.01267, 2020. [16] L. Peng, P. V. Giampouras, and R. Vidal, The ideal continual learner: An agent that never forgets, ar Xiv preprint ar Xiv:2305.00316, 2023. [17] J. Gallego-Posada, J. Ramirez, A. Erraqabi, Y. Bengio, and S. Lacoste-Julien, Controlled sparsity via constrained optimization or: How i learned to stop tuning penalties and love constraints, 2022. [Online]. Available: https://arxiv.org/abs/2208.04425 [18] Z. Shen, J. Cervino, H. Hassani, and A. Ribeiro, An agnostic approach to federated learning with class imbalance, in International Conference on Learning Representations, 2022. [19] T. Nishio and R. Yonetani, Client selection for federated learning with heterogeneous resources in mobile edge, in ICC 2019-2019 IEEE international conference on communications (ICC). IEEE, 2019, pp. 1 7. [20] F. Yu, W. Zhang, Z. Qin, Z. Xu, D. Wang, C. Liu, Z. Tian, and X. Chen, Heterogeneous federated learning, ar Xiv preprint ar Xiv:2008.06767, 2020. [21] I. Hounie, L. F. O. Chamon, and A. Ribeiro, Automatic data augmentation via invarianceconstrained learning, Ar Xiv, vol. abs/2209.15031, 2022. [22] G. Benton, M. Finzi, P. Izmailov, and A. G. 
Wilson, Learning invariances in neural networks from training data, Advances in Neural Information Processing Systems, vol. 33, pp. 17 605 17 616, 2020. [23] A. Immer, T. F. A. van der Ouderaa, V. Fortuin, G. Rätsch, and M. van der Wilk, Invariance learning in deep neural networks with differentiable laplace approximations, Co RR, vol. abs/2202.10638, 2022. [Online]. Available: https://arxiv.org/abs/2202.10638 [24] J. Elenter, N. Naderi Alizadeh, and A. Ribeiro, A lagrangian duality approach to active learning, ar Xiv preprint ar Xiv:2202.04108, 2022. [25] N. Quadrianto, V. Sharmanska, and O. Thomas, Discovering fair representations in the data domain, in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 8227 8236. [26] S. Baharlouei, M. Nouiehed, A. Beirami, and M. Razaviyayn, R\ enyi fair inference, ar Xiv preprint ar Xiv:1906.12005, 2019. [27] S. Jung, D. Lee, T. Park, and T. Moon, Fair feature distillation for visual recognition, in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 12 115 12 124. [28] S. Jung, T. Park, S. Chun, and T. Moon, Re-weighting based group fairness regularization via classwise robust optimization, ar Xiv preprint ar Xiv:2303.00442, 2023. [29] A. Sinha, H. Namkoong, R. Volpi, and J. Duchi, Certifying some distributional robustness with principled adversarial training, ar Xiv preprint ar Xiv:1710.10571, 2017. [30] H. Zhang, Y. Yu, J. Jiao, E. Xing, L. El Ghaoui, and M. Jordan, Theoretically principled trade-off between robustness and accuracy, in International Conference on Machine Learning. PMLR, 2019, pp. 7472 7482. [31] A. Agarwal, A. Beygelzimer, M. Dudík, J. Langford, and H. Wallach, A reductions approach to fair classification, in International Conference on Machine Learning. PMLR, 2018, pp. 60 69. [32] M. Donini, L. Oneto, S. Ben-David, J. Shawe-Taylor, and M. Pontil, Empirical risk minimization under fairness constraints, ar Xiv preprint ar Xiv:1802.08626, 2018. [33] M. Kearns, S. Neel, A. Roth, and Z. S. Wu, Preventing fairness gerrymandering: Auditing and learning for subgroup fairness, in International Conference on Machine Learning. PMLR, 2018, pp. 2564 2572. [34] A. Cotter, H. Jiang, M. R. Gupta, S. Wang, T. Narayan, S. You, and K. Sridharan, Optimization with non-differentiable constraints with applications to fairness, recall, churn, and other goals. Journal of Machine Learning Research, vol. 20, no. 172, pp. 1 59, 2019. [35] A. Robey, L. Chamon, G. J. Pappas, and H. Hassani, Probabilistically robust learning: Balancing average and worst-case performance, in Proceedings of the 39th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, Eds., vol. 162. PMLR, 17 23 Jul 2022, pp. 18 667 18 686. [Online]. Available: https://proceedings.mlr.press/v162/robey22a.html [36] C. S. Holling, Resilience and stability of ecological systems, Annual Review of Ecology, Evolution, and Systematics, vol. 4, pp. 1 23, 1973. [37] , Engineering resilience versus ecological resilience, Engineering within ecological constraints, vol. 31, no. 1996, p. 32, 1996. [38] L. F. O. Chamon, S. Paternain, M. Calvo-Fullana, and A. Ribeiro, Constrained learning with non-convex losses, IEEE Transactions on Information Theory, vol. 69, no. 3, pp. 1739 1760, 2023. [39] S. Boyd, S. P. Boyd, and L. Vandenberghe, Convex optimization. Cambridge university press, 2004. [40] J. F. Bonnans and A. 
Shapiro, Optimization problems with perturbations: A guided tour, SIAM review, vol. 40, no. 2, pp. 228 264, 1998. [41] M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of machine learning. MIT press, 2018. [42] R. Ge, J. D. Lee, and T. Ma, Learning one-hidden-layer neural networks with landscape design, in International Conference on Learning Representations, 2018. [43] A. Brutzkus and A. Globerson, Globally optimal gradient descent for a convnet with gaussian inputs, in International Conference on Machine Learning, 2017, pp. 605 614. [44] M. Soltanolkotabi, A. Javanmard, and J. D. Lee, Theoretical insights into the optimization landscape of over-parameterized shallow neural networks, IEEE Transactions on Information Theory, vol. 65, no. 2, pp. 742 769, 2018. [45] R. Shokri and V. Shmatikov, Privacy-preserving deep learning, in Proceedings of the 22nd ACM SIGSAC conference on computer and communications security, 2015, pp. 1310 1321. [46] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, Federated optimization in heterogeneous networks, Proceedings of Machine learning and systems, vol. 2, pp. 429 450, 2020. [47] H. Wang, Z. Kaplan, D. Niu, and B. Li, Optimizing federated learning on non-iid data with reinforcement learning, in IEEE INFOCOM 2020-IEEE Conference on Computer Communications. IEEE, 2020, pp. 1698 1707. [48] Y. Le Cun, C. Cortes, and C. Burges, Mnist handwritten digit database, ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, vol. 2, 2010. [49] H. Xiao, K. Rasul, and R. Vollgraf, Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms, 2017. [50] R. T. Rockafellar, Conjugate duality and optimization. SIAM, 1974. [51] D. P. Bertsekas, Nonlinear programming, Journal of the Operational Research Society, vol. 48, no. 3, pp. 334 334, 1997. [52] D. Bertsekas, Convex optimization theory. Athena Scientific, 2009, vol. 1. [53] R. T. Rockafellar, Convex analysis. Princeton university press, 1997, vol. 11. [54] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, ar Xiv preprint ar Xiv:1412.6980, 2014. [55] S. Boyd, L. Xiao, and A. Mutapcic, Subgradient methods, lecture notes of EE392o, Stanford University, Autumn Quarter, vol. 2004, pp. 2004 2005, 2003. [56] B. Mc Mahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, Communication-efficient learning of deep networks from decentralized data, in Artificial intelligence and statistics. PMLR, 2017, pp. 1273 1282. [57] X. Zhang, M. Hong, S. V. Dhople, W. Yin, and Y. Liu, Fedpd: A federated learning framework with adaptivity to non-iid data, IEEE Transactions on Signal Processing, vol. 69, pp. 6055 6070, 2020. [58] D. A. E. Acar, Y. Zhao, R. M. Navarro, M. Mattina, P. N. Whatmough, and V. Saligrama, Federated learning based on dynamic regularization, ar Xiv preprint ar Xiv:2111.04263, 2021. [59] A. Krizhevsky, Learning multiple layers of features from tiny images, Canadian Institute for Advanced Research, Tech. Rep., 2009. [60] L. Wang, S. Xu, X. Wang, and Q. Zhu, Addressing class imbalance in federated learning, in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 11, 2021, pp. 10 165 10 173. A Constrained learnability and relaxations In this section, we illustrate how constraints affect learnability, an issue that can be exacerbated by relaxing their specification either too much or too little. 
Resilient constrained learning, on the other hand, progressively adapts to the underlying problem, striking a balance between modifying the learning task and improving the objective value that can overcome these issues. Let us begin by stating what we mean by constrained learnability (see [38, Def. 4] for details). Definition 2 (PACC). A hypothesis class Fθ is probably approximately correct constrained (PACC) learnable with respect to the losses {ℓi}i=0,...,m if there exists an algorithm that, for every ϵ, δ (0, 1) and every distribution Di, i = 0, . . . , m, can obtain fθ Fθ using N(ϵ, δ, m) samples from each Di that is, with probability 1 δ, 1) probably approximately optimal, i.e., for P as in (P), E(x,y) D0 ℓ0 fθ(x), y P ϵ, and (7) 2) probably approximately feasible, i.e., E(x,y) Di ℓi fθ(x), y ϵ, for all i 1. (8) Definition 2 is an extension of the probably approximately correct (PAC) framework from classical learning theory to the problem of learning under requirements. Indeed, note that for unconstrained learning problems [m = 0 in (P)], P ED0 ℓ0 fθ(x), y for all θ Θ. In that case, (7) reduces to the classical definition of PAC learnability [41, Def. 2.14]. In general, however, a PACC learner must also satisfy the approximate feasibility condition (8). Constrained learning theory shows that, under mild conditions, the PAC learnability of Fθ with respect to each individual loss ℓi implies its PACC learnability [38, Thm. 1]. This does not mean that an empirical version of (P) is a PACC learner, i.e., approximates the value of P . In fact, this is typically not the the case. Case in point, consider the learning problem P e = min θ Θ J(θ) ED0 |θ x| subject to ED1 yθ x 1, ED2 yθ x 1, (PII) where D0 is the distribution of (x, y) = [τ, τ]T , 1 , with prob. 1/2 [0, α]T , 1 , with prob. 1/2 , D1 is such that (x, y) = ([ 1, τ], 1), and D2 is such that (x, y) = ([ τ, 1], 1), where α is drawn uniformly at random from [0, 1/4] and τ is drawn uniformly at random from [ 1/2, 1/2]. Notice that the Di are correlated through the random variable τ. The parameters θ = [θ1, θ2] are taken from the finite set Θ = θa, θb, θc, θd , described below together with their corresponding objective values: 1/32, θ = θa [1/2, 1/2]T 1/16, θ = θb [1, 1]T 1/16 + 1/24, θ = θc [1, 1/3]T 1/8, θ = θd [1, 0]T Notice that under these distributions, the constraints in (PII) reduce to θ1 1 and θ2 1. Hence, all parameters expect for [1/2, 1/2]T are feasible for the statistical (PII). From (9), its optimal value is therefore P e = 1/16 achieved for θ = θb. In the sequel, we will show that (i) the empirical version of (PII) [as in (PIII)] recovers its solution with exponentially small probability (Sec. A.1); (ii) relaxing the empirical problem [as in (PIV)] too little or too much fails to mitigate this issue (Sec. A.2); and (iii) a resilient relaxation [as in (PV)] allows the empirical version of (PII) to recover its solutions with high probability. In summary, the empirical counterpart of (PII) can be relaxed into a PACC learner, but care must be taken when choosing that relaxations. A.1 Empirical constrained risk minimization (ECRM) Consider the empirical version of (9), which can be written as ˆP e = min θ Θ ˆJ(θ) 1 subject to θ1 1 + τθ2, θ2 1 + τθ1, where τ = 1 N PN n=1 τn is the empirical average of i.i.d. samples τn drawn uniformly at random from [ 1/2, 1/2]. 
Notice that for θa, the first constraint reduces to 1/2 1 + τ/2 and since τ 1/2, it is violated unless τ = 0; and for θb, the constraints read 1 1 + τ and 1 1 + τ, so that at least one is violated unless τ = 0. Since τ is a continuous distribution, τ = 0 almost surely. Immediately, we obtain that ˆP e min( ˆJ(θc), ˆJ(θd)). Additionally, ˆP e < since θd is always feasible. Hence, P |ˆP e P e| 1/64 P | ˆJ(θc) P e| 1/64 | ˆJ(θd) P e| 1/64 P | ˆJ(θc) P e| 1/64 + P | ˆJ(θd) P e| 1/64 P | ˆJ(θc) J(θc)| 1/24 1/64 + P | ˆJ(θd) J(θd)| 1/16 1/64 A.2 Relaxed ECRM Consider now a relaxation of (PIII), namely ˆP x = min θ Θ ˆJ(θ) 1 subject to θ1 1 + τθ2 u1, θ2 1 + τθ1 + u2. Problem (PIV) is the empirical version of (Pu) for the relaxation u = [u1, u2]T . It is straightforward to see that if u1, u2 < τ, then we are in the situation of Sec. A.1 and once again P |ˆP e P e| 1/64 0 as N . On the other hand, notice that the constraints always hold whenever u1 1 + τθ2 θ1 and u2 θ2 τθ1 1: For θa, this reduces to u1 ( τ + 1)/2, since u2 ( τ + 1)/2 is satisfied for all u 0. We therefore obtain that 0 ˆP x ˆJ(θa), for u1 τ + 1 In this case, recalling that τ = 0 almost surely yields P |ˆP x P e| 1/64 = P | ˆJ(θa) 1/16| 1/64 ˆJ(θa) ˆJ(θb) ˆJ(θc) ˆJ(θd) P | ˆJ(θa) 1/16| 1/64 , where we use x y = min(x, y). Using the fact that J(θa) = 1/32, we obtain P |ˆP x P e| 1/64 P | ˆJ(θa) J(θa) 1/32| 1/64 P | ˆJ(θa) J(θa)| 1/64 2e 0.0001N. For θb, this becomes u1 τ and u2 τ. Hence, ˆJ(θa) = ˆP x ˆJ(θb), for τ u1 < τ + 1 2 and u2 τ. Since P e = J(θb), we therefore get P |ˆP x P e| 1/64 = P | ˆJ(θb) J(θb)| 1/64 ˆJ(θb) ˆJ(θc) ˆJ(θd) P ˆJ(θb) ˆJ(θc) ˆJ(θd) P | ˆJ(θb) J(θb)| 1/64 Notice that since ˆJ(θ) concentrates around J(θ) for θ Θ, for every ϵ > 0, there exists N0 such that P ˆJ(θb) ˆJ(θc) ˆJ(θd) 1 ϵ. We therefore conclude that P |ˆP x P e| 1/64 1 ϵ P | ˆJ(θb) J(θb)| 1/64 1 ϵ 2e 0.0004N In summary, we have that P |ˆP x P e| 1/64 = 4e 0.001N, u1, u2 < τ 1 3e Nt2, τ u1 < τ + 1 2e 0.0001N, u1 τ + 1 A.3 Resilient ECRM Finally, consider the resilient version of (PIII) ˆP R = min θ Θ, u Rm + ˆJ(θ) + u 2 subject to θ1 1 + τθ2 u1, θ2 1 + τθ1 + u2 and let ˆθ R be a solution of (PV). Notice that (PV) is an empirical version of the equivalent formulation in Prop. 4. We consider the value of the problem for each possible element of Θ: For θa to be feasible, it must be that u1 ( τ + 1)/2 (u2, on the other hand, can be 0). Hence, we have that ˆP R ˆJ(θa) + ( τ + 1)2 For θb to be feasible, it must be that u1 τ and u2 τ. Hence, ˆP R ˆJ(θb) + τ 2. For θc to be feasible, it must be that u1 τ/3 (once again the second constraint need not be relaxed). Hence, ˆP R ˆJ(θc) + 2 τ 2 Finally, θd is feasible for all u Rm +. Hence, ˆP R ˆJ(θd). Putting together these conditions, we obtain that for θb to be optimal, i.e., for ˆJ(ˆθ R) P e, it must be that ˆJ(θb) + τ 2 ˆJ(θa) + ( τ + 1)2 8 ˆJ(θc) + 2 τ 2 9 ˆJ(θd) (10) From the fact that ˆJ(θ) concentrates around J(θ) for θ Θ (e.g., by Hoeffding s inequality) and using the values of J from (9), it holds that there exists N0 such that θb is optimal for N N0 as long as | τ| 0.2. We therefore conclude that P |J(ˆθ R) P e| 1/64 P ˆθ R = θb P | τ| 0.2 1 2e 0.08N. B Convexity of the Perturbation Function for Nonconvex Losses The definition of resilient equilibrium and all subsequent results in the paper require convexity of the perturbation function P (u) defined in ( Pu). This requisite is stated in Assumption 1. 
Assumption 1 is well known to hold if the optimization problem ( Pu) is convex, for which it is sufficient for the class F to be a convex set and the losses ℓi to be convex functions (see, e.g., [40]). If the problem is not parametric, as is, e.g., (P), then assuming the hypothesis class F is convex is mild. Consider, e.g., the set F to be the intersection of the space of Di-measurable functions. Overall, it is not a significant restriction to assume that the losses are convex since this is the case for typical losses, such as cross entropy and mean squared error1. This convex losses requirement can be relaxed as long as the distributions are well-behaved and the hypothesis class F is rich enough. We define these conditions in two technical assumptions. In what follows, we consider the distributions Di are defined over the same probability space (X Y, Σ, Pi) for the sigma-algebra Σ and the measures Pi. Assumption 6 (Non-atomic distributions). The set Y is finite and the conditional random variables x | y induced by the Di measures are non-atomic. That is, for every measurable event Z Σ with strictly positive conditional measure Pi,x|y(Z) > 0, there exist an event Σ Z Z strictly included in Z with positive conditional measure mb Pi,x|y(Z ) > 0. Assumption 7 (Decomposable hypothesis class). The set F is decomposable in the sense that for all pair of functions ϕ1, ϕ2 F and all measurable set Z Σ, it holds that ϕ F for ϕ(x) = ϕ1(x), x Z ϕ2(x), x / Z Assumption 6 requires data probability distributions in which all points have zero measure. Assumption 7 requires hypothesis classes that are closed under cutting and stitching of functions. In turns out that Assumptions 6 and 7 are enough to guarantee that the perturbation function is convex. In particular, they imply the following result [38, Lemma A.1]. Lemma 1. Define the cost constraint epigraph of ( Pu) to be the set C = [s0; s] Rm+1 | ϕ F such that EDi [ℓi(ϕ(x), y)] si i = 0, . . . , m . (11) Under Assumptions 6 and 7, the set C is convex. Notice that Lemma 1 does not require the losses ℓi to be convex for the cost-constraint epigraph to be convex. This is a significant fact since it implies that the perturbation function of ( Pu) is also convex, as we show next. Proposition 5. Consider the perturbation function P (u) defined in ( Pu). Under Assumptions 6 and 7, P (u) is convex. Proof. Let v and w be two perturbations of ( Pu) and define the convex combination perturbation ua = av + (1 a)w for some a [0, 1]. To prove that P (u) is convex we need to show that P (ua) a P (v) + (1 a) P (w) for all a. If either v or w yield infeasible problems we have P (v) = or P (w) = . In this case, P (ua) a P (v) + (1 a) P (w) = . In the more interesting case where v and w both yield feasible problems, we have that [ P (v); v] C and [ P (w); w] C, (12) for C defined in (11). Indeed, any ϕ v that solves ( Pv) must satisfy EDi [ℓi(ϕ v(x), y)] ui for i > 0 and ED0 [ℓ0(ϕ v(x), y)] = P (v). Likewise for a function ϕ w that solves ( Pw). Under Assumptions 6 and 7, we can combine (12) and Lemma 1 to get that a[ P (v); v] + (1 a)[ P (w); w] C, for all a [0, 1]. (13) 1It is only when dealing with the parametrized Fθ that the composition of the parametrization with the loss may lead to a convex function of the decision variable θ. For this to hold, the definition of C in (11) implies that there exists a function ϕa F such that E [ℓ0(ϕa(x), y)] a P (v) + (1 a) P (w) and (14a) E [ℓi(ϕa(x), y)] aui + (1 a)vi, for i = 1, . . . , m. 
(14b) From (14b) we obtain that ϕa is feasible for ( Pua). Hence, it is a suboptimal solution of ( Pua), which implies that P (ua) E [ℓ0(ϕa(x), y)] a P (v) + (1 a) P (w), (15) thus concluding the proof. Proposition 5 shows that the perturbation function P (u) can be convex even when ( Pu) is defined based on non-convex losses. In the context of resilient learning, this is important because it allows us to quantify the variations of the perturbation function, which is the basis for the resilient equilibrium from Def. 1. The convexity of the perturbation function P (u) is also relevant because, under a constraint qualification such as the one stated in Assumption 2, it implies that the optimization problem is strongly dual. We encapsulate this result in the next lemma since it will be used in subsequent derivations. Lemma 2. Suppose P (u) associated to ( Pu) is convex (Assumption 1) and there exists ϕ F strictly feasible for ( P0) (Assumption 2). Then, strong duality holds for ( Pu), i.e., P = D . Proof. The result follows from [50, Thm. 15] since the constraint qualification in Assumption (2) implies that P is lower semi-continuous [50, Thm. 18(a)], i.e., that its epigraph is closed [51, Prop. 1.1.2]. C Proof of Thm. 1 The proof is divided in three steps that transform the resilient functional dual problem ( P-RES) into the resilient, parametrized, empirical dual problem (ˆD-RES). Explicitly, define the Lagrangian of ( P-RES) as LR(ϕ, λ; u) = ED0 ℓ0(ϕ(x), y) + h(u) + i=1 λi EDi ℓi(ϕ(x), y) ui (16) and its dual problem as D R = max λ Rm + min ϕ F, u Rm + LR(ϕ, λ; u). (D-RES) First, from lemma 2 we have that if assumption 1 and assumption 2 hold then the problem is strongly dual, i.e., P R = D R. Then, to relate (D-RES) to its parametrized, empirical version (ˆD-RES), we begin by approximating F by Fθ. Explicitly, we consider the difference between P R and D θ = max λ Rm + min θ Θ, u Rm + LR(fθ, λ; u). (Dθ-RES) This difference, which we call the approximation error, is bounded in the following proposition: Proposition 6. Under Assumptions 1 2, it holds that P R D θ P R + h(u + 1 Mν) h(u ) + Mν, (17) for P R and u as in ( P-RES) and D θ as in (Dθ-RES). The proof of Prop. 6 is deferred to Sec. C.1. We proceed by replacing the expectations in (Dθ-RES) by empirical averages, leading to a generalization error bounded as follows: Proposition 7. Let D θ be the value of the parametrized dual problem (Dθ-RES) achieved for a Lagrange multiplier λ and ˆD R be the value of the empirical, parametrized dual problem (ˆD-RES) achieved for a Lagrange multiplier ˆλ . Under Assumption (5), it holds that D θ ˆD R (1 + )ξ(N, δ) (18) with probability 1 2(m + 1)δ, where = max λ 1, ˆλ 1 . A proof can be found in Sec. C.2. Combining the results in Prop. 6 and 7 using the triangle inequality yields (6) and concludes the proof of Thm. 1. C.1 Bounding the parametrization error: proof of Prop. 6 The lower bound stems directly from the definition of the dual problem in (Dθ-RES). Indeed, it is immediate that D θ min θ Θ, u Rm + LR(fθ, λ; u), for all λ Rm +. (19) In particular, (19) holds for a solution λ of the resilient functional dual problem D-RES. Explicitly, we can write D θ min θ Θ, u Rm + LR(fθ, λ ; u) Since Fθ F, it follows that D θ min θ Θ, u Rm + LR(fθ, λ ; u) min ϕ F, u Rm + LR(ϕ, λ ; u) = D R = P R. (20) The upper bound is obtained by relating the parameterized dual problem (Dθ-RES) to a tightened version of ( P-RES). 
Start by adding and subtracting LR(ϕ, λ; u) in (16) from the objective of the parametrized dual problem (Dθ-RES) to get LR(fθ, λ; u) = LR(ϕ, λ; u) + LR(fθ, λ; u) LR(ϕ, λ; u) = LR(ϕ, λ; u) + ED0 ℓ0(fθ(x), y) ℓ0(ϕ(x), y) i=1 λi EDi ℓi(fθ(x), y) ℓi(ϕ(x), y) , where the relaxations ui and relaxation costs h(u) canceled out. To proceed, use the Lipschitz continuity of the losses (Assumption 3) to bound the expected loss difference as EDi ℓi fθ(x), y ℓi(ϕ(x), y EDi h ℓi ϕ(x), y ℓi fθ(x), y i MEDi |fθ ϕ(x)| , for i = 0, . . . , m. We can then bound (21) as LR(fθ, λ; u) LR(ϕ, λ; u) + M 1 + max i {0,...,m} EDi |fθ ϕ(x)| . (22) Minimizing (22) over θ yields min θ Θ LR(fθ, λ; u) LR(ϕ, λ; u) + M 1 + min θ Θ max i {0,...,m} EDi |fθ ϕ(x)| . (23) However, from the ν-approximation property of the parametrization (Assumption 4), for every ϕ F there exists θ Θ such that EDi |fθ ϕ(x)| ν, for i = 0, . . . , m. Hence, (23) simplifies to min θ Θ LR(fθ, λ; u) LR(ϕ, λ; u) + 1 + Notice that (24) holds uniformly over ϕ F. Additionally, since the left-hand side does not depend on ϕ, minimizing it over u and ϕ then maximizing with respect to λ recover the dual value from (Dθ-RES). We immediately obtain the upper bound D θ max λ Rm + min ϕ F, u Rm + LR(ϕ, λ; u) + 1 + Expanding LR and rearranging the terms in (25), we recognize the Lagrangian of a perturbed version of ( P-RES), namely P Mν = min ϕ F, u Rm + ED0 h ℓ0 ϕ(x), y i + h(u) + Mν subject to E(x,y) Di h ℓi ϕ(x), y i ui Mν, i = 1, . . . , m. (PVI) Hence, the right-hand side of (25) is the dual problem of (PVI). Since (PVI) is strongly dual (from lemma 2), we obtain that D θ P Mν. To bound P Mν, note that we can construct a feasible pair for (PVI) from any pair (ϕ , u ) that solves the resilient functional learning problem ( P-RES). Explicitly, observe that the pair (ϕ , u + 1 Mν) is a (suboptimal) feasible point of (PVI). Thus, D θ P Mν ED0 h ℓ0 ϕ (x), y i + h(u + 1 Mν) + Mν P R h(u ) + h(u + 1 Mν) + Mν (26) Combining (20) and (26) concludes the proof. C.2 Bounding the generalization error: Proof of Prop. 7 Let λ and ˆλ be solution pairs of (Dθ-RES) and (ˆD-RES) respectively and consider the sets of dual minimizers M(λ) = argmin θ Θ, u Rm + LR(fθ, λ; u) and ˆ M(λ) = argmin θ Θ, u Rm + ˆLθ(θ, λ; u). Note that member of these set are pairs (θ, u). Since ˆλ maximizes the minimum (with respect to θ and u) of the empirical Lagrangian in (5), it holds that minθ Θ, u Rm + ˆLθ(θ, ˆλ ; u) minθ Θ, u Rm + ˆLθ(θ, λ; u) for all λ Rm +. In particular, D θ ˆD R = min θ Θ, u Rm + LR(fθ, λ ; u) min θ Θ, u Rm + ˆLθ(θ, ˆλ ; u) min θ Θ, u Rm + LR(fθ, λ ; u) min θ Θ, u Rm + ˆLθ(θ, λ ; u). Since (ˆθ , ˆu ) ˆ M(λ ) is suboptimal for LR(fθ, λ ; u), we get D θ ˆD R LR(fˆθ , λ ; ˆu ) ˆLθ(ˆθ , λ ; ˆu ). Using a similar argument yields D θ ˆD R LR(fθ , ˆλ ; u ) ˆLθ(θ , ˆλ ; u ) for (θ , u ) M(ˆλ ). We thus obtain D θ ˆD R max LR(fˆθ , λ ; ˆu ) ˆLθ(ˆθ , λ ; ˆu ) , LR(fθ , ˆλ ; u ) ˆLθ(θ , ˆλ ; u ) (27) Using the uniform convergence bound from Assumption 5, we obtain that LR(fθ, λ; u) ˆLθ(θ, λ; u) ξ(N, δ) + i=1 λiξ(N, δ) (1 + λ 1)ξ(N, δ), (28) holds uniformly over θ with probability 1 (m + 1)δ. Combining (27) with (28) using the union bound concludes the proof. D Proofs from Main Text D.1 Proof of Proposition 1 Proof. [Existence] To show there exists u Rm + satisfying Def. 1, notice that (2) is equivalent to showing that there exists u such that 0 f(u ) for f(u) = P (u) + h(u). Notice that f is a convex function, so that its subgradient is well-defined [see (1)]. 
Start by noticing that the set of minimizers of f is not empty. Explicitly, let U = argminu Rm + f(u). Observe that since the losses ℓi are bounded, P ( u) = P (v) for all v u = B 1. Immediately, we obtain that f(v) is a componentwise increasing function for v u, which implies that U = argminu U f(u), where U = {u Rm + | u u}. Since U is compact and f is convex, its set of minimizers U is compact and non-empty [51, Prop. B.10(b)]. For all u int(U ), where int(Z) denotes the interior of the set Z, it must be that 0 f(u ) [52, Prop. 5.4.7]. This is simply a statement of the first-order optimality condition for u . Suppose, now, that U has no interior and that for all u U , [u ]j = 0 for some j. Then, it could be that for all p f(u ) are such that [p]j > 0. However, this would lead to a contradiction. Indeed, since h is normalized, it holds that [ h(u )]j = 0, so that f(u ) = P (u ). Thus, if all p f(u ) are such that [p]j > 0, then all p P (u ) are such that [p]j > 0. From the convexity of P , this would imply that P (u + ej) P (u ) + [p]j > P (u ), where ej is the j-th element of the canonical basis. This contradicts the fact that P is componentwise non-increasing. It must therefore be that 0 f(u ). [Unicity] If h is strictly convex, then f is strictly convex. Then, its minimizer is unique [53, Chap. 27], i.e., U is a singleton. Suppose there exists u / U that satisfies the equilibrium in Def. 1. Then, from (2), it holds that 0 f(u ). Since f is convex, this implies that u is a minimizer, which violates the hypothesis that it is not in U . D.2 Proposition 2 Proof. From the definition of sub-differential, it holds for any pv P (v) and pw P (w) that P (v) P (w) + p T w(v w) (29) P (w) P (v) + p T v (w v) (30) Adding (29) and (30), we can rearrange the terms to get 0 (pv pw)T (w v) (31) Then, since vi ui > 0 if i = j and 0 otherwise, (31) implies that [pv]j [pw]j 0, which yields the desired result. Applying the same argument for h concludes the proof. D.3 Proposition 3 Proof. The strong duality of ( P-RES) under assumptions 1 and 2 holds due to Lemma 2. The proof then follows the same steps used to prove that optimal Lagrange multipliers are subgradients of the dual function in convex problems; see, e.g., [39, Section 5.6.2]. We just verify that the proof holds despite the non-convexity of the problem as long as the problem is strongly dual. By definition of the dual function we can write P (u) = D (u) = g(λ ; u) = min ϕ L ϕ, λ (u); u L ϕ, λ (u); u (32) where the inequality is true for any function ϕ. We particularize this inequality to a function ϕ ( ; v) that attains the minimum of (Pu) for relaxation v. We can therefore write P (u) L ϕ ( ; v), λ (u); u , (33) We now substitute the definition of the Lagrangian in (3) for the right hand side of (33) to write P (u) ED0 h ℓ0(ϕ (x; v), y) i + i=1 λ i (u) EDi h ℓi(ϕ (x; v), y) i ui The important property to observe in (34) is that we consider perturbation u and corresponding dual variable λ (u) while evaluating the Lagrangian at the function ϕ (x; v) that is primal optimal for perturbation v. This latter fact implies that the following equality and inequalities are true ED0 h ℓ0(ϕ (x; v), y) i = P (v), EDi h ℓi(ϕ (x; v), y) i vi. (35) Using the relationships in (35) and (34) we conclude that P (u) P (v) + i=1 λ i (u) h vi ui i . (36) The two sums in the right hand side of (36) equal the inner product of multiplier λ (u) with (v u). Implementing this substitution in (36) and reordering terms yields P (v) P (u) λ (u)T (v u). 
(37) Comparing (37) with (1) we see that λ (u) satisfies Definition 1. D.4 Proposition 4 Proof. As we did in the proof of Prop. 1, we begin by showing that u is a minimizer of ( P +h)(u) = P (u) + h(u). Explicitly, u argmin u Rm + ( P + h)(u). ( P-RES ) As shown in proposition 1, u minimizes P + h iff 0 ( P + h)(u ). Then 0 ( P + h)(u ) is also equivalent to the resilient equilibrium definition (2). In ( Pu), the function P (u) is defined as the solution of ( Pu). To show that ( P-RES ) and ( P-RES) are equivalent a nested minimization over variables ϕ and u is equivalent to a joint minimization over ϕ and u. To confirm that this holds here recall the definition of u as the resilient relaxation and of ϕ (x; u) as the minimizer of the relaxed problem ( Pu) which holds for all u and u in particular. Since u is the solution of ( P-RES ). We therefore have that for all perturbations u E(x,y) D0 h ℓ0(ϕ (x; u ), y) i + h(u ) E(x,y) D0 h ℓ0(ϕ (x; u), y) i + h(u). (38) Further observe that ϕ (x; u) is the minimizer of the relaxed problem (Pu) associated with perturbation u. We then have that for any feasible function ϕ we must have E(x,y) D0 h ℓ0(ϕ (x; u), y) i + h(u) E(x,y) D0 h ℓ0(ϕ(x), y) i + h(u). (39) The bound in (39) is true for all ϕ and u. In particular, it is true for the solution ϕ , u of ( P-RES). Particularizing (39) to this pair and combining the resulting inequality with the bound in (38) we conclude that E(x,y) D0 h ℓ0(ϕ (x; u ), y) i + h(u ) E(x,y) D0 h ℓ0(ϕ (x), y) i + h(u ). (40) On the other hand, given that ϕ , u solve ( P-RES) we have that for all feasible u and ϕ E(x,y) D0 h ℓ0(ϕ (x), y) i + h(u ) E(x,y) D0 h ℓ0(ϕ(x), y) i + h(u). (41) In particular, this is true if we make u = u and ϕ = ϕ (x; u ). We can then write, E(x,y) D0 h ℓ0(ϕ (x), y) i + h(u ) E(x,y) D0 h ℓ0(ϕ (x; u ), y) i + h(u ) (42) For (40) and (42) to hold we must have that ϕ , u is a solution of ( P-RES) if and only if u and ϕ = ϕ (x; u ) are a resilient perturbation and a corresponding resilient minimizer. D.5 Proposition 8 Proposition 8. Let Φ = argminϕ F E(x,y) D0 h ℓ0(ϕ(x; u ), y) i denote the solution set of the unconstrained problem and bi = minϕ Φ E(x,y) Di h ℓi(ϕ(x; u ), y) i , i = 1, . . . , m be the minimum constraint losses achieved. We show there exists costs such that u i ϵ and u i bi ϵ for i = 1, . . . , m. First, we will show sufficient conditions for u 1ϵ and u b 1ϵ, in terms of P and h evaluated at 1ϵ and b 1ϵ, respectively. For any u Rm + the strong duality of P (lemma 2) implies that P (u) is non empty and bounded. Also, the convexity of P (assumption 1) implies that for all pu P (u) and pu P (u ) by (31) from proposition (3) it holds that D pu pu, u u E 0 (43) Analogously, due to the convexity of h we have that D h(u ) h(u), u u E 0 (44) Adding (43) and (44) we obtain D [pu + h(u )] [pu + h(u)] , u u E 0 (45) (45) holds for any pu P (u ). Since u achieves the resilient equilibrium, by definition h(u ) = pu P (u ), which implies that for all pu P (u) D [pu + h(u)] , u u E 0 (46) Finally, by choosing particular values of u in (46) we obtain the desired conditions. (u 1ϵ): Letting a = ϵ, (46) implies that if there exists p1ϵ P (1ϵ) such that p(1ϵ) + h(1ϵ) 0 then u 1ϵ. (47) (u 1b 1ϵ ): Letting a = b 1ϵ, (46) implies that if there exists p(b 1ϵ)+ P ((b 1ϵ)+) such that p(b 1ϵ)+ + h((b 1ϵ)+) 0 then u (b 1ϵ)+. (48) Let h(u) = α u 2 2, α R+, which has h(u) = 2αu. We will now show that there exist coefficients α and α such that the associated costs h and h satisfy the conditions (47) and (48). 
(h̄): Let ᾱ > ‖p1ϵ‖∞ / (2ϵ). Then ∇h̄(1ϵ)i = 2ᾱϵ > [−p1ϵ]i for all i = 1, . . . , m, which implies that h̄ satisfies (47) and thus u⋆ ≤ 1ϵ.

(h̲): Let α̲ < mini [−p(b−1ϵ)+]i / (2(bi − ϵ)). Then ∇h̲((b − 1ϵ)+)i = 2α̲(bi − ϵ) < [−p(b−1ϵ)+]i for all i = 1, . . . , m, which implies that h̲ satisfies (48) and thus u⋆ ≥ (b − 1ϵ)+. Since the definition of b and ϵ ensures that mini [−p(b−1ϵ)+]i / (2(bi − ϵ)) > 0, there exists α̲ > 0 that satisfies this condition.

E Primal and Dual Updates

In Section 4 we introduced the unconstrained problem (D̂-RES) that approximates our original problem (Pu) through the saddle point of the empirical Lagrangian L̂θ:
L̂θ(θ, λ; u) = h(u) + (1/N0) Σ_{n=1}^{N0} ℓ0(fθ(xn,0), yn,0) + Σ_{i=1}^{m} λi [ (1/Ni) Σ_{n=1}^{Ni} ℓi(fθ(xn,i), yn,i) − ui ].
As described in Algorithm 1, the saddle point can be obtained by alternating the minimization of the Lagrangian with respect to the primal variables and its maximization with respect to the dual variables. We now describe both updates in greater detail. For clarity, we describe them using gradient descent and (sub)gradient ascent for the primal and dual variables, respectively; nothing, however, precludes our method from using other optimization algorithms, such as Adam [54] or SGD with Nesterov momentum. The optimization is carried out via gradient descent for the relaxation u and stochastic gradient descent for the primal variables θ,
θ(t+1) = θ(t) − ηθ d̂θ(t), u(t+1) = u(t) − ηu du(t), (50)
and stochastic subgradient ascent for the dual variables,
λ(t+1) = λ(t) + ηλ d̂λ(t). (51)
To obtain the gradients of the Lagrangian with respect to the primal variables, observe that its minimization separates into two parts, each depending on only one primal variable,
L̂θ(θ, λ; u) = L̂θ(θ, λ) + L̂u(λ, u), (52)
where L̂θ(θ, λ) = L̂θ(θ, λ; 0) and
L̂u(λ, u) = h(u) − λᵀu. (53)
The gradient with respect to u is obtained from (53),
du = ∇u L̂u(λ, u) = ∇h(u) − λ. (54)
Solving for the optimal θ is similar to solving a regularized empirical risk minimization problem. The gradient with respect to θ is estimated from L̂θ(θ, λ; 0) using a minibatch of B samples as
d̂θ = (1/B) Σ_{n=1}^{B} [ ∇θ ℓ0(fθ(xn), yn) + Σ_{i=1}^{m} λi ∇θ ℓi(fθ(xn), yn) ]. (55)
The constraint slacks evaluated at the Lagrangian minimizers are supergradients of the dual function with respect to the corresponding Lagrange multipliers [55]. In the same manner, stochastic supergradient ascent updates the dual variable λ with the supergradient estimated using a batch of B samples,
d̂λ,i = (1/B) Σ_{n=1}^{B} ℓi(fθ(xn), yn) − ui, i = 1, . . . , m. (56)
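To make the alternating updates (50)–(56) concrete, the following is a minimal sketch of one resilient primal-dual iteration written in PyTorch. It assumes a quadratic relaxation cost h(u) = α‖u‖²₂ (so ∇h(u) = 2αu) and purely illustrative names (model, obj_loss_fn, cons_loss_fns, batches); it is a sketch of the procedure described above, not the released implementation. Projections onto the nonnegative orthant are applied to u and λ, as in Algorithm 2 below.

```python
# Minimal sketch of the resilient primal-dual updates (50)-(56), assuming a
# quadratic relaxation cost h(u) = alpha * ||u||_2^2, so grad_u h(u) = 2*alpha*u.
# All names below (model, obj_loss_fn, cons_loss_fns, batches) are illustrative.
import torch

def resilient_primal_dual_step(model, obj_loss_fn, cons_loss_fns, batches,
                               u, lam, eta_theta=1e-2, eta_u=1e-1, eta_lam=1e-1,
                               alpha=1.0):
    # --- theta update (55): SGD step on the empirical Lagrangian ---
    x0, y0 = batches[0]
    lagrangian = obj_loss_fn(model(x0), y0)
    slacks = []
    for i, loss_fn in enumerate(cons_loss_fns, start=1):
        xi, yi = batches[i]
        ci = loss_fn(model(xi), yi)
        slacks.append(ci.detach() - u[i - 1])   # constraint slack, reused in (56)
        lagrangian = lagrangian + lam[i - 1] * ci
    model.zero_grad()
    lagrangian.backward()
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p -= eta_theta * p.grad

        # --- u update (54): step on grad h(u) - lam, projected onto u >= 0 ---
        u = torch.clamp(u - eta_u * (2 * alpha * u - lam), min=0.0)

        # --- lambda update (56): supergradient ascent on the slacks, lam >= 0 ---
        lam = torch.clamp(lam + eta_lam * torch.stack(slacks), min=0.0)
    return u, lam
```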
F Heterogeneous Federated Learning Experiments

F.1 Formulation

In this section, we show that the constrained problem (P-FL) can be re-written as a constrained learning problem of the form of (P), and then introduce its resilient version. Let Di be the distribution of data pairs for client i and Ri(fθ) = E_{(x,y)∼Di}[ℓ(fθ(x), y)] its statistical risk. We denote the average performance as R(fθ) := (1/C) Σ_{i=1}^C Ri(fθ), where C is the number of clients. As proposed in [18], heterogeneity issues in federated learning can be tackled by imposing a proximity constraint between the performance of each client, Ri, and the loss averaged over all clients, R. This leads to the constrained learning problem
min_{θ∈Θ} R(fθ) (P-FL)
s. to Ri(fθ) − R(fθ) − ϵ ≤ 0, i = 1, . . . , C,
where ϵ is a small, fixed positive scalar. In order to show that problem (P-FL) can be re-written as a constrained learning problem of the form of (P), we introduce a random variable c uniformly distributed over the client indices {1, . . . , C}, i.e., P(c = i) = 1/C for all i = 1, . . . , C. Then, we construct a mixture distribution D for data pairs such that (x, y) | c = i ∼ Di. We can then re-write the average risk R(fθ) as the expected loss under D,
E_D[ℓ(fθ(x), y)] = (1/C) Σ_{i=1}^C E_{Di}[ℓ(fθ(x), y)] = R(fθ),
where the first equality holds by the law of total expectation. Let ℓi(fθ(x), y) = (C·1(c = i) − 1) ℓ(fθ(x), y) − ϵ, where 1(c = i) is the indicator function for client i. In the same manner, we can re-write the i-th constraint Ri(fθ) − R(fθ) − ϵ ≤ 0 as the expectation of the loss ℓi under D,
E_D[ℓi(fθ(x), y)] = E_{Di}[ℓ(fθ(x), y)] − ϵ − (1/C) Σ_{j=1}^C E_{Dj}[ℓ(fθ(x), y)] = Ri(fθ) − R(fθ) − ϵ.
Then, we can re-write (P-FL) as
P⋆ = min_{θ∈Θ} E_D[ℓ(fθ(x), y)]
subject to E_D[ℓi(fθ(x), y)] ≤ 0, i = 1, . . . , C, (PC-FL)
which is in the form of (P). As motivated in Section 5.1, we then propose to solve the resilient version of (P-FL),
P⋆ = min_{θ∈Θ, u∈R^C_+} R(fθ) + h(u) (PR-FL)
s. to E_{(x,y)∼Di}[ℓ(fθ(x), y)] − R(fθ) − ϵ ≤ ui, i = 1, . . . , C.
In the next section we derive the primal-dual algorithm that enables us to tackle (PR-FL) in a distributed, privacy-preserving manner.

F.2 Resilient FL Algorithm

In this section we show that the resilient problem (PR-FL) can be tackled in a distributed and privacy-preserving manner, as long as h is additively separable in each component. Explicitly, we assume that h(u) = Σ_{i=1}^C hi(ui), which holds for the quadratic cost used throughout the experimental section. We thus derive a distributed version of the resilient primal-dual algorithm presented in Section 4 (Algorithm 1). We begin by writing the empirical Lagrangian associated to problem (PR-FL),
L̂(θ, λ, u) = h(u) + (1/C) Σ_{i=1}^C (1/ni) Σ_{n=1}^{ni} ℓ(fθ(xni), yni) + Σ_{i=1}^C λi [ (1/ni) Σ_{n=1}^{ni} ℓ(fθ(xni), yni) − R̂(fθ) − ϵ − ui ] (57)
= Σ_{i=1}^C [ (1/C)(1 + λi − λ̄) (1/ni) Σ_{n=1}^{ni} ℓ(fθ(xni), yni) − λi ui + hi(ui) ] − ϵ Σ_{i=1}^C λi, (59)
where R̂(fθ) denotes the empirical average loss over all clients and λ̄ = (1/C) Σ_{i=1}^C λi. Note that in (59) we have used the assumption that the perturbation cost is additively separable. This enables us to update the perturbation for each client locally, so that the resilient formulation does not incur any additional communication cost with respect to the constrained problem in [18]. That is, clients do not need to communicate ui during optimization. The server need only compute and communicate to all clients the average dual variable λ̄ and the average loss R̂(fθ). Therefore, homomorphic encryption techniques can be leveraged to compute these averages without revealing the values of the individual dual variables λi and local losses to the server. In the same manner, homomorphic encryption could be used to aggregate the relaxation cost h(u) = Σ_{i=1}^C hi(ui) at the server without loss of privacy.

θ update. For fixed λ and u, the minimization of L̂ with respect to θ is equivalent to a re-weighted version of the standard unconstrained FL objective,
min_{θ∈Θ} L̂(θ, λ, u) ≡ min_{θ∈Θ} (1/C) Σ_{i=1}^C wi (1/ni) Σ_{n=1}^{ni} ℓ(fθ(xni), yni), with wi = 1 + λi − λ̄.
Therefore, θ can be updated using any FL solver. We thus refer to the update on θ as a subroutine ClientUpdate({wi}_{i=1}^C, θt) → θt+1, which can be implemented using, for example, FedAvg [56] or FedPD [57].

u update. For fixed λ and θ, the minimization of L̂ with respect to u can be carried out independently by each client, since the objective is additively separable,
min_{u∈R^C_+} L̂(θ, λ, u) ≡ min_{u∈R^C_+} Σ_{i=1}^C [ hi(ui) − λi ui ].
We can thus employ gradient descent (or any other optimization algorithm) locally at each client.
λ update. For fixed perturbation u and model θ, we update λ by taking dual ascent steps on the following equivalent objective,
max_{λ∈R^C_+} L̂(θ, λ, u) ≡ max_{λ∈R^C_+} Σ_{i=1}^C λi [ (1/ni) Σ_{n=1}^{ni} ℓ(fθ(xni), yni) − R̂(fθ) − ϵ − ui ].
If the server computes and communicates the average of the clients' losses R̂(fθ), the update to each multiplier λi can be made locally by each client as in (56). As stated before, these update steps can be applied in an alternating manner to find a solution of the empirical dual problem, as described in Algorithm 2.

Algorithm 2 Resilient Federated Learning
Initialize θ(0), λ(0), u(0), and 0 < η ≤ 1.
for t = 1, . . . , T:
  Compute weights: for all i ∈ [C], wi = 1 + λi − λ̄, with λ̄ = (1/C) Σ_{i=1}^C λi;
  Theta update: θ(t) ← ClientUpdate({wi}_{i=1}^C, θ(t−1));
  Perturbation update: ui(t) = [ui(t−1) − ηu dui(t−1)]+;
  Dual update: λi(t) = [ λi(t−1) + ηλ ( (1/Ni) Σ_{j=1}^{Ni} ℓ(fθ(xj), yj) − R̂(fθ) − ϵ − ui(t−1) ) ]+.

F.3 Experimental Setup

As in [18], we create heterogeneity between clients through class imbalance. Unless stated otherwise, we use three minority classes (class labels {0, 2, 4}) and simulate class imbalance by keeping only ρ = 10% of the training data belonging to the minority classes. We follow the implementation from [58], in which a Dirichlet prior (over the ten classes) is sampled independently for each client. One client at a time, we sample without replacement according to each client's prior. Once a class runs out of samples, the subsequent clients do not own samples of that class. We also create a test set for each client with the same class balance as its train set. Test sets are sampled without replacement within each client (but with replacement among different clients, i.e., test sets overlap across clients). We use the same small neural network architecture as in [18, 58], namely a CNN consisting of 2 convolutional layers with 64 5×5 filters followed by 2 fully connected layers with 384 and 192 neurons. For both the constrained and resilient formulations, we use FedAvg [56] to update θ, using 5 communication rounds for each primal update, as in [18]. In all experiments we use 100 clients, and all clients participate in each communication round. We run 500 communication rounds in total. We set ϵ = 0.1, dual learning rate ηλ = 0.1, local learning rate ηθ = 5·10−2, and use no data augmentation in any experiment. In the resilient learning formulation, unless stated otherwise, we use a quadratic penalty on the perturbation, h(u) = ‖u‖²₂, and a perturbation learning rate ηu = 0.1. All of the plots in Section 5 correspond to CIFAR10 [59]. Results for Fashion-MNIST [49] can be found in Section F.4.4.
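Since this partitioning scheme is central to the heterogeneous setup, the following is a minimal sketch of it, assuming CIFAR-10-style integer labels; the names (n_clients, dirichlet_alpha, minority_classes, rho) are hypothetical and the code only illustrates the sampling procedure described above, not the exact implementation of [58].

```python
# Sketch of the class-imbalanced Dirichlet client partition described in F.3.
# Assumes `labels` is a NumPy integer array with classes 0..9; names are illustrative.
import numpy as np

def partition_clients(labels, n_clients=100, dirichlet_alpha=0.3,
                      minority_classes=(0, 2, 4), rho=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n_classes = labels.max() + 1

    # Keep only a fraction rho of the samples from each minority class.
    keep = np.ones(len(labels), dtype=bool)
    for c in minority_classes:
        idx = np.flatnonzero(labels == c)
        drop = rng.choice(idx, size=int((1 - rho) * len(idx)), replace=False)
        keep[drop] = False
    pools = [list(rng.permutation(np.flatnonzero(keep & (labels == c))))
             for c in range(n_classes)]

    # One client at a time, sample without replacement according to a
    # client-specific Dirichlet prior over classes; exhausted classes are skipped.
    n_kept = sum(len(p) for p in pools)
    per_client = n_kept // n_clients
    clients = []
    for _ in range(n_clients):
        prior = rng.dirichlet(dirichlet_alpha * np.ones(n_classes))
        samples = []
        while len(samples) < per_client and any(pools):
            avail = [c for c in range(n_classes) if pools[c]]
            p = prior[avail] / prior[avail].sum()
            c = rng.choice(avail, p=p)
            samples.append(pools[c].pop())
        clients.append(np.array(samples))
    return clients
```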
F.4 Additional Results

F.4.1 Performance–relaxation trade-off. As we have already shown, the choice of the relaxation cost function controls the trade-off between relaxing requirements and performance. Figure 4 shows an ablation on the coefficient α of the quadratic relaxation cost h(u) = α‖u‖²₂: lowering the relaxation cost not only results in larger relaxations, but these larger relaxations effectively lead to clients with higher losses.

Figure 4: Train loss for the worst client (constraint) and averaged over all clients (objective) for different values of the perturbation cost coefficient α.

F.4.2 Sensitivity to Problem Specification. The resilient approach is less sensitive to the specification of the constraints. Dual variables indicate the sensitivity of the objective with respect to constraint perturbations. As shown in Figure 5, the resilient approach yields smaller dual variables, irrespective of the tolerance ϵ in the constraint specification. In this context, setting the constraint levels a priori requires knowledge about the heterogeneity of the client local distributions, which in our setup depends on the imbalance of the dataset. As shown in Figure 6, the resilient approach also yields smaller dual variables irrespective of the percentage of minority-class samples that are kept, whereas the dual variables grow in more imbalanced, and thus harder to satisfy, scenarios.

Figure 5: Dual variables after training with respect to the constraint specification ϵ.

Figure 6: Dual variables at the end of training with respect to the percentage of minority samples ρ kept in the dataset. Smaller ρ means more imbalance, and thus results in harder to satisfy requirements.

F.4.3 Constraint violation across setups. Relaxing stringent requirements not only makes the empirical problem easier to solve, but it can also lead to a better empirical approximation of the underlying statistical problem. That is, overly stringent requirements can result not only in large dual variables, but can also harm generalization. Figure 7 shows that more stringent constraint specifications can lead to poor constraint satisfaction on the test set. This improvement in constraint satisfaction for the resilient approach is associated with smaller constraint generalization gaps, as shown in Figure 8.

F.4.4 Test Accuracy. In this section we report quantitative performance metrics. We include a constrained baseline [18] and another method for federated learning under class imbalance [60] for comparison. We first compare approaches in terms of the objective, by averaging the test accuracy over all clients. As shown in Table 3, the average accuracy of the resilient approach is overall similar to the constrained approach and the baseline [60]. In order to assess how the distribution of performance among clients varies, we also report spread metrics for the test accuracy across clients. As shown in Table 3, resilient learning has (generally) less spread in the interquartile range and higher maximum spread than its constrained counterpart.

Dataset | Imb. Ratio | Mean ([60] / [18] / Ours) | IQR ([60] / [18] / Ours) | Range ([60] / [18] / Ours)
F-MNIST | 10 | 92.5 / 92.6 / 93.4 | 2.2 / 3.5 / 2.4 | 64.1 / 28.6 / 50.6
F-MNIST | 20 | 94.0 / 93.8 / 94.4 | 2.2 / 2.7 / 1.7 | 50.2 / 28.6 / 45.8
CIFAR10 | 10 | 81.3 / 81.5 / 81.5 | 8.1 / 8.7 / 8.7 | 49.4 / 35.8 / 46.4
CIFAR10 | 20 | 83.4 / 82.4 / 82.6 | 8.6 / 8.4 / 7.9 | 44.5 / 32.0 / 41.1
Table 3: Client accuracy spread metrics for different setups. IQR denotes the interquartile range and Range denotes the maximum minus the minimum accuracy, both computed across 100 clients.

F.4.5 Comparing client performances. We sort clients according to their test accuracy, Acc[1] ≤ Acc[2] ≤ . . . ≤ Acc[C], and plot the difference in accuracy between equally ranked clients of the resilient approach and the constrained baseline [18], Acc_res[c] − Acc_const[c]. As shown in Figures 9 and 10, in all setups the majority of clients achieve a small increase in performance, while a small fraction experiences a larger decrease. For higher imbalance ratios, this effect is more pronounced.

Figure 7: Constraint violation (evaluated on the test set) for resilient and constrained learning using different (left) constraint specifications ϵ and (right) fractions of minority-class samples.

Figure 8: Largest constraint generalization error across clients for different constraint specifications ϵ.

Figure 9: Test accuracy for equally ranked clients for the resilient approach and the constrained baseline [18] on CIFAR10 (imbalance ratios 10 and 20).

Figure 10: Test accuracy for equally ranked clients for the resilient approach and the constrained baseline [18] on Fashion-MNIST (imbalance ratios 10 and 20).
F.5 Ablation on the perturbation cost function h.

In all experiments so far, we have used a quadratic perturbation cost function h(u) = α‖u‖²₂. In order to assess the impact of the choice of cost function, we run our algorithm using ‖u‖β with β = 1, 2, 4, ∞ as the cost, on Fashion-MNIST in both the federated and invariant settings. As shown in Table 4, both the mean and spread statistics are similar for β = 1, 2, 4, i.e., our approach is not overly sensitive to the choice of cost function in this experimental setup. However, β = ∞ does show a substantially different behavior. Penalizing only the largest perturbation results in a smaller range while attaining a higher IQR and lower mean accuracy. That is, it narrows the gap to the worst-performing client, but the performance of most clients deteriorates, both in terms of average and spread. The infinity norm is related to Rawlsian (i.e., minimax) formulations, which have been proposed in the context of fairness; see, e.g., [14].

β | Mean Acc. | IQR | Max Range
1.0 | 93.3 | 2.5 | 49.5
2.0 | 93.4 | 2.4 | 50.6
4.0 | 93.4 | 2.6 | 45.6
∞ | 92.7 | 3.8 | 28.1
Table 4: Cost function ablation for heterogeneous federated learning on Fashion-MNIST using 100 clients, 3 minority classes, an imbalance ratio of ρ = 10 and Dirichlet allocation with parameter d = 0.3. We report the mean, interquartile range and range (maximum minus minimum value) of the test accuracy across clients.

G Ablation on dual and resilient learning rates

We also perform an ablation on ηu and ηλ, the perturbation and dual learning rates, over a small grid of 12 values (Table 5). In this particular setup, we find that the performance of the algorithm is not overly sensitive to this choice. We also observe that the rates used in the paper (ηu = 0.1, ηλ = 2) are not optimal in this case, and thus further improvements in performance could be obtained through more extensive hyperparameter tuning. However, the main aim of our numerical experiments is to validate empirically the properties of our approach.

ηu \ ηλ | 0.1 | 0.5 | 1 | 2
0.1 | 81.4 | 81.6 | 81.8 | 81.8
0.5 | 81.4 | 80.9 | 81.4 | 81.1
1 | 81.6 | 81.6 | 81.6 | 81.7
Table 5: Dual and resilient learning rate ablation in the heterogeneous federated learning setting. We report the mean test accuracy for CIFAR100 using 100 clients, 3 minority classes, an imbalance ratio of ρ = 10 and Dirichlet allocation with parameter d = 0.3.

H Invariance Constrained Learning Experiments

H.1 Experimental setup. We showcase our approach on datasets with artificial invariances, following the setup of [23]. Explicitly, we generate synthetic datasets by applying either rotations, translations, or scalings to each sample in the MNIST [48] and Fashion-MNIST [49] datasets. The transformations are sampled from uniform distributions over the ranges detailed in Table 6. We use the same MLP and CNN architectures and hyperparameters as [23], except that we use only 6 augmentations per sample instead of 31 during training. For each transformation set (rotations, translations and scalings), we constrain the expected loss over samples augmented with transformations sampled from uniform distributions over the ranges detailed in Table 7. Note that there is a mismatch between the distribution used to generate the data and that used in the constraints. That is, except for the fully rotated dataset, the constraint ranges are larger than the true transformation ranges used to construct the synthetic datasets (Table 6). The purpose of these experiments is to showcase that the resilient approach can relax constraints associated with transformation sets that do not describe symmetries or invariances of the data and can thus hinder performance. We use the same transformation sets and constraint specification (ϵ) for all synthetic datasets.

Synthetic invariance | Parameter | Distribution
Full Rotation | Angle in radians | U[−π, π]
Partial Rotation | Angle in radians | U[−π/2, π/2]
Translation | Translation in pixels | U[−8, 8]²
Scale | Exponential scaling factor | U[−log(2), log(2)]
Table 6: Sampling parameters for the transformations used to obtain the synthetically invariant datasets, from [23].

Constraint Set | Parameter | Range
Rotations | Angle in radians | [−π, π]
Translation | Translation in pixels | [−16, 16]²
Scale | Exponential scaling factor | [−1.5, 1.5]
Table 7: Transformation sets Gi used as invariance constraints. All sets are used simultaneously, with the same constraint level ϵi = 0.1 for all datasets.
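To illustrate how these constraint losses can be estimated during training, the sketch below evaluates the rotation constraint by Monte Carlo sampling over the set in Table 7 (translations and scalings are handled analogously). It assumes a torchvision-style rotation helper and a cross-entropy objective; the names and defaults are illustrative rather than taken from the paper's code.

```python
# Sketch of estimating one invariance constraint (rotations) by Monte Carlo
# sampling over the transformation set of Table 7. Assumes torchvision >= 0.9;
# function and variable names are illustrative, not the paper's implementation.
import math
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF

def rotation_constraint_loss(model, x, y, n_samples=6, max_angle=math.pi):
    """Average cross-entropy over inputs rotated by angles ~ U[-max_angle, max_angle]."""
    losses = []
    for _ in range(n_samples):
        angle = (2 * torch.rand(1).item() - 1) * max_angle        # radians
        x_aug = TF.rotate(x, angle * 180.0 / math.pi)             # rotate expects degrees
        losses.append(F.cross_entropy(model(x_aug), y))
    return torch.stack(losses).mean()

# During training, the slack fed to the dual and perturbation updates would be
# rotation_constraint_loss(model, x, y) - (eps_i + u_i), with eps_i = 0.1 as above.
```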
H.2 Results

H.2.1 Ablation on the perturbation cost. As in the case of federated learning (presented in Section 4), we perform an ablation on the cost function h using ‖u‖β with β = 1, 2, 4, ∞. As shown in Table 8, β = 1 shows slightly better performance in terms of average test accuracy across all invariant settings. As discussed in Section 3.3, β = 1 recovers penalty-based methods, i.e., it is equivalent to setting a fixed penalization coefficient. Nonetheless, the dynamics of our algorithm are different and can thus lead to different solutions.

β | Partially Rotated | Translated | Scaled
1.0 | 86.08 ± 0.38 | 86.85 ± 0.20 | 85.02 ± 0.46
2.0 | 85.37 ± 0.17 | 86.65 ± 0.25 | 84.92 ± 0.39
4.0 | 85.23 ± 0.20 | 86.64 ± 0.11 | 84.65 ± 0.68
∞ | 82.94 ± 0.17 | 85.47 ± 0.19 | 83.35 ± 0.49
Table 8: Cost function ablation for the synthetically invariant Fashion-MNIST datasets. We report the mean and standard deviation of the test accuracy across three independent runs.

H.2.2 Performance on F-MNIST. The resilient approach is able to handle the misspecification of invariance requirements, outperforming both the constrained and unconstrained approaches in terms of test accuracy. In addition, though it was not designed for that purpose, our approach shows performance similar to that of the invariance learning method Augerino [22].
Dataset | Method | Rotated (180) | Rotated (90) | Translated | Scaled | Original
F-MNIST | Augerino | 85.28 ± 0.54 | 81.48 ± 0.49 | 81.13 ± 0.77 | 83.17 ± 0.46 | 90.09 ± 0.20
F-MNIST | Unconstrained | 77.94 ± 0.06 | 81.57 ± 0.36 | 79.23 ± 0.17 | 82.99 ± 0.18 | 90.20 ± 0.23
F-MNIST | Constrained | 84.96 ± 0.12 | 85.66 ± 0.32 | 83.61 ± 0.10 | 86.49 ± 0.09 | 91.02 ± 0.02
F-MNIST | Resilient | 85.57 ± 0.26 | 86.48 ± 0.15 | 85.06 ± 0.23 | 87.26 ± 0.14 | 91.55 ± 0.31
Table 9: Classification accuracy for synthetically invariant F-MNIST. We use the same invariance constraint level ϵi = 0.1 for all datasets and transformations. We include the invariance learning method Augerino [22] as a baseline.