# Responsible AI (RAI) Games and Ensembles

Yash Gupta (Carnegie Mellon University, yashgup2@cs.cmu.edu), Runtian Zhai (Carnegie Mellon University, rzhai@cs.cmu.edu), Arun Suggala (Google Research, arunss@google.com), Pradeep Ravikumar (Carnegie Mellon University, pradeepr@cs.cmu.edu)

Several recent works have studied the societal effects of AI; these include issues such as fairness, robustness, and safety. In many of these objectives, a learner seeks to minimize its worst-case loss over a set of predefined distributions (known as uncertainty sets), with usual examples being perturbed versions of the empirical distribution. In other words, the aforementioned problems can be written as min-max problems over these uncertainty sets. In this work, we provide a general framework for studying these problems, which we refer to as Responsible AI (RAI) games. We provide two classes of algorithms for solving these games: (a) game-play based algorithms, and (b) greedy stagewise estimation algorithms. The former class is motivated by online learning and game theory, whereas the latter class is motivated by the classical statistical literature on boosting and regression. We empirically demonstrate the applicability and competitive performance of our techniques for solving several RAI problems, particularly around subpopulation shift.

The relevant code for this work can be found at https://github.com/yashgup2-7/rai-games. 37th Conference on Neural Information Processing Systems (NeurIPS 2023).

## 1 Introduction

In recent years, AI is increasingly being used in high-stakes decision-making contexts such as hiring, criminal justice, and healthcare. Given the impact these decisions can have on people's lives, it is important to ensure these AI systems have beneficial social effects. An emerging line of work has attempted to formalize such desiderata ranging over ethics, fairness, train-time robustness, test-time or adversarial robustness, and safety, among others. Each of these forms rich sub-fields with disparate desiderata, which are sometimes collated under the umbrella of "responsible AI". Many organizations are increasingly advocating the use of responsible AI models [Microsoft, 2021, Google, 2020]. But how do we do so when the majority of recent work around these problems is fragmented and usually focuses on optimizing one of these aspects at a time (DRO [Namkoong and Duchi, 2017, Duchi and Namkoong, 2018], GDRO [Sagawa et al., 2019], CVaR [Zhai et al., 2021a], Distribution Shift [Hashimoto et al., 2018, Zhai et al., 2021b])? Indeed, optimizing for just one of these aspects has even been shown to exhibit adverse effects on the other aspects [Roh et al., 2020]. To address this, we study a general framework that is broadly applicable across many of the settings above, and which we refer to as Responsible AI (RAI) games. Our starting point is the recent understanding of a unifying theme in many of these disparate problems: a learner seeks to minimize its worst-case loss over a set of predefined distributions. For example, in fairness, we seek to perform well on all sub-groups in the data. In robustness, we aim to design models that are robust to perturbations of the training data or the test distribution. This allows us to set up a zero-sum game between a learner that aims to learn a responsible model and an adversary that aims to prevent the learner from doing so.
In the general RAI game setting, this is a computationally intractable game that need not even have a Nash equilibrium. To address this computational issue, we study a relaxation of the single-predictor RAI game, which we term the ensemble RAI game, and which can also be motivated as a linearization of the original RAI game. We note that our framework encompasses not only the responsible AI settings but also the setting of classical boosting. Drawing upon insights from boosting, we provide boosting-based algorithms for solving responsible AI games. We provide convergence guarantees for our algorithms by relying on the connections between boosting, online convex optimization, and two-player game play [Arora et al., 2012, McMahan, 2011, Bubeck, 2011]. We also conduct empirical analyses to demonstrate the convergence and utility of our proposed algorithms. Interestingly, the algorithms allow for plug-and-play convenience, with changes in the RAI settings requiring only simple changes to the algorithms. More importantly, we could consider intersections of different responsible AI considerations, which in turn can simply be incorporated into our algorithms. Finally, we also study the population risks of our algorithms in certain important settings. We show a surprising result that for the case of binary classification with the 0/1 loss, the optimal predictor for a large class of RAI games is the same as the Bayes optimal predictor, thus generalizing an emerging line of results demonstrating this for certain specific games [Hu et al., 2018]. Under such settings, solving the RAI game could nonetheless be helpful in finite sample settings (as also demonstrated in our experiments), since the RAI game serves to encode desiderata satisfied by the Bayes optimal classifier.

## 2 Problem Setup and Background

We consider the standard supervised prediction setting, with input random variable $X \in \mathcal{X} \subseteq \mathbb{R}^d$, output random variable $Y \in \mathcal{Y}$, and samples $S = \{(x_i, y_i)\}_{i=1}^n$ drawn from a distribution $P_{\mathrm{data}}$ over $\mathcal{X} \times \mathcal{Y}$. Let $\hat{P}_{\mathrm{data}}$ denote the empirical distribution over the samples. We also have a set $\mathcal{H}$ of hypothesis functions $h : \mathcal{X} \mapsto \mathcal{Y}$ from which we wish to learn the best predictor. We evaluate the goodness of a predictor via a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \mapsto \mathbb{R}$, which yields the empirical risk: $\hat{R}(h) = \mathbb{E}_{\hat{P}_{\mathrm{data}}}[\ell(h(x), y)]$, where $\mathbb{E}_{\hat{P}_{\mathrm{data}}}[f(x, y)] = \frac{1}{n}\sum_{i=1}^n f(x_i, y_i)$. Apart from having low expected risk, most settings require $h$ to have certain properties, for example, robustness to distribution shift, fairness w.r.t. subpopulations, superior tail performance, resistance to adversarial attacks, robustness in the presence of outliers, etc. We cast all these subproblems under the umbrella term "Responsible AI". Each of these properties has been studied extensively in recent works, albeit individually. In this work, we attempt to provide a general framework to study these problems.

### 2.1 Related Work

We draw our unified framework from seminal works over the past decade by responsible AI researchers on devising non-expected risk objectives, particularly min-max problems, to ensure ML models are responsible. These have resulted in a multitude of different objectives (even for a single responsible AI desideratum such as fairness), and also multiple different sub-communities (so that fairness and multiple disparate robustness communities are relatively fractured), many (if not all) of which we combine within a single umbrella.
There is emerging work on relating worst-case performance to invariance [Bühlmann, 2018]; in other words, we might be able to obtain approximate group invariance by minimizing an appropriately constructed worst-group risk, and vice versa.

**RAI aspects as constraints.** Many prior works have enforced robustness as a constrained optimization [Shafieezadeh-Abadeh et al., 2015, Gao and Kleywegt, 2022, Namkoong and Duchi, 2016, Ben-Tal et al., 2011]. There have also been a few prior works enforcing fairness constraints [Mandal et al., 2020]. To the best of our knowledge, there exists minimal prior work focusing on multiple desiderata at once in this regard.

**Multi-objective optimization.** Several works have considered a multi-objective view of ensuring fairness in classifiers [Martinez et al., 2020, Oneto et al., 2018]. If used for multiple RAI objectives, there is usually overhead in choosing a model that achieves a good trade-off between the various losses. Also, it is difficult to guarantee that the solution is robust to any of the involved aspects. Our framework guarantees a certain level of performance on each of the RAI aspects under consideration.

**Distribution shift.** [Koh et al., 2021] classifies distribution shift problems into two categories: domain generalization and subpopulation shift. In this work, we focus on the subpopulation shift problem, where the target distribution is absolutely continuous with respect to the source distribution. It has two main applications: fairness [Hashimoto et al., 2018, Hu et al., 2018, Sagawa et al., 2019, Zhai et al., 2021b] and long-tail learning (i.e., learning on class-imbalanced datasets) [Cao et al., 2019, Menon et al., 2021, Kini et al., 2021].

**Distributionally Robust Optimization (DRO).** In DRO, one aims to learn classifiers that are robust to deviations of the data distribution. DRO has been studied under various uncertainty sets, including f-divergence based uncertainty sets [Namkoong and Duchi, 2017, Duchi and Namkoong, 2018, Sagawa et al., 2019], Wasserstein uncertainty sets [Sinha et al., 2017, Gao et al., 2022], Maximum Mean Discrepancy uncertainty sets [Staib and Jegelka, 2019], and more general uncertainty sets in RKHS spaces [Zhu et al., 2020]. [Li et al., 2021a] evaluate model performance under worst-case subpopulations. Owing to its importance, several recent works have provided efficient algorithms for solving the DRO objective [Namkoong and Duchi, 2016, Qi et al., 2020, Kumar et al., 2023, Jin et al., 2021]. However, many of these techniques are specific to particular perturbation sets and are not directly applicable to the more general framework we consider in our work. Furthermore, in our work, we aim to learn an ensemble of models instead of a single model.

**Boosting.** Classical boosting aims to improve the performance of a weak learner by combining multiple weak classifiers to produce a strong classifier [Breiman, 1999, Friedman et al., 2000, Friedman, 2001, Freund and Schapire, 1995, Freund et al., 1996, Mason et al., 2000]. Over the years, a number of practical algorithms have been introduced, such as AdaBoost [Schapire, 1999], LPBoost [Demiriz et al., 2002], gradient boosting [Mason et al., 1999], XGBoost [Chen and Guestrin, 2016], boosting for adversarial robustness [Zhang et al., 2022, Meunier et al., 2021, Balcan et al., 2023], and holistic robustness [Bennouna and Parys, 2022]. The algorithms we develop for RAI games are inspired by these algorithms.

**Fairness.**
There are a number of notions of algorithmic fairness, ranging from individual fairness [Dwork et al., 2012, Zemel et al., 2013], group fairness [Hardt et al., 2016a, Zafar et al., 2017], counterfactual fairness [Kusner et al., 2017], and Rawlsian max-min fairness [Rawls, 2020, Hashimoto et al., 2018], to others [Barocas et al., 2017, Chouldechova and Roth, 2018, Mehrabi et al., 2021]. Our framework includes the popular notion of minimax group fairness. It doesn't capture other notions of group fairness such as Demographic Parity, Equality of Odds, and Equality of Opportunity.

**Population RAI risks.** Several recent works have studied properties of the population risks arising in various responsible AI scenarios. Hu et al. [2018] showed that the minimizer of the population DRO risk (under general f-divergences) is the classical Bayes optimal classifier. Li et al. [2021b], Duchi and Namkoong [2018], and Sinha et al. [2017] provided generalization guarantees for the DRO risk under various divergence families, ranging from f-divergences to Wasserstein perturbations.

## 3 RAI Games

In many cases, we do not wish to compute an unweighted average over training samples, due to reasons of noise, tail risk, robustness, and fairness, among many other responsible AI considerations.

**Definition 1 (RAI Risks)** Given a set of samples $\{(x_i, y_i)\}_{i=1}^n$, we define the class of empirical RAI risks (for Responsible AI risks) as:
$$\hat{R}_{\mathcal{W}_n}(h) = \sup_{w \in \mathcal{W}_n} \mathbb{E}_w[\ell(h(x), y)],$$
where $\mathcal{W}_n \subseteq \Delta_n$ is some set of sample weights (a.k.a. the uncertainty set), and $\mathbb{E}_w[f(x, y)] = \sum_{i=1}^n w_i f(x_i, y_i)$.

Various choices of $\mathcal{W}_n$ give rise to various RAI risks. Table 1 presents examples of RAI risks that are popular in ML. Interestingly, classical problems such as boosting are special cases of RAI risks. In this work, we rely on this connection to design boosting-inspired algorithms for minimizing RAI risks. More choices for $\mathcal{W}_n$ can be obtained by combining the ones specified in Table 1 using union, intersection, and convex-combination operations. For example, if one wants models that are fair to certain pre-specified groups, and at the same time achieve good tail risk, then one could choose $\mathcal{W}_n$ to be the intersection of the Group-DRO and $\alpha$-CVaR uncertainty sets.

Given the empirical RAI risk $\hat{R}_{\mathcal{W}_n}(h)$ of a hypothesis, and a set of hypotheses $\mathcal{H}$, we naturally wish to obtain the hypothesis that minimizes the empirical RAI risk: $\min_{h \in \mathcal{H}} \hat{R}_{\mathcal{W}_n}(h)$. This can be seen as solving a zero-sum game.

**Definition 2 (RAI Games)** Given a hypothesis set $\mathcal{H}$ and a RAI sample weight set $\mathcal{W}_n$, the class of RAI games is given as: $\min_{h \in \mathcal{H}} \max_{w \in \mathcal{W}_n} \mathbb{E}_w[\ell(h(x), y)]$.

We thus study RAI games for the special cases above and for an arbitrary constraint set $\mathcal{W}_n$.

| Name | $\mathcal{W}_n$ | Description |
|---|---|---|
| Empirical Risk Minimization | $\{\hat{P}_{\mathrm{data}}\}$ | object of focus in most of ML/AI |
| Worst-Case Margin | $\Delta_n$, the entire probability simplex | used for designing margin-boosting algorithms [Warmuth et al., 2006, Bartlett et al., 1998] |
| Soft Margin | $\{w : \mathrm{KL}(w \,\Vert\, \hat{P}_{\mathrm{data}}) \leq \rho_n\}$ | used in the design of AdaBoost [Freund and Schapire, 1995] |
| $\alpha$-Conditional Value at Risk (CVaR) | $\{w : w \in \Delta_n,\ w_i \leq \frac{1}{\alpha n}\ \forall i\}$ | used in fairness [Zhai et al., 2021a, Sagawa et al., 2019] |
| Distributionally Robust Optimization (DRO) | $\{w : D(w \,\Vert\, \hat{P}_{\mathrm{data}}) \leq \rho_n\}$ | various choices for $D$ have been studied, e.g., f-divergences [Duchi and Namkoong, 2018] |
| Group DRO | $\{\hat{P}_{\mathrm{data}}(G_1), \hat{P}_{\mathrm{data}}(G_2), \ldots, \hat{P}_{\mathrm{data}}(G_K)\}$, where $\hat{P}_{\mathrm{data}}(G_i)$ is the distribution of the $i$th group | used in group fairness and agnostic federated learning [Mohri et al., 2019] |

Table 1: Various ML/AI problems that fall under the umbrella of RAI risks.
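To make Definition 1 and Table 1 concrete, here is a minimal sketch (in Python with NumPy; the helper names are ours and not part of the paper's released code) that computes the empirical RAI risk of a fixed hypothesis under two of the uncertainty sets above, the $\alpha$-CVaR set and the Group-DRO set, by directly constructing the maximizing sample weights.

```python
import numpy as np

def cvar_weights(losses, alpha):
    """Maximizing weights in the alpha-CVaR set {w in simplex : w_i <= 1/(alpha*n)}:
    put the maximum allowed mass on the largest losses."""
    n = len(losses)
    k = int(np.floor(alpha * n))              # number of samples that get full weight
    w = np.zeros(n)
    order = np.argsort(losses)[::-1]          # indices sorted by decreasing loss
    w[order[:k]] = 1.0 / (alpha * n)
    if k < n:
        w[order[k]] = 1.0 - k / (alpha * n)   # leftover mass on the next-worst sample
    return w

def group_dro_weights(losses, groups):
    """Maximizing weights in the Group-DRO set: the empirical distribution of the
    group with the largest average loss."""
    worst = max(set(groups.tolist()), key=lambda g: losses[groups == g].mean())
    w = (groups == worst).astype(float)
    return w / w.sum()

def empirical_rai_risk(losses, weights):
    """sup_{w in W_n} sum_i w_i * loss_i, evaluated at the maximizing weights."""
    return float(weights @ losses)

# Toy usage: 0/1 losses of one hypothesis on 8 samples split into two groups.
losses = np.array([0., 1., 0., 1., 1., 1., 0., 1.])
groups = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(empirical_rai_risk(losses, cvar_weights(losses, alpha=0.25)))    # tail risk
print(empirical_rai_risk(losses, group_dro_weights(losses, groups)))   # worst-group risk
```

Other uncertainty sets from Table 1 only change the weight-construction step; the risk evaluation itself is always a weighted average of per-sample losses.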
## 4 Ensemble RAI Games

In this section, we begin our discussion about ensembles. In general, a statistical caveat with Definition 2 is that achieving good worst-case performance over the sample weight set $\mathcal{W}_n$ is harder than achieving good average performance, and for a simple set of hypotheses $\mathcal{H}$, there may not exist an $h \in \mathcal{H}$ that achieves such good worst-case performance. Thus it is natural to consider deterministic ensemble models over $\mathcal{H}$, which effectively give us more powerful hypothesis classes. Let us first define the RAI risk for such classifiers.

**Definition 3 (Deterministic Ensemble)** Consider the problem of classification, where $\mathcal{Y}$ is a discrete set. Given a hypothesis class $\mathcal{H}$, a deterministic ensemble is specified by some distribution $Q$ over $\mathcal{H}$, and is given by: $h_{\mathrm{det};Q}(x) = \arg\max_{y \in \mathcal{Y}} \mathbb{E}_{h \sim Q}\, \mathbb{I}[h(x) = y]$. Correspondingly, we can write the deterministic ensemble RAI risk as $\hat{R}_{\mathcal{W}_n}(h_{\mathrm{det};Q}) = \max_{w \in \mathcal{W}_n} \mathbb{E}_w[\ell(h_{\mathrm{det};Q}(x), y)]$.

We discuss alternative definitions of deterministic ensembles in the Appendix. This admits a class of deterministic RAI games:

**Definition 4 (Deterministic Ensemble RAI Games)** Given a hypothesis set $\mathcal{H}$ and a RAI sample weight set $\mathcal{W}_n$, the class of RAI games for deterministic ensembles over $\mathcal{H}$ is given as:
$$\min_{Q \in \Delta_{\mathcal{H}}} \max_{w \in \mathcal{W}_n} \mathbb{E}_w[\ell(h_{\mathrm{det};Q}(x), y)].$$

However, the aforementioned game is computationally less amenable because of the non-smooth nature of de-randomized predictions. Moreover, there are some broader challenges with the RAI games given by Definitions 2 and 4. Firstly, they need not have a Nash Equilibrium (NE), and in general, their min-max and max-min game values need not coincide. This poses challenges in solving the games efficiently. Next, in some cases, directly optimizing the worst-case performance might not even be useful. For instance, [Hu et al., 2016, Zhai et al., 2021a] show the pessimistic result that for classification tasks where models are evaluated by the zero-one loss, ERM achieves the lowest possible DRO loss defined by an f-divergence, or the lowest $\alpha$-CVaR loss, given that the model is deterministic. To this end, we consider the following randomized ensemble:

**Definition 5 (Randomized Ensemble)** Given a hypothesis class $\mathcal{H}$, a randomized ensemble is specified by some distribution $Q$ over $\mathcal{H}$, and is given by: $\mathbb{P}[h_{\mathrm{rand};Q}(x) = y] = \mathbb{E}_{h \sim Q}\, \mathbb{I}[h(x) = y]$. Similarly, we can define its corresponding randomized ensemble RAI risk: $\hat{R}_{\mathrm{rand};\mathcal{W}_n}(Q) = \max_{w \in \mathcal{W}_n} \mathbb{E}_{h \sim Q} \mathbb{E}_w[\ell(h(x), y)]$.

We can then also define the class of ensemble RAI games:

**Definition 6 (Randomized Ensemble RAI Games)** Given a hypothesis set $\mathcal{H}$ and a RAI sample weight set $\mathcal{W}_n$, the class of mixed RAI games is given as:
$$\min_{Q \in \Delta_{\mathcal{H}}} \max_{w \in \mathcal{W}_n} \mathbb{E}_{h \sim Q} \mathbb{E}_w[\ell(h(x), y)]. \quad (1)$$

This is a much better class of zero-sum games: it is linear in both the hypothesis distribution $Q$ and the sample weights $w$, and if the sample weight set $\mathcal{W}_n$ is convex, it is a convex-concave game. As shown below, under some mild conditions, this game has a Nash equilibrium which can be well approximated via efficient algorithms.

**Proposition 1** Let $\mathcal{H}$ be parameterized by $\theta \in \Theta \subseteq \mathbb{R}^p$, for a convex, compact set $\Theta$, and let $\mathcal{W}_n$ be a convex, compact set. Then
$$\min_{Q \in \Delta_{\mathcal{H}}} \max_{w \in \mathcal{W}_n} \mathbb{E}_{h \sim Q} \mathbb{E}_w[\ell(h(x), y)] = \max_{w \in \mathcal{W}_n} \min_{Q \in \Delta_{\mathcal{H}}} \mathbb{E}_{h \sim Q} \mathbb{E}_w[\ell(h(x), y)].$$

The proposition follows as a direct consequence of well-known minimax theorems (Appendix D.3).

### 4.1 Going from Deterministic to Randomized Ensembles

To begin, we point out that what we ultimately want is a deterministic ensemble rather than a randomized ensemble. In fact, it can be seen that the deterministic ensemble in Definition 3 is a specific derandomization of the randomized ensemble.
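As a small illustration of Definitions 3 and 5 (a sketch with our own helper names, not the paper's code), the following compares the weighted majority-vote prediction of a deterministic ensemble with the per-sample error probability of the corresponding randomized ensemble, and evaluates both under a given set of sample weights.

```python
import numpy as np

def det_ensemble_predict(preds, q, labels):
    """Weighted majority vote (Definition 3). preds: (m, n) labels from m hypotheses
    on n points, q: (m,) mixture weights, labels: list of possible classes."""
    votes = np.stack([q @ (preds == y) for y in labels])   # vote mass per class, per point
    return np.array(labels)[votes.argmax(axis=0)]

def rand_ensemble_losses(preds, q, y_true):
    """Randomized-ensemble 0/1 loss per sample (Definition 5): the probability that a
    hypothesis drawn from Q misclassifies the point."""
    return q @ (preds != y_true)

def rai_risk(losses, weights):
    """Weighted risk under a fixed set of (worst-case) sample weights."""
    return float(weights @ losses)

# Toy usage: 3 hypotheses, 4 points, binary labels in {-1, +1}.
preds  = np.array([[ 1, -1,  1,  1],
                   [ 1,  1, -1,  1],
                   [-1,  1,  1, -1]])
q      = np.array([0.5, 0.3, 0.2])
y_true = np.array([ 1,  1,  1, -1])
w      = np.full(4, 0.25)                                  # uniform weights for simplicity
det_losses  = (det_ensemble_predict(preds, q, [-1, 1]) != y_true).astype(float)
rand_losses = rand_ensemble_losses(preds, q, y_true)
print(rai_risk(det_losses, w), rai_risk(rand_losses, w))   # deterministic vs randomized risk
```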
It is such deterministic ensembles that we usually simply refer to as ensemble predictors. But the RAI risk of the ensemble predictor is NOT equal to the ensemble RAI risk minimized by our desired game in Equation (1) above for randomized ensembles. Thus, the ensemble RAI game might not in general capture the ideal deterministic ensemble. In this section, we study why and when solving for a randomized ensemble is meaningful.

**Binary classification.** Interestingly, for the specific case of binary classification, we can provide simple relationships between the risks of the randomized and deterministic ensembles.

**Proposition 2** Consider the setting with $\mathcal{Y} = \{-1, 1\}$, the zero-one loss $\ell$, and $\mathcal{W}_n = \Delta_n$. Then,
$$\hat{R}_{\mathcal{W}_n}(h_{\mathrm{det};Q}) = \mathbb{I}\left[\hat{R}_{\mathcal{W}_n}(h_{\mathrm{rand};Q}) \geq 1/2\right].$$

See Appendix E.2 for a simple proof. In this case, we can also relate the existence of a perfect deterministic ensemble ("boostability") to a weak learning condition on the set of hypotheses. Specifically, say that $\mathcal{H}$ is boostable iff there exists $Q \in \Delta_{\mathcal{H}}$ such that $\hat{R}_{\mathcal{W}_n}(h_{\mathrm{det};Q}) = 0$. From the above proposition, this is equivalent to requiring that $\hat{R}_{\mathcal{W}_n}(h_{\mathrm{rand};Q}) < 1/2$. We thus obtain:
$$\inf_{Q \in \Delta_{\mathcal{H}}} \sup_{w \in \mathcal{W}_n} \mathbb{E}_{w,Q}[\ell(h(x), y)] < 1/2 \iff \sup_{w \in \mathcal{W}_n} \inf_{h \in \mathcal{H}} \mathbb{E}_w[\ell(h(x), y)] < 1/2,$$
where the equivalence follows from the min-max theorem and the linearity of the objective in $Q$. The last statement says that for any sample weights $w \in \mathcal{W}_n$, there exists a hypothesis $h \in \mathcal{H}$ that has $w$-weighted loss less than $1/2$. We can state this as a weak-learning condition on individual hypotheses in $\mathcal{H}$. The above thus shows that for the specific case of $\mathcal{Y} = \{-1, 1\}$, the zero-one loss $\ell(y, y') = \mathbb{I}[y \neq y']$, and $\mathcal{W}_n = \Delta_n$, we can relate boostability of $\mathcal{H}$ to a weak learning condition on hypotheses within $\mathcal{H}$.

**General classification.** But in general, we do not have such simple connections between $\hat{R}_{\mathcal{W}_n}(h_{\mathrm{det};Q})$ and $\hat{R}_{\mathcal{W}_n}(h_{\mathrm{rand};Q})$. All we can guarantee is the following upper bound:

**Proposition 3** Let $\gamma_Q = 1 / \min_{i \in [n]} \max_{y \in \mathcal{Y}} \mathbb{P}_{h \sim Q}[h(x_i) = y]$. Then, $\hat{R}_{\mathcal{W}_n}(h_{\mathrm{det};Q}) \leq \gamma_Q\, \hat{R}_{\mathcal{W}_n}(h_{\mathrm{rand};Q})$.

See Appendix E.2 for a simple proof.

**Corollary 4** For binary classification, we have $\gamma_Q \leq 2$ and thus recover the well-known bound $\hat{R}_{\mathcal{W}_n}(h_{\mathrm{det};Q}) \leq 2\, \hat{R}_{\mathcal{W}_n}(h_{\mathrm{rand};Q})$.

**Remark 5** These bounds might be loose in practice. Specifically, for the binary case, if $\hat{R}_{\mathcal{W}_n}(h_{\mathrm{rand};Q}) < \frac{1}{2}$ then we have $\hat{R}_{\mathcal{W}_n}(h_{\mathrm{det};Q}) = 0$. To this end, prior works [Lacasse et al., 2006, Germain et al., 2015, Masegosa et al., 2020] have developed tighter bounds using second-order inequalities. We leave the analysis of such second-order RAI games to future work.

As such, we can cast minimizing the randomized RAI risk as minimizing an upper bound on the deterministic ensemble RAI risk. Thus, the corresponding randomized RAI game can be cast as a relaxation of the deterministic RAI game. In the sequel, we thus focus on this randomized ensemble RAI game, which we will then use to obtain a deterministic ensemble. Following the bounds above, the corresponding deterministic ensemble risk will be bounded by the randomized ensemble RAI risk.

## 5 Algorithms

In this section, we present two algorithms for solving the RAI game in Equation (1). Our first algorithm is motivated by online learning algorithms, and the second is motivated by greedy stagewise algorithms that have been popular for solving many statistical problems such as regression. For simplicity of presentation, we assume $\mathcal{H}$ is a finite set; however, our results in this section extend to uncountable sets.

### 5.1 Methods

**Game-play.** In game-play based algorithms, both the min and the max players are engaged in a repeated game against each other.
Both players rely on no-regret algorithms to decide their next action. It is well known that such a procedure converges to a mixed NE of the game [Cesa-Bianchi and Lugosi, 2006]. In this work, we follow a similar strategy to solve the game in Equation (1) (see Algorithm 1 for the pseudocode). In the $t$th round of our algorithm, the following distribution $w_t \in \mathcal{W}_n$ is computed over the training data points:
$$w_t \leftarrow \arg\max_{w \in \mathcal{W}_n} \sum_{s=1}^{t-1} \mathbb{E}_w[\ell(h_s(x), y)] + \eta_{t-1}\,\mathrm{Reg}(w). \quad (2)$$
This update is called the Follow-The-Regularized-Leader (FTRL) update. Here, $\mathrm{Reg}(\cdot)$ is a strongly concave regularizer and $\eta_{t-1}$ is the regularization strength. One popular choice for $\mathrm{Reg}(\cdot)$ is the entropy regularizer $\mathrm{Reg}(w) = -\sum_i w_i \log w_i$. This regularizer is also used by AdaBoost, which is a popular boosting algorithm. In Appendix F.2, we provide analytical expressions for $w_t$ for various choices of $\mathcal{W}_n$ and $\mathrm{Reg}(\cdot)$. We note that the regularizer in the FTRL update ensures the stability of the updates; i.e., it ensures consecutive iterates do not vary too much. This stability is naturally guaranteed when $\mathcal{W}_n$ is a strongly convex set (an example of a strongly convex set is a level set of a strongly convex function; see the Appendix for a formal definition and more details). Consequently, the regularization strength $\eta_{t-1}$ can be set to $0$ in this case, and the algorithm still converges to a NE [Huang et al., 2017]. Once we have $w_t$, a new classifier $h_t$ is computed to minimize the weighted loss relative to $w_t$, and added to the ensemble. This update is called the Best Response (BR) update. Learning $h_t$ in this way helps us fix past classifiers' mistakes, eventually leading to an ensemble with good performance.

**Algorithm 1** Game-play algorithm for solving Equation (1)
Input: Training data $\{(x_i, y_i)\}_{i=1}^n$, loss function $\ell$, constraint set $\mathcal{W}_n$, hypothesis set $\mathcal{H}$, strongly concave regularizer $\mathrm{Reg}$ over $\mathcal{W}_n$, learning rates $\{\eta_t\}_{t=1}^T$
1: for $t = 1$ to $T$ do
2:   FTRL: $w_t \leftarrow \arg\max_{w \in \mathcal{W}_n} \sum_{s=1}^{t-1} \mathbb{E}_w[\ell(h_s(x), y)] + \eta_{t-1}\,\mathrm{Reg}(w)$
3:   BR: $h_t \leftarrow \arg\min_{h \in \mathcal{H}} \mathbb{E}_{w_t}[\ell(h(x), y)]$
4: end for
5: return $\bar{w}_T = \frac{1}{T}\sum_{t=1}^T w_t$, $Q_T = \mathrm{Unif}\{h_1, \ldots, h_T\}$
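To illustrate how the FTRL and BR updates interact, here is a minimal Python sketch of Algorithm 1 for the special case where $\mathcal{W}_n$ is the full simplex and $\mathrm{Reg}$ is the entropy regularizer, in which case the FTRL step has a closed-form softmax solution. The callables `fit_weak_learner` and `per_sample_loss` are placeholders for a weighted base-learner trainer and a per-sample loss evaluator; they are our assumptions and not part of the paper's released code.

```python
import numpy as np

def ftrl_weights(cum_losses, eta):
    """FTRL over the simplex with entropy regularization:
    argmax_w <cum_losses, w> - eta * sum_i w_i log w_i  =  softmax(cum_losses / eta)."""
    z = cum_losses / eta
    z = z - z.max()                 # subtract max for numerical stability
    w = np.exp(z)
    return w / w.sum()

def game_play(X, y, fit_weak_learner, per_sample_loss, T=5, eta=1.0):
    """Sketch of Algorithm 1: alternate the FTRL update (max player over sample
    weights) and the best-response update (min player over hypotheses)."""
    cum_losses = np.zeros(len(y))
    ensemble = []
    for _ in range(T):
        w_t = ftrl_weights(cum_losses, eta)               # Step 2: FTRL weights
        h_t = fit_weak_learner(X, y, sample_weight=w_t)   # Step 3: (approximate) best response
        cum_losses += per_sample_loss(h_t, X, y)          # accumulate per-sample losses
        ensemble.append(h_t)
    return ensemble                                       # Q_T = Unif{h_1, ..., h_T}
```

Because the max player up-weights samples with high cumulative loss, each best response is forced to focus on the points the current ensemble handles worst, which is exactly the AdaBoost-style behavior the text describes.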
**Greedy.** We now take an optimization-theoretic viewpoint to design algorithms for Equation (1). Let $L(Q)$ denote the inner maximization problem of (1): $L(Q) := \max_{w \in \mathcal{W}_n} \mathbb{E}_{h \sim Q} \mathbb{E}_w[\ell(h(x), y)]$. When $L(Q)$ is smooth (this is the case when $\mathcal{W}_n$ is a strongly convex set), one could use Frank-Wolfe (FW) to minimize it. The updates of this algorithm are given by $Q_t \leftarrow (1 - \alpha_t) Q_{t-1} + \alpha_t G$, where $G = \arg\min_{Q'} \langle Q', \nabla_Q L(Q_{t-1}) \rangle$. Here, $\nabla_Q L(Q_{t-1})$ is evaluated at the maximizer $w^*_{t-1} = \arg\max_{w \in \mathcal{W}_n} \mathbb{E}_{h \sim Q_{t-1}} \mathbb{E}_w[\ell(h(x), y)]$: its coordinate corresponding to $h$ is $\mathbb{E}_{w^*_{t-1}}[\ell(h(x), y)]$. This algorithm is known to converge to a minimizer of $L(Q)$ at an $O(1/t)$ rate [Jaggi, 2013]. When $L(Q)$ is non-smooth, we first need to smooth the objective before performing FW. In this work we perform Moreau smoothing [Parikh et al., 2014], which is given by
$$L_\eta(Q) = \max_{w \in \mathcal{W}_n} \mathbb{E}_{h \sim Q} \mathbb{E}_w[\ell(h(x), y)] + \eta\,\mathrm{Reg}(w). \quad (3)$$
Here $\mathrm{Reg}(\cdot)$ is a strongly concave regularizer. If $\mathrm{Reg}(\cdot)$ is 1-strongly concave, it is well known that $L_\eta(Q)$ is $O(1/\eta)$ smooth. Once we have the smoothed objective, we perform FW to find its optimizer (see Algorithm 2 for the pseudocode).

**Relaxing the simplex constraint.** We now derive a slightly different algorithm by relaxing the simplex constraint on $Q$. Using Lagrangian duality, we can rewrite $\min_{Q \in \Delta_{\mathcal{H}}} L_\eta(Q)$ as the following problem for some $\lambda \in \mathbb{R}$:
$$\min_{Q \geq 0} L_\eta(Q) + \lambda \sum_{h \in \mathcal{H}} Q(h).$$
One interesting observation is that when $\mathcal{W}_n$ is the entire simplex and $\lambda = 1/2$, we recover the AdaBoost algorithm. Given the practical success of AdaBoost, we extend it to general $\mathcal{W}_n$. In particular, we set $\lambda = 1/2$ and solve the resulting objective using greedy coordinate descent. The updates of this algorithm are given in Algorithm 2.

**Remark 6** Algorithm 2 takes the step sizes $\{\alpha_t\}_{t=1}^T$ as input. In practice, one could use line search to find the optimal step sizes, for better performance.

**Algorithm 2** Greedy algorithms for solving Equation (1)
Input: Training data $\{(x_i, y_i)\}_{i=1}^n$, loss function $\ell$, constraint set $\mathcal{W}_n$, hypothesis set $\mathcal{H}$, strongly concave regularizer $\mathrm{Reg}$ over $\mathcal{W}_n$, regularization strength $\eta$, step sizes $\{\alpha_t\}_{t=1}^T$
1: for $t = 1$ to $T$ do
2:   $G_t = \arg\min_{Q'} \langle Q', \nabla_Q L_\eta(Q_{t-1}) \rangle$
3:   FW: $Q_t \leftarrow (1 - \alpha_t) Q_{t-1} + \alpha_t G_t$  /  Gen-AdaBoost: $Q_t \leftarrow Q_{t-1} + \alpha_t G_t$
4: end for
5: return $Q_T$

We provide convergence rates for the algorithms below.

**Proposition 7 (Convergence Rates)** Let $\ell(h(x), y) \in [0, 1]$ for all $h \in \mathcal{H}$ and $(x, y)$, and let $\mathrm{Reg} : \Delta_n \to \mathbb{R}$ be a 1-strongly concave function w.r.t. a norm $\|\cdot\|$. Let $Q_T$ be the output returned from running Algorithm 1 or 2 for $T$ iterations, and let $D_R$ be a constant such that $D_R^2 = \max_{x, y \in \mathcal{W}_n} |\mathrm{Reg}(x) - \mathrm{Reg}(y)|$.
1. (Game-play) If $\eta_t = \eta$, then $Q_T$ satisfies $L(Q_T) \leq \min_Q L(Q) + \frac{\eta D_R^2}{T} + O\!\left(\frac{1}{\eta}\right)$.
2. (Greedy) If line search is performed for $\alpha_t$, then $Q_T$ (FW or the Gen-AdaBoost update) satisfies $L(Q_T) \leq \min_Q L(Q) + \eta D_R^2 + O\!\left(\frac{1}{\eta T}\right)$.

We refer the reader to Appendix F.1 for a simple proof using existing theory on online convex optimization [McMahan, 2011, Jaggi, 2013]. Another useful insight is that Algorithms 1 and 2 are related to each other under special settings, as shown in Appendix H.1.

**Corollary 8** Consider the entropy regularizer $\mathrm{Reg}(w) = -\sum_{i=1}^n w_i \log w_i$ and $\ell$ as the zero-one loss. Then, Algorithm 1 and Algorithm 2 (with line search) achieve an $\epsilon$-approximate NE with $\epsilon = O\!\left(\sqrt{\frac{\log n}{T}}\right)$.

**Weak learning conditions.** It might not be practical for the $\mathcal{H}$-player to play BR (Step 3 of Algorithm 1) or, correspondingly, to find the best possible classifier at every round (Step 2 of Algorithm 2). Under weak learning conditions, we can still achieve (approximate) convergence when we only solve these problems approximately. See Appendix H.2 for more details.

## 6 Generalization Guarantees

In this section, we study the population RAI risk and present generalization bounds which quantify the rates at which the empirical RAI risk converges to its population counterpart.

### 6.1 Population RAI Games

Recall that the empirical RAI risk optimizes over all sample re-weightings $w \in \mathcal{W}_n$ that lie within the probability simplex $\Delta_n$. Thus its population counterpart optimizes over distributions $P$ that are absolutely continuous with respect to the data distribution $P_{\mathrm{data}}$:
$$R_{\mathcal{W}}(h) = \sup_{P :\, P \ll P_{\mathrm{data}},\ \frac{dP}{dP_{\mathrm{data}}} \in \mathcal{W}} \mathbb{E}_P[\ell(h(x), y)].$$
Following [Shapiro et al., 2021], we can rewrite this as follows. Suppose we use $Z = (X, Y) \in \mathcal{Z} := \mathcal{X} \times \mathcal{Y}$, so that $P, P_{\mathrm{data}}$ are distributions over $\mathcal{Z}$. We then define $\ell_h : \mathcal{Z} \mapsto \mathbb{R}$ as $\ell_h(z) = \ell(h(x), y)$. We can then write the population RAI risk as (see Appendix G for a proof):
$$R_{\mathcal{W}}(h) = \sup_{r :\, \mathcal{Z} \mapsto \mathbb{R}_+,\ \int r(z)\, dP_{\mathrm{data}}(z) = 1,\ r \in \mathcal{W}} \mathbb{E}_{P_{\mathrm{data}}}[r(z)\,\ell_h(z)]. \quad (4)$$
For classification, we define the RAI-Bayes optimal classifier as $Q^*_{\mathcal{W}} = \arg\min_Q R_{\mathcal{W}}(Q)$, where the minimum is over the set of all measurable classifiers (both deterministic and random). This is the target classifier we wish to learn given finite samples. Note that this might not be the same as the vanilla Bayes optimal classifier $Q^* = \arg\min_Q \mathbb{E}[\hat{R}(Q)]$, which only minimizes the expected loss, and hence may not satisfactorily address RAI considerations. We now try to characterize the RAI-Bayes optimal classifier. However, doing this requires a bit more structure on $\mathcal{W}$.
So, in the sequel, we consider constraint sets of the following form:
$$\mathcal{W} = \left\{ r : \mathcal{Z} \mapsto \mathbb{R}_+ \ :\ \int_{\mathcal{Z}} g_i(r(z))\, dP_{\mathrm{data}}(z) \leq c_i,\ i \in [m] \right\}, \quad (5)$$
where we assume that $g_i : \mathbb{R}_+ \mapsto \mathbb{R}$, $i \in [m]$, are convex. Note that this choice of $\mathcal{W}$ encompasses a broad range of RAI games, including DRO with f-divergences, CVaR, and soft-margin uncertainty sets. Perhaps surprisingly, the following proposition shows that the minimizer of the population RAI risk is nothing but the vanilla Bayes optimal classifier.

**Proposition 9 (Bayes optimal classifier)** Consider the problem of binary classification where $\mathcal{Y} = \{-1, +1\}$. Suppose $\ell(h(x), y) = \phi(y h(x))$ for some $\phi : \mathbb{R} \to [0, \infty)$ which is either the 0/1 loss, or a convex loss function that is differentiable at $0$ with $\phi'(0) < 0$. Suppose the uncertainty set $\mathcal{W}$ is as specified in Equation (5). Moreover, suppose $\{g_i\}_{i=1 \ldots m}$ are convex and differentiable functions. Then, the vanilla Bayes optimal classifier is also a RAI-Bayes optimal classifier.

**Remark 10** In the special case of $m = 1$ in Equation (5), we recover the result of [Hu et al., 2018]. However, our proof is much more elegant than the proof of [Hu et al., 2018], and relies on the dual representation of the population RAI risk.

One perspective on the above result is that the vanilla Bayes optimal classifier is also responsible as specified by the RAI game. This is actually reasonable in many practical prediction problems where the label annotations are derived from humans, who presumably are also responsible. Why then might we be interested in the RAI risk? One advantage of the RAI risks is in finite sample settings, where the equivalence no longer holds, and the RAI risk could be construed as encoding prior knowledge about properties of the Bayes optimal classifier. We also note that the above equivalence is specific to binary classification.

### 6.2 Generalization Guarantees

Our generalization bounds rely on the following dual characterization of the RAI population risk.

**Proposition 11** Suppose the uncertainty set $\mathcal{W}$ is as specified in Equation (5). Then for any hypothesis $h$, the population RAI risk can be equivalently written as
$$R_{\mathcal{W}}(h) = \inf_{\lambda \geq 0,\, \tau} \mathbb{E}_{P_{\mathrm{data}}}\left[ G^*_\lambda(\ell_h(z) - \tau) \right] + \sum_{i=1}^m \lambda_i c_i + \tau, \quad (6)$$
where $G^*_\lambda$ is the Fenchel conjugate of $G_\lambda(t) = \sum_{i=1}^m \lambda_i g_i(t)$.

We utilize the above expression for $R_{\mathcal{W}}(h)$ to derive the following deviation bound for $\hat{R}_{\mathcal{W}_n}(h)$.

**Theorem 12** Consider the setting of Proposition 11. Suppose $\{g_i\}_{i=1 \ldots m}$ are convex and differentiable functions, and suppose $\ell_h(z) \in [0, B]$ for all $h \in \mathcal{H}$, $z \in \mathcal{Z}$. Suppose, for any distribution $P_{\mathrm{data}}$, the minimizers $(\lambda^*, \tau^*)$ of Equation (6) lie in the following set: $\mathcal{E} = \{(\lambda, \tau) : \|\lambda\| \leq \Lambda,\ |\tau| \leq T\}$. Moreover, suppose the optimal $\lambda^*$ for $P_{\mathrm{data}}$ is bounded away from $0$ and satisfies $\min_i \lambda^*_i \geq \Lambda_*$. Let $G, L$ be the range and Lipschitz constants of $G^*_\lambda$:
$$G := \sup_{(\lambda, \tau) \in \mathcal{E}} G^*_\lambda(B - \tau) - G^*_\lambda(-\tau), \qquad L := \sup_{x \in [0, B],\ (\lambda, \tau) \in \mathcal{E},\ \min_i \lambda_i \geq \Lambda_*} \left\| \nabla_{(\lambda, \tau)} G^*_\lambda(x - \tau) \right\|.$$
For any fixed $h \in \mathcal{H}$, with probability at least $1 - 2e^{-t}$,
$$|R_{\mathcal{W}}(h) - \hat{R}_{\mathcal{W}_n}(h)| \leq 10\, n^{-1/2}\, G \sqrt{t + m \log(nL)}.$$

Given Theorem 12, one can take a union bound over the hypothesis class $\mathcal{H}$ to derive the following uniform convergence bound.

**Corollary 13** Let $N(\mathcal{H}, \epsilon, L_\infty(\mathcal{Z}))$ be the covering number of $\mathcal{H}$ in the sup-norm, which is defined as $\|h\|_{L_\infty(\mathcal{Z})} = \sup_{z \in \mathcal{Z}} |h(z)|$. Then with probability at least $1 - N(\mathcal{H}, \epsilon_n, L_\infty(\mathcal{Z}))\, e^{-t}$, the following holds for any $h \in \mathcal{H}$:
$$|R_{\mathcal{W}}(h) - \hat{R}_{\mathcal{W}_n}(h)| \leq 30\, n^{-1/2}\, G \sqrt{t + m \log(nL)}.$$
Here $\epsilon_n = n^{-1/2} G \sqrt{t + m \log(nL)}$.

The above bounds depend on parameters $(\lambda^*, \tau^*, G, L)$ which are specific to the constraint set $\mathcal{W}$. To instantiate them for any $\mathcal{W}$, one needs to bound these parameters.
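As a rough illustration of how Proposition 11 can be instantiated (our own example, ignoring the nonnegativity constraint on $r$ when computing the conjugate), consider the $\chi^2$ uncertainty set, i.e., $m = 1$, $g(r) = (r - 1)^2$, $c = \rho$. Then $G_\lambda(t) = \lambda (t - 1)^2$, its conjugate is $G^*_\lambda(s) = s + \frac{s^2}{4\lambda}$, and Equation (6) reduces to the familiar variance-regularized form of Namkoong and Duchi [2017]:

```latex
\begin{align*}
R_{\mathcal{W}}(h)
  &= \inf_{\lambda \ge 0,\, \tau}\ \mathbb{E}\Big[(\ell_h - \tau) + \tfrac{(\ell_h - \tau)^2}{4\lambda}\Big] + \lambda\rho + \tau \\
  &= \mathbb{E}[\ell_h] + \inf_{\lambda \ge 0,\, \tau}\ \Big\{ \tfrac{\mathbb{E}[(\ell_h - \tau)^2]}{4\lambda} + \lambda\rho \Big\}
     && \text{(since } \mathbb{E}[\ell_h - \tau] + \tau = \mathbb{E}[\ell_h]\text{)} \\
  &= \mathbb{E}[\ell_h] + \inf_{\lambda \ge 0}\ \Big\{ \tfrac{\mathrm{Var}(\ell_h)}{4\lambda} + \lambda\rho \Big\}
     && \text{(optimal } \tau^* = \mathbb{E}[\ell_h]\text{)} \\
  &= \mathbb{E}[\ell_h] + \sqrt{\rho\, \mathrm{Var}(\ell_h)}
     && \text{(optimal } \lambda^* = \tfrac{1}{2}\sqrt{\mathrm{Var}(\ell_h)/\rho}\text{)}.
\end{align*}
```

In this instance the dual minimizer $\lambda^*$ scales with $\sqrt{\mathrm{Var}(\ell_h)/\rho}$, so it stays bounded away from zero whenever the loss has non-degenerate variance, which is the kind of boundedness Theorem 12 assumes.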
We note that our generalization guarantees become sub-optimal as $\Lambda_* \to 0$. This is because the Lipschitz constant $L$ could potentially get larger as $\lambda^*$ approaches the boundary. Improving these bounds is an interesting future direction.

**Remark 14** We note that the aforementioned results follow from relatively stringent assumptions. Exploring the impact of relaxing these assumptions is an interesting direction for future work.

## 7 Experiments

In this section, we demonstrate the generality of the proposed RAI methods by studying one of the most well-studied problems in RAI, i.e., subpopulation shift. Given the large number of possible $\mathcal{W}$, we acknowledge that this is not a complete analysis, even with respect to the problems that live within the minimax framework. Instead, we aim to display convergence, plug-and-play generality, and superior performance over some seminal baselines for this task. We conduct experiments on both synthetic and real-world datasets; please refer to the Appendix for details on the synthetic experiments. We consider a number of responsible AI settings for subpopulation shift: the domain-oblivious (DO) setting where we do not know the sub-populations [Hashimoto et al., 2018, Lahoti et al., 2020, Zhai et al., 2021a], the domain-aware (DA) setting where we do [Sagawa et al., 2019], and the partially domain-aware (PDA) setting where only some might be known.

**Datasets & Domain Definition.** We use the following datasets: COMPAS [Angwin et al., 2016], CIFAR-10 (original, and with a class-imbalanced split [Jin et al., 2021, Qi et al., 2021]), and CIFAR-100. See the Appendix for more details on our datasets. For COMPAS, we consider race (White vs. Other) and biological gender (Male vs. Female) as our sensitive attributes. This forms four disjoint subgroups defined by these attributes. In the PDA setting, we partition only across the attribute race while training, but still run tests for all four subgroups. On CIFAR-10, class labels define our 10 subpopulations; for the PDA setting, we make 5 super-groups of two classes each. On CIFAR-100, class labels define our 100 subpopulations; for the PDA setting, we make 20 super-groups, each consisting of five classes.

**Baselines.** We compare our method against the following baselines: (a) deterministic classifiers trained on empirical risk (ERM) and DRO risks, particularly the quasi-online algorithm for Group DRO [Sagawa et al., 2019] (Online GDRO), and an ITLM-inspired SGD algorithm [Zhai et al., 2021b, Shen and Sanghavi, 2018] for $\chi^2$ DRO (SGD ($\chi^2$)); (b) ensemble models, namely AdaBoost [Schapire, 1999]. Note that the purpose of our experiments is to show that we can match baselines for a specific single desideratum (e.g., worst-case sub-population) while allowing for learning models that can address multiple responsible AI desiderata at the same time, for which we have no existing baselines.

**Proposed Methods.** We focus on Algorithm 2 and refer to the FW and Gen-AdaBoost updates as RAI-FW and RAI-GA, respectively. Moreover, our implementations include the following alterations: we track the unregularized objective value from Equation (1) on the validation set, and whenever it increases we double the regularization factor $\eta$, which we find can improve generalization. We also use this objective w.r.t. the normalized $Q_t$ to perform a line search for the step size $\alpha$. For the FW update, our search space is a ball around $\frac{1}{t}$ at round $t$, while for GA, we search within $(0, 1)$.

**Base Learners & Training.**
Training time scales linearly with the number of base learners; inference, though, can be parallelized if need be. We usually find that training 3-5 learners is good enough in all scenarios explored in the paper. We defer further details of our base learners and hyperparameter choices to the Appendix.

**Constraint sets $\mathcal{W}_n$.** For the RAI algorithms, we use the following constraint sets:
- Domain-Oblivious (DO): We use the $\chi^2$-DRO constraint set to control for worst-case subpopulations.
- Domain-Aware (DA): We use the Group DRO constraint set, as the domain definitions are known.
- Partially Domain-Aware (PDA): We use a novel set $\mathcal{W}_n$ which is the intersection of Group DRO constraints over the known domains and $\chi^2$ constraints to control for unknown group performance.

For baselines, we use AdaBoost and SGD ($\chi^2$) in the DO setting. Online GDRO serves as our baseline for both the DA and PDA settings, where the algorithm uses whatever domain definitions are available.

| Setting | Algorithm | COMPAS Average | COMPAS Worst Group | CIFAR-10 (Imbalanced) Average | CIFAR-10 (Imbalanced) Worst Class | CIFAR-10 Average | CIFAR-10 Worst Class | CIFAR-100 Average | CIFAR-100 Worst Class |
|---|---|---|---|---|---|---|---|---|---|
| DO (Complex) | ERM | 31.3 ± 0.2 | 31.7 ± 0.1 | 12.1 ± 0.3 | 30.4 ± 0.2 | 8.3 ± 0.2 | 21.3 ± 0.5 | 25.2 ± 0.2 | 64.0 ± 0.7 |
| DO (Complex) | RAI-GA ($\chi^2$) | 31.3 ± 0.2 | 31.2 ± 0.2 | 11.7 ± 0.4 | 29.0 ± 0.3 | 8.2 ± 0.1 | 19.0 ± 0.1 | 25.6 ± 0.4 | 56.8 ± 0.8 |
| DO (Complex) | RAI-FW ($\chi^2$) | 31.2 ± 0.1 | 31.4 ± 0.3 | 11.9 ± 0.1 | 29.1 ± 0.2 | 8.0 ± 0.3 | 15.4 ± 0.4 | 25.4 ± 0.2 | 58.0 ± 1.1 |
| DO | ERM | 32.1 ± 0.3 | 34.6 ± 0.4 | 14.2 ± 0.1 | 33.6 ± 0.3 | 11.4 ± 0.4 | 27.0 ± 0.1 | 27.1 ± 0.3 | 66.0 ± 1.1 |
| DO | AdaBoost | 31.8 ± 0.4 | 32.6 ± 0.3 | 15.2 ± 0.2 | 40.6 ± 0.2 | 12.0 ± 0.1 | 28.7 ± 0.3 | 28.1 ± 0.2 | 72.2 ± 1.2 |
| DO | SGD ($\chi^2$) | 32.0 ± 0.2 | 33.7 ± 0.2 | 13.3 ± 0.3 | 31.7 ± 0.4 | 11.3 ± 0.3 | 24.7 ± 0.1 | 27.4 ± 0.1 | 65.9 ± 1.2 |
| DO | RAI-GA ($\chi^2$) | 31.5 ± 0.2 | 33.2 ± 0.3 | 14.0 ± 0.1 | 32.2 ± 0.2 | 10.8 ± 0.4 | 25.0 ± 0.2 | 27.4 ± 0.4 | 65.0 ± 0.8 |
| DO | RAI-FW ($\chi^2$) | 31.6 ± 0.1 | 32.5 ± 0.5 | 13.9 ± 0.1 | 32.6 ± 0.3 | 10.9 ± 0.4 | 23.4 ± 0.2 | 27.5 ± 0.1 | 63.8 ± 0.6 |
| DA | Online GDRO | 31.7 ± 0.2 | 32.2 ± 0.3 | 13.1 ± 0.2 | 26.6 ± 0.2 | 11.2 ± 0.1 | 21.7 ± 0.3 | 27.3 ± 0.1 | 57.0 ± 0.5 |
| DA | RAI-GA (Group) | 32.0 ± 0.1 | 32.7 ± 0.1 | 13.0 ± 0.3 | 27.3 ± 0.4 | 11.5 ± 0.1 | 22.4 ± 0.2 | 27.4 ± 0.2 | 56.6 ± 1.1 |
| DA | RAI-FW (Group) | 32.1 ± 0.2 | 32.3 ± 0.2 | 13.0 ± 0.2 | 26.0 ± 0.1 | 11.4 ± 0.3 | 20.3 ± 0.1 | 27.9 ± 0.2 | 52.9 ± 0.9 |
| PDA | Online GDRO | 31.5 ± 0.1 | 32.7 ± 0.2 | 13.4 ± 0.1 | 32.2 ± 0.2 | 11.3 ± 0.2 | 25.2 ± 0.1 | 27.7 ± 0.2 | 64.0 ± 0.8 |
| PDA | RAI-GA (Group ∩ $\chi^2$) | 31.4 ± 0.4 | 32.9 ± 0.2 | 13.0 ± 0.3 | 30.1 ± 0.1 | 10.8 ± 0.2 | 23.7 ± 0.2 | 27.5 ± 0.1 | 62.5 ± 0.6 |
| PDA | RAI-FW (Group ∩ $\chi^2$) | 31.8 ± 0.2 | 32.3 ± 0.1 | 13.5 ± 0.3 | 29.4 ± 0.3 | 11.2 ± 0.4 | 24.0 ± 0.2 | 27.9 ± 0.3 | 58.9 ± 0.7 |

Table 2: Mean and worst-case expected loss for baselines, RAI-GA, and RAI-FW. (Complex) indicates the use of larger models. Constraint sets $\mathcal{W}_n$ are indicated in parentheses. Each experiment is carried out over three random seeds and confidence intervals are reported.

**Results and Discussion.** We run our methods and baselines under the settings described above and report the results in Table 2. We can make the following observations:

1. The RAI-FW and RAI-GA methods significantly improve the worst-case performance with only a few base learners across all datasets in all three settings, while maintaining average-case performance. Moreover, for seemingly harder tasks, i.e., those with a large gap between average and worst-case performance, the algorithms are able to improve significantly over the baselines. For example, we observe a 5% improvement in performance in the case of CIFAR-100.

2. The plug-and-play framework allows for several different $\mathcal{W}_n$ to enhance various responsible AI qualities at once.
We demonstrate this with the partially domain-aware (PDA) setting, where the performance lead widens, indicating that RAI is able to jointly optimize effectively for both known and unknown subpopulations, while Online GDRO suffers from some of the group information being unknown. In practice, one can construct many more novel sets $\mathcal{W}_n$.

3. Although bigger (complex) models exhibit stronger performance than RAI ensembles, there are several caveats to this observation. Firstly, these models are 10-15 times larger than our base models, which limits their use w.r.t. both the training and inference compute required. In contrast, RAI ensembles utilize a small number of much smaller models which can be individually trained quite easily. Even with these large models as base learners, constructing ensembles exhibits a performance boost, indicating that our framework is able to boost models of varying complexities.

## 8 Conclusion

Under the umbrella of "responsible AI", an emerging line of work has attempted to formalize desiderata ranging over ethics, fairness, robustness, and safety, among others. Many of these settings (Table 1) can be written as min-max problems involving optimizing some worst-case loss over a set of predefined distributions. For all the problems that can be framed as above, we introduce and study a general framework, which we refer to as Responsible AI (RAI) games. Our framework extends to classical boosting scenarios, offering boosting-based algorithms for RAI games alongside proven convergence guarantees. We propose practical algorithms to solve these games, as well as statistical analyses of the solutions of these games. We find that RAI games can guarantee multiple responsible AI aspects under appropriate choices of uncertainty sets.

## Acknowledgements

We acknowledge the support of DARPA via HR00112020006, and NSF via IIS-1909816.

## References

Microsoft. https://www.microsoft.com/en-us/ai/responsible-ai, 2021.

Google. https://ai.google/responsibility/responsible-ai-practices/, 2020.

Hongseok Namkoong and John C. Duchi. Variance-based regularization with convex objectives. Advances in Neural Information Processing Systems, 30, 2017.

John Duchi and Hongseok Namkoong. Learning models with uniform performance via distributionally robust optimization. arXiv preprint arXiv:1810.08750, 2018.

Shiori Sagawa, Pang Wei Koh, Tatsunori B. Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. arXiv preprint arXiv:1911.08731, 2019.

Runtian Zhai, Chen Dan, Arun Suggala, J. Zico Kolter, and Pradeep Ravikumar. Boosted CVaR classification. Advances in Neural Information Processing Systems, 34:21860-21871, 2021a.

Tatsunori B. Hashimoto, Megha Srivastava, Hongseok Namkoong, and Percy Liang. Fairness without demographics in repeated loss minimization. International Conference on Machine Learning, 2018.

Runtian Zhai, Chen Dan, Zico Kolter, and Pradeep Ravikumar. DORO: Distributional and outlier robust optimization. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 12345-12355. PMLR, 18-24 Jul 2021b. URL https://proceedings.mlr.press/v139/zhai21a.html.

Yuji Roh, Kangwook Lee, Steven Euijong Whang, and Changho Suh. FR-Train: A mutual information-based approach to fair and robust training. arXiv preprint arXiv:2002.10234, 2020.
Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a metaalgorithm and applications. Theory of computing, 8(1):121 164, 2012. Brendan Mc Mahan. Follow-the-regularized-leader and mirror descent: Equivalence theorems and l1 regularization. In Geoffrey Gordon, David Dunson, and Miroslav Dudík, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 525 533, Fort Lauderdale, FL, USA, 11 13 Apr 2011. PMLR. Sébastien Bubeck. Introduction to online optimization. Lecture notes, 2:1 86, 2011. Weihua Hu, Gang Niu, Issei Sato, and Masashi Sugiyama. Does distributionally robust supervised learning give robust classifiers? In International Conference on Machine Learning, pages 2029 2037. PMLR, 2018. Peter Bühlmann. Invariance, causality and robustness. ar Xiv preprint ar Xiv: 1812.08233, 2018. Soroosh Shafieezadeh-Abadeh, Peyman Mohajerin Esfahani, and D. Kuhn. Distributionally robust logistic regression. Neural Information Processing Systems, 2015. Rui Gao and Anton Kleywegt. Distributionally robust stochastic optimization with wasserstein distance. Mathematics of Operations Research, 2022. Hongseok Namkoong and John C. Duchi. Stochastic gradient methods for distributionally robust optimization with f-divergences. In Neural Information Processing Systems, 2016. URL https: //api.semanticscholar.org/Corpus ID:7481496. Aharon Ben-Tal, Dick den Hertog, Anja De Waegenaere, Bertrand Melenberg, and Gijs Rennen. Robust solutions of optimization problems affected by uncertain probabilities. Advanced Risk & Portfolio Management Research Paper Series, 2011. URL https://api.semanticscholar. org/Corpus ID:761793. Debmalya Mandal, Samuel Deng, Suman Jana, Jeannette Wing, and Daniel J Hsu. Ensuring fairness beyond the training data. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 18445 18456. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/ 2020/file/d6539d3b57159babf6a72e106beb45bd-Paper.pdf. Natalia Martinez, Martin Bertran, and Guillermo Sapiro. Minimax pareto fairness: A multi objective perspective. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 6755 6764. PMLR, 13-18 Jul 2020. URL https://proceedings.mlr.press/v119/ martinez20a.html. L. Oneto, Michele Donini, Amon Elders, and M. Pontil. Taking advantage of multitask learning for fair classification. AAAI/ACM Conference on AI, Ethics, and Society, 2018. doi: 10.1145/3306618. 3314255. Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, et al. Wilds: A benchmark of in-the-wild distribution shifts. In International Conference on Machine Learning, pages 5637 5664. PMLR, 2021. Kaidi Cao, Colin Wei, Adrien Gaidon, Nikos Arechiga, and Tengyu Ma. Learning imbalanced datasets with label-distribution-aware margin loss. Advances in Neural Information Processing Systems, 32:1567 1578, 2019. Aditya Krishna Menon, Sadeep Jayasumana, Ankit Singh Rawat, Himanshu Jain, Andreas Veit, and Sanjiv Kumar. Long-tail learning via logit adjustment. In International Conference on Learning Representations, 2021. 
Ganesh Ramachandra Kini, Orestis Paraskevas, Samet Oymak, and Christos Thrampoulidis. Labelimbalanced and group-sensitive classification under overparameterization. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021. Aman Sinha, Hongseok Namkoong, Riccardo Volpi, and John Duchi. Certifying some distributional robustness with principled adversarial training. ar Xiv preprint ar Xiv:1710.10571, 2017. Rui Gao, Xi Chen, and Anton J Kleywegt. Wasserstein distributionally robust optimization and variation regularization. Operations Research, 2022. Matthew Staib and Stefanie Jegelka. Distributionally robust optimization and generalization in kernel methods. Advances in Neural Information Processing Systems, 32, 2019. Jia-Jie Zhu, Wittawat Jitkrittum, Moritz Diehl, and Bernhard Schölkopf. Kernel distributionally robust optimization. ar Xiv preprint ar Xiv:2006.06981, 2020. Mike Li, Hongseok Namkoong, and Shangzhou Xia. Evaluating model performance under worst-case subpopulations. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021a. URL https://openreview.net/ forum?id=nehzx Ady Jx F. Qi Qi, Yi Xu, Rong Jin, Wotao Yin, and Tianbao Yang. Attentional biased stochastic gradient for imbalanced classification. ar Xiv preprint ar Xiv:2012.06951, 2020. Ramnath Kumar, Kushal Majmundar, Dheeraj Nagaraj, and Arun Sai Suggala. Stochastic re-weighted gradient descent via distributionally robust optimization. ar Xiv preprint ar Xiv:2306.09222, 2023. Jikai Jin, Bohang Zhang, Haiyang Wang, and Liwei Wang. Non-convex distributionally robust optimization: Non-asymptotic analysis. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021. URL https: //openreview.net/forum?id=g ZLh HMyxa-. Leo Breiman. Prediction games and arcing algorithms. Neural computation, 11(7):1493 1517, 1999. Jerome Friedman, Trevor Hastie, Robert Tibshirani, et al. Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The annals of statistics, 28(2): 337 407, 2000. Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of statistics, pages 1189 1232, 2001. Yoav Freund and Robert E Schapire. A desicion-theoretic generalization of on-line learning and an application to boosting. In European conference on computational learning theory, pages 23 37. Springer, 1995. Yoav Freund, Robert E Schapire, et al. Experiments with a new boosting algorithm. In icml, volume 96, pages 148 156. Citeseer, 1996. Llew Mason, Jonathan Baxter, Peter L Bartlett, and Marcus R Frean. Boosting algorithms as gradient descent. In Advances in neural information processing systems, pages 512 518, 2000. Robert E Schapire. The boosting approach to machine learning: An overview. In MSRI workshop on Nonlinear Estimation and Classification, 1999. Ayhan Demiriz, Kristin P Bennett, and John Shawe-Taylor. Lpboost: A boosting algorithm with linear programming. In International Conference on Machine Learning, pages 143 150. AAAI Press, 2002. Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean. Boosting algorithms as gradient descent. In S. Solla, T. Leen, and K. Müller, editors, Advances in Neural Information Processing Systems, volume 12. MIT Press, 1999. URL https://proceedings.neurips.cc/paper_ files/paper/1999/file/96a93ba89a5b5c6c226e49b88973f46e-Paper.pdf. Tianqi Chen and Carlos Guestrin. 
Xgboost: A scalable tree boosting system. ar Xiv preprint ar Xiv: Arxiv-1603.02754, 2016. Dinghuai Zhang, Hongyang Zhang, Aaron Courville, Yoshua Bengio, Pradeep Ravikumar, and Arun Sai Suggala. Building robust ensembles via margin boosting. In International Conference on Machine Learning, pages 26669 26692. PMLR, 2022. Laurent Meunier, Meyer Scetbon, Rafael B Pinot, Jamal Atif, and Yann Chevaleyre. Mixed nash equilibria in the adversarial examples game. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 7677 7687. PMLR, 18-24 Jul 2021. URL https://proceedings. mlr.press/v139/meunier21a.html. Maria-Florina Balcan, Rattana Pukdee, Pradeep Ravikumar, and Hongyang Zhang. Nash equilibria and pitfalls of adversarial training in adversarial robustness games. In International Conference on Artificial Intelligence and Statistics, pages 9607 9636. PMLR, 2023. M. A. Bennouna and Bart P. G. Van Parys. Holistic robust data-driven decisions. ARXIV.ORG, 2022. doi: 10.48550/ar Xiv.2207.09560. Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, pages 214 226, 2012. Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning fair representations. In International conference on machine learning, pages 325 333. PMLR, 2013. Moritz Hardt, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. Advances in neural information processing systems, 29, 2016a. Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P Gummadi. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th international conference on world wide web, pages 1171 1180, 2017. Matt J Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva. Counterfactual fairness. Advances in neural information processing systems, 30, 2017. John Rawls. A theory of justice: Revised edition. Harvard university press, 2020. Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness in machine learning. Nips tutorial, 1: 2017, 2017. Alexandra Chouldechova and Aaron Roth. The frontiers of fairness in machine learning. ar Xiv preprint ar Xiv:1810.08810, 2018. Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. A survey on bias and fairness in machine learning. ACM Computing Surveys (CSUR), 54(6):1 35, 2021. Mike Li, Hongseok Namkoong, and Shangzhou Xia. Evaluating model performance under worst-case subpopulations. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 17325 17334. Curran Associates, Inc., 2021b. URL https://proceedings.neurips.cc/paper_files/ paper/2021/file/908075ea2c025c335f4865f7db427062-Paper.pdf. Manfred K Warmuth, Jun Liao, and Gunnar Rätsch. Totally corrective boosting algorithms that maximize the margin. In Proceedings of the 23rd international conference on Machine learning, pages 1001 1008, 2006. Peter Bartlett, Yoav Freund, Wee Sun Lee, and Robert E Schapire. Boosting the margin: A new explanation for the effectiveness of voting methods. The annals of statistics, 26(5):1651 1686, 1998. Mehryar Mohri, Gary Sivek, and Ananda Theertha Suresh. Agnostic federated learning. 
In International Conference on Machine Learning, pages 4615 4625. PMLR, 2019. Weihua Hu, Gang Niu, Issei Sato, and Masashi Sugiyama. Does distributionally robust supervised learning give robust classifiers? International Conference On Machine Learning, 2016. Alexandre Lacasse, François Laviolette, Mario Marchand, Pascal Germain, and Nicolas Usunier. Pac-bayes bounds for the risk of the majority vote and the variance of the gibbs classifier. Advances in Neural information processing systems, 19, 2006. Pascal Germain, Alexandre Lacasse, Francois Laviolette, Mario Marchand, and Jean-Francis Roy. Risk bounds for the majority vote: From a pac-bayesian analysis to a learning algorithm. ar Xiv preprint ar Xiv:1503.08329, 2015. Andrés Masegosa, Stephan Lorenzen, Christian Igel, and Yevgeny Seldin. Second order pac-bayesian bounds for the weighted majority vote. Advances in Neural Information Processing Systems, 33: 5263 5273, 2020. Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, learning, and games. Cambridge university press, 2006. Ruitong Huang, Tor Lattimore, András György, and Csaba Szepesvári. Following the leader and fast rates in online linear prediction: Curved constraint sets and other regularities. The Journal of Machine Learning Research, 18(1):5325 5355, 2017. Martin Jaggi. Revisiting frank-wolfe: Projection-free sparse convex optimization. In International conference on machine learning, pages 427 435. PMLR, 2013. Neal Parikh, Stephen Boyd, et al. Proximal algorithms. Foundations and trends in Optimization, 1 (3):127 239, 2014. Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczynski. Lectures on stochastic programming: modeling and theory. SIAM, 2021. Preethi Lahoti, Alex Beutel, Jilin Chen, Kang Lee, Flavien Prost, Nithum Thain, Xuezhi Wang, and Ed Chi. Fairness without demographics through adversarially reweighted learning. Advances in neural information processing systems, 33:728 740, 2020. Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Machine bias: There s software used across the country to predict future criminals. and its biased against blacks. Pro Publica, May 2016. URL https://www.propublica.org/article/ machine-bias-risk-assessments-in-criminal-sentencing. Qi Qi, Zhishuai Guo, Yi Xu, Rong Jin, and Tianbao Yang. An online method for a class of distributionally robust optimization with non-convex objectives. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 10067 10080. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/ 533fa796b43291fc61a9e812a50c3fb6-Paper.pdf. Yanyao Shen and Sujay Sanghavi. Learning with bad training data via iterative trimmed loss minimization. International Conference On Machine Learning, 2018. Nidhi Kalra and Susan M. Paddock. Driving to safety: How many miles of driving would it take to demonstrate autonomous vehicle reliability? Transportation Research Part A: Policy and Practice, 94:182 193, 2016. Andreas Fuster, Paul Goldsmith-Pinkham, Tarun Ramadorai, and Alexander Walther. Predictably unequal? the effects of machine learning on credit markets. Social Science Research Network, (3072038), 2018. Xinsong Ma, Zekai Wang, and Weiwei Liu. On the tradeoff between robustness and fairness. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 26230 26241. Curran Associates, Inc., 2022. 
URL https://proceedings.neurips.cc/paper_files/paper/2022/file/ a80ebbb4ec9e9b39789318a0a61e2e43-Paper-Conference.pdf. Anastasia Chan. Gpt-3 and instructgpt: technological dystopianism, utopianism, and contextual perspectives in ai ethics and industry. AI and Ethics, 3(1):53 64, 2023. Emily M Bender, Timnit Gebru, Angelina Mc Millan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big. In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pages 610 623, 2021. Christos Louizos, Kevin Swersky, Yujia Li, M. Welling, and R. Zemel. The variational fair autoencoder. International Conference on Learning Representations, 2015. Moritz Hardt, Eric Price, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016b. URL https://proceedings.neurips.cc/paper_files/paper/2016/file/ 9d2682367c3935defcb1f9e247a97c0d-Paper.pdf. Haohan Wang, Zexue He, Zachary Chase Lipton, and E. Xing. Learning robust representations by projecting superficial statistics out. International Conference on Learning Representations, 2018. Sravanti Addepalli, Anshul Nasery, R. Venkatesh Babu, Praneeth Netrapalli, and Prateek Jain. Learning an invertible output mapping can mitigate simplicity bias in neural networks. ar Xiv preprint ar Xiv: 2210.01360, 2022. Junbum Cha, Kyungjae Lee, Sungrae Park, and Sanghyuk Chun. Domain generalization by mutualinformation regularization with pre-trained models. European Conference on Computer Vision, 2022. doi: 10.48550/ar Xiv.2203.10789. Elad Hazan. Introduction to online convex optimization. Found. Trends Optim., 2016. doi: 10.1561/ 2400000013. Kartik Gupta, Arun Sai Suggala, Adarsh Prasad, Praneeth Netrapalli, and Pradeep Ravikumar. Learning minimax estimators via online learning. ARXIV.ORG, 2020. Jonathan Michael Borwein. A very complicated proof of the minimax theorem. 2016. Abraham Wald. Generalization of a theorem by v. neumann concerning zero sum two person games. Annals of Mathematics, 46:281, 1945. Stephen Simons. Minimax theorems and their proofs. 1995. TES Raghavan. Zero-sum two-person games. Handbook of game theory with economic applications, 2:735 768, 1994. Andrew Cotter, Maya Gupta, and Harikrishna Narasimhan. On making stochastic classifiers deterministic. Advances in Neural Information Processing Systems, 32, 2019. Jimmy Wu, Yatong Chen, and Yang Liu. Metric-fair classifier derandomization. In International Conference on Machine Learning, pages 23999 24016. PMLR, 2022. H Brendan Mc Mahan. A survey of algorithms and analysis for adaptive online learning. The Journal of Machine Learning Research, 18(1):3117 3166, 2017. Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004. R Tyrrell Rockafellar. Convex analysis. Number 28. Princeton university press, 1970. Peter L Bartlett, Michael I Jordan, and Jon D Mc Auliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138 156, 2006. Ambuj Tewari and Peter L Bartlett. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8(5), 2007. Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019. Robert M. Freund and Paul Grigas. New analysis and results for the frank wolfe method. 
Mathematical Programming, 155:199–230, 2014. doi: 10.1007/s10107-014-0841-6. URL http://link.springer.com/article/10.1007/s10107-014-0841-6/fulltext.html.

Sergey Zagoruyko and N. Komodakis. Wide residual networks. British Machine Vision Conference, 2016. doi: 10.5244/C.30.87.

A Broader Impact
B Limitations
C Terminology and Notation
 C.1 Terminology
 C.2 Notation
D Background
 D.1 Two Player Zero-sum Games
 D.2 Online Learning
 D.3 General Minimax Theorems (Proof of Proposition 1)
E Ensemble RAI Games
 E.1 Alternative Definitions: Deterministic Ensembles
 E.2 Proofs
F Algorithms
 F.1 Convergence Rates
 F.2 Closed Form Updates for different uncertainty sets
 F.3 Proof of Corollary 8
G Generalization
 G.1 Population Risk
 G.2 Generalization Guarantees
 G.3 Proof of Corollary 13
H Algorithms: Further Discussion
 H.1 Equivalence Conditions for RAI Algorithms (1 and 2)
 H.2 Weak Learning Conditions
I Experiments
 I.1 Setup
 I.2 Further Details for Section 7
 I.3 Interesting Observation: Boosting Robust Learners
 I.4 Synthetic Datasets

A Broader Impact

Responsible AI has become an important topic as ML/AI systems increase in scale and are deployed in a variety of scenarios. Models not optimized for responsible facets can have disastrous consequences [Kalra and Paddock, 2016, Angwin et al., 2016, Fuster et al., 2018]. Our research promotes the principles of Responsible AI (RAI), encompassing ethics, fairness, and safety considerations. By providing a framework to consider these simultaneously, this research could help prevent scenarios where optimizing for one aspect unintentionally compromises another [Ma et al., 2022], mitigating the risk of creating biases or vulnerabilities in AI systems. We also address the fragmentation in recent work around Responsible AI. The plug-and-play nature of our approach allows for the easy adaptation of algorithms to different Responsible AI considerations. This adaptability could lead to more practical and user-friendly tools for implementing RAI across different applications. ML models such as large language models (GPT-3, GPT-4) are rapidly transforming numerous domains, including natural language processing, data analysis, content creation, and more.
Given their remarkable ability to generate human-like text, they can be used to construct narratives, answer queries, or even automate customer service. However, with such capabilities come significant responsibilities, as these models can inadvertently perpetuate biases, misinformation, or harmful content if not correctly regulated. Therefore, ensuring the responsible behavior of these models is crucial [Chan, 2023, Bender et al., 2021]. Our research presents a general framework that can be pivotal in the responsible design of such large language models.

B Limitations

We now identify some limitations of our current work, along with corresponding future directions. To more concretely establish the empirical superiority of our optimization techniques, more experiments involving large over-parametrized models need to be conducted. The proposed generalization bounds are not tight for all risks; a more careful analysis would be needed for such a generalization. Our framework only handles uncertainty sets that are supported on the training data. It would be interesting to generalize our framework further to support other uncertainty sets, e.g., sets based on Wasserstein divergences, which are not necessarily supported on the training data. Finally, the presence of outliers can often destabilize the training of large models [Zhai et al., 2021b]. However, our current framework assumes the training data is uncorrupted. In the future, we aim to extend our framework to support corruptions in the training data.

We note that our framework can also be extended to the problem of adversarial test-time robustness, where an adversary corrupts the inputs sent to the model during inference. Let A(x) be the set of perturbations that the adversary can add to input x. The uncertainty set in this case contains distributions supported on {(x′, y′) : (x, y) ∼ P̂_data, x′ ∈ x + A(x), y′ = y}.

Our framework primarily covers RAI aspects which can be written as min-max problems. Here we list some other notions of RAI which our framework does not directly cover.

Fairness: Various notions of fairness have been studied by the ML community. While our framework captures minimax group fairness, it does not capture other notions of group fairness such as Demographic Parity [Louizos et al., 2015], Equality of Odds, or Equality of Opportunity [Hardt et al., 2016b]. Our framework also does not capture individual fairness notions.

Robustness: While our framework covers group robustness and certain forms of distributional robustness, it does not cover robustness to Wasserstein perturbations and other (un)structured distribution shifts often studied in the domain generalization community [Wang et al., 2018, Addepalli et al., 2022, Cha et al., 2022].

C Terminology and Notation

C.1 Terminology

Strongly Convex Sets. A set A is λ-strongly convex w.r.t. a norm ∥·∥ if, for any x, y ∈ A and γ ∈ [0, 1], the norm ball centered at γx + (1 − γ)y with radius γ(1 − γ)λ∥x − y∥²/2 lies in A.

f-divergence. For any two probability distributions P, Q, the f-divergence between Q and P is defined as D_f(Q∥P) = E_P[f(dQ/dP)]. Here f : R_+ → R is a convex function such that f(1) = 0.
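As a concrete illustration of this definition, the short sketch below (our own illustrative code, not part of the paper's released implementation; all names are ours) evaluates D_f(Q∥P) for the KL and χ² choices of f on small discrete distributions.

```python
import numpy as np

def f_divergence(q, p, f):
    """D_f(Q || P) = E_P[f(dQ/dP)] for discrete distributions on a common, fully supported grid."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    ratio = q / p                          # likelihood ratio dQ/dP
    return float(np.sum(p * f(ratio)))     # expectation under P

kl_f   = lambda t: t * np.log(t)           # convex, f(1) = 0  ->  KL(Q || P)
chi2_f = lambda t: (t - 1.0) ** 2          # -> chi-squared divergence

p = np.array([0.25, 0.25, 0.25, 0.25])     # reference distribution (e.g. empirical)
q = np.array([0.40, 0.30, 0.20, 0.10])     # reweighted distribution
print(f_divergence(q, p, kl_f), f_divergence(q, p, chi2_f))
```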
C.2 Notation

Table 3: Notation

X: Input random variable
𝒳: Input domain
Y: Output random variable
𝒴: Output domain
Z: Sample random variable (X, Y)
𝒵: Sample domain 𝒳 × 𝒴
S: Sample set
P_data: Data generating distribution
P̂_data: Empirical distribution (uniform) over S
H: Set of hypotheses
Q: Distribution over hypotheses from H
h: Any given hypothesis
h_rand;Q: Randomized ensemble given by Q
h_det;Q: De-randomized/deterministic classifier corresponding to h_rand;Q
ℓ: Loss function
R̂(h): Empirical risk of h
W_n: Set of allowed sample weights (a.k.a. uncertainty set)
R̂_{W_n}(h): Empirical RAI risk of h
R_W(h): Population RAI risk of h
R̂_{rand;W_n}(Q): Randomized ensemble RAI risk
KL(p∥q): KL divergence between p and q
G_i: Subpopulation/domain i
Reg: Any given regularizer function

D Background

D.1 Two Player Zero-sum Games

Consider the following game between two players: a so-called row player playing actions h ∈ H, and a column player playing actions z ∈ Z. When the two players play actions h, z respectively, the row player incurs a loss of ℓ(h, z) ∈ R, while the column player incurs a loss of −ℓ(h, z). The sum of the losses of the two players equals zero, so such a game is known as a two-player zero-sum game. It is common in such settings to refer to the gain ℓ(h, z) of the column player, rather than its loss of −ℓ(h, z). Both players try to maximize their gain/minimize their loss.

It is common in game theory to consider a linearized game in the space of probability measures, which is in general better behaved. To set up some notation, for any probability distributions P_h over H and P_z over Z, define ℓ(P_h, P_z) = E_{h∼P_h, z∼P_z} ℓ(h, z).

Nash Equilibrium. A Nash Equilibrium (NE) is a stable state of a game where no player can gain by unilaterally changing their strategy while the other players keep theirs unchanged. In a two-player zero-sum game, a pure strategy Nash Equilibrium is a pair of actions (h*, z*) satisfying

sup_{z∈Z} ℓ(h*, z) ≤ ℓ(h*, z*) ≤ inf_{h∈H} ℓ(h, z*).

Note that whenever a pure strategy NE exists, the minimax and maximin values of the game are equal to each other:

inf_{h∈H} sup_{z∈Z} ℓ(h, z) = ℓ(h*, z*) = sup_{z∈Z} inf_{h∈H} ℓ(h, z).

What often exists is a mixed strategy NE, which is precisely a pure strategy NE of the linearized game. That is, (P*_h, P*_z) is called a mixed strategy NE of the zero-sum game if

sup_{P_z∈P_Z} ℓ(P*_h, P_z) ≤ ℓ(P*_h, P*_z) ≤ inf_{P_h∈P_H} ℓ(P_h, P*_z).

For this paper, Q plays the role of P_h, w plays the role of P_z, and the corresponding strategy spaces P_H and P_Z are the set of distributions over H and the uncertainty set W_n, respectively.

ϵ-Approximate Nash Equilibrium. An ϵ-approximate Nash Equilibrium is a relaxation of the Nash Equilibrium, where each player's strategy may not be the best response but is within ϵ of the best response. Formally, a pair of mixed strategies (P_h, P_z) is an ϵ-approximate Nash Equilibrium if

inf_{P′_h∈P_H} ℓ(P′_h, P_z) + ϵ ≥ ℓ(P_h, P_z) ≥ sup_{P′_z∈P_Z} ℓ(P_h, P′_z) − ϵ.   (7)

No Regret Algorithms. No-regret algorithms are a class of online algorithms used in repeated games. The regret of a player is defined as the difference between their cumulative payoff and the best cumulative payoff they could have achieved by consistently playing a single strategy. A no-regret algorithm guarantees that the average regret of a player goes to zero as the number of iterations (or rounds) tends to infinity. In the context of two-player zero-sum games, if both players follow no-regret algorithms, their average strategy profiles converge to the set of Nash Equilibria.
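To make the no-regret-to-equilibrium connection concrete, here is a small self-contained sketch (our own; the toy game and all names are illustrative, not from the paper's code) in which both players of a zero-sum matrix game run Hedge, and the averaged strategies become an approximately unexploitable pair.

```python
import numpy as np

def hedge_selfplay(L, T=5000, eta=0.05):
    """Both players run Hedge (a no-regret algorithm) on the zero-sum game with loss
    matrix L: the row player minimizes L[h, z], the column player maximizes it.
    The averaged strategies converge to an approximate mixed Nash equilibrium."""
    n, m = L.shape
    p, q = np.ones(n) / n, np.ones(m) / m
    p_avg, q_avg = np.zeros(n), np.zeros(m)
    for _ in range(T):
        p_avg += p / T
        q_avg += q / T
        row_loss = L @ q                           # expected loss of each row action
        col_gain = L.T @ p                         # expected gain of each column action
        p = p * np.exp(-eta * row_loss); p /= p.sum()
        q = q * np.exp(+eta * col_gain); q /= q.sum()
    return p_avg, q_avg

L = np.array([[0.0, 1.0], [1.0, 0.0]])             # matching-pennies-style losses
p_avg, q_avg = hedge_selfplay(L)
# Exploitability sup_z l(p, z) - inf_h l(h, q) is small for an approximate NE.
print(p_avg, q_avg, (L.T @ p_avg).max() - (L @ q_avg).min())
```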
D.2 Online Learning

A popular and widely used approach for solving min-max games is to rely on online learning algorithms [Hazan, 2016, Cesa-Bianchi and Lugosi, 2006]. In this approach, the row (minimization) player and the column (maximization) player play a repeated game against each other. Both players rely on online learning algorithms to choose their actions in each round of the game, with the objective of minimizing their respective regret. The following proposition shows that this repeated gameplay converges to a NE.

Proposition 15 ([Gupta et al., 2020]) Consider a repeated game between the minimization and maximization players in the linearized game. Let (P^t_h, P^t_z) be the actions chosen by the players in iteration t. Suppose the actions are such that the regret of each player satisfies:

Σ_{t=1}^T ℓ(P^t_h, P^t_z) − inf_{h∈H} Σ_{t=1}^T ℓ(h, P^t_z) ≤ ϵ_1(T),
sup_{z∈Z} Σ_{t=1}^T ℓ(P^t_h, z) − Σ_{t=1}^T ℓ(P^t_h, P^t_z) ≤ ϵ_2(T).

Let P^AVG_h, P^AVG_z denote the mixture distributions (1/T) Σ_{i=1}^T P^i_h and (1/T) Σ_{i=1}^T P^i_z. Then (P^AVG_h, P^AVG_z) is an ϵ-approximate mixed NE of the game with

ϵ = (ϵ_1(T) + ϵ_2(T)) / T.

There exist several algorithms, such as FTRL, FTPL, and Best Response (BR), which guarantee sub-linear regret. It is important to choose these algorithms appropriately given the domains H, Z, as our choices impact the rate of convergence to a NE and also the computational complexity of the resulting algorithm.

D.3 General Minimax Theorems (Proof of Proposition 1)

We first state the following convenient generalization of the original von Neumann minimax theorem.

Proposition 16 (Von Neumann–Fan minimax theorem, [Borwein, 2016]) Let X and Y be Banach spaces. Let C ⊆ X be nonempty and convex, and let D ⊆ Y be nonempty, weakly compact, and convex. Let g : X × Y → R be convex with respect to x ∈ C, concave and upper-semicontinuous with respect to y ∈ D, and weakly continuous in y when restricted to D. Then,

sup_{y∈D} inf_{x∈C} g(x, y) = inf_{x∈C} sup_{y∈D} g(x, y).

We now proceed to the proof of Proposition 1. Observe that a convex, compact W_n satisfies the conditions for D in the above proposition. Moreover, we take C to be the set of probability measures on Θ, which is indeed nonempty and convex. Also, our g is bilinear in Q and w, and thus convex-concave. Thus, Proposition 1 directly follows from the above result.

Relaxations. We can relax the assumption that h is parameterized by a finite-dimensional vector θ. For simpler H, the minimax result directly holds under mild assumptions. If H = {h_1, h_2, ..., h_n}, i.e. H is finite, then the original minimax theorem of von Neumann holds for arbitrary functions ℓ. If H = {h_1, h_2, ...}, i.e. H is denumerable, we further assume ℓ is a bounded loss function; then, applying Theorem 3.1 of [Wald, 1945] to a compact and convex W over the n (finite) datapoints, we can conclude that the relation holds. However, minimax theorems for more general H require other conditions such as continuity of the loss, compactness in function space, etc. See [Simons, 1995] and [Raghavan, 1994].

E Ensemble RAI Games

E.1 Alternative Definitions: Deterministic Ensembles

Alternative definitions for deterministic ensembles could be considered. For example, one could consider h_det;Q(x) = argmin_{y∈Y} E_{h∼Q} ℓ(h(x), y). [Cotter et al., 2019, Wu et al., 2022] designed other, more sophisticated strategies, but these are largely domain dependent. For reasons that will be explained later, we stick with Definition 3 in this work. For regression, a popular de-randomization strategy is to compute the expected prediction: h_det;Q(x) = E_{h∼Q}[h(x)].
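For concreteness, the following toy sketch (ours; the label set, weights, and helper names are illustrative assumptions) contrasts a plurality-vote de-randomization with the loss-based alternative mentioned above at a single input; under the 0/1 loss the two coincide, while under, e.g., a squared loss they can disagree.

```python
import numpy as np

LABELS = np.array([0, 1, 2])                 # the label set Y for this toy example

def h_det_plurality(preds, Q):
    """Weighted plurality vote: argmax_y E_{h~Q} I[h(x) = y]."""
    scores = [float(Q[preds == y].sum()) for y in LABELS]
    return LABELS[int(np.argmax(scores))]

def h_det_min_loss(preds, Q, loss):
    """Alternative de-randomization: argmin_y E_{h~Q} loss(h(x), y)."""
    risks = [float(sum(q * loss(p, y) for q, p in zip(Q, preds))) for y in LABELS]
    return LABELS[int(np.argmin(risks))]

# Predictions of three hypotheses at one input x, with ensemble weights Q.
preds = np.array([0, 0, 2])
Q = np.array([1 / 3, 1 / 3, 1 / 3])
zero_one = lambda yhat, y: float(yhat != y)
squared = lambda yhat, y: (yhat - y) ** 2    # a loss under which the two rules can differ

print(h_det_plurality(preds, Q))             # 0: label with the largest Q-mass
print(h_det_min_loss(preds, Q, zero_one))    # 0: identical to plurality under the 0/1 loss
print(h_det_min_loss(preds, Q, squared))     # 1: expected squared loss prefers the middle label
```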
E.2 Proofs

E.2.1 Proposition 2

Proof. We have

sup_{w∈Δ_n} E_w I[h_det;Q(x) ≠ y] = sup_{i∈[n]} I[y_i ≠ argmax_{y∈Y} E_Q I[h(x_i) = y]]
= I[ sup_{w∈Δ_n} E_w E_Q I[h(x) ≠ y] ≥ 1/2 ]
= I[ R̂_{W_n}(h_rand;Q) ≥ 1/2 ],

as required.

E.2.2 Proposition 3

Proof. Denote y_Q(x) = argmax_{y∈Y} P_Q(h(x) = y). Then,

R̂_{W_n}(h_det;Q) = sup_{w∈W_n} E_w ℓ(y_Q(x), y)
≤ (1/γ_Q) sup_{w∈W_n} E_w [ℓ(y_Q(x), y) P_Q(h(x) = y_Q(x))]
≤ (1/γ_Q) sup_{w∈W_n} E_w Σ_{y′∈Y} ℓ(y′, y) P_Q(h(x) = y′)
= (1/γ_Q) sup_{w∈W_n} E_w E_Q Σ_{y′∈Y} ℓ(y′, y) I[h(x) = y′]
= (1/γ_Q) sup_{w∈W_n} E_w E_Q ℓ(h(x), y)
= (1/γ_Q) R̂_{W_n}(h_rand;Q),

as required.

F Algorithms

F.1 Convergence Rates

F.1.1 Gameplay

We begin with the following lemma adapted from [McMahan, 2017] (Theorem 1).

Lemma 17 Consider the setting of Algorithm 1, and further assume that η_t ≥ η_{t−1} > 0, Reg(w) ≥ 0, and η_t Reg(w) is 1-strongly concave w.r.t. some norm ∥·∥_(t). Then for any w* ∈ W_n and any T > 0, we have:

Regret_{D,T} ≤ η_{T−1} Reg(w*) + (1/2) Σ_{t=1}^T ∥l^t∥²_{(t−1),⋆}, where l^t_i = ℓ(h_t(x_i), y_i).

Consider η_t = η and ∥·∥_(t) = √η ∥·∥_1; then

Regret_{D,T} ≤ η Reg(w*) + (1/(2η)) Σ_{t=1}^T ∥l^t∥²_∞ ≤ η D²_R + T/(2η).

Moreover, as the H-player plays BR, Regret_{H,T} ≤ 0. Using Proposition 15, we achieve an ϵ-approximate NE with

ϵ = ϵ_T ≤ (Regret_{H,T} + Regret_{D,T}) / T ≤ η D²_R / T + O(1/η).

Using the definition in Equation 7 gives us the required result.

F.1.2 Greedy

FW Update. Note that we are trying to minimize the objective L_η(Q) w.r.t. Q by the FW update. Using properties of Fenchel conjugates, it is well known that L_η(Q) is (1/η)-smooth w.r.t. ∥·∥_1. Also, the diameter of the simplex Δ_H w.r.t. ∥·∥_1 is 1. By [Jaggi, 2013] (Lemma 7), we have C_f ≤ 1/η, and thus by [Jaggi, 2013] (Theorem 1), we have:

L_η(Q_T) − min_Q L_η(Q) ≤ 2/(η(T + 2)) = O(1/(ηT)).

Using the definition of L_η(Q),

L(Q_T) − min_Q L(Q) ≤ η D²_R + O(1/(ηT)).

Gen-AdaBoost Update. This update boils down to a standard coordinate descent update on the convex and (1/η)-smooth L_η(Q) (w.r.t. ∥·∥_1). Following along the lines of the analysis in [Boyd and Vandenberghe, 2004] (Section 9.4.3),

L_η(Q_t) ≤ L_η(Q_{t−1}) − (η/2) ∥∇_Q L_η(Q_{t−1})∥²_∞.

Using this descent equation, we can follow the standard gradient descent analysis to get:

L_η(Q_T) − min_Q L_η(Q) ≤ O(1/(ηT)).

The rest of the argument goes through as above.

F.2 Closed Form Updates for different uncertainty sets

In this section, we derive closed-form updates for Equation 2 with the entropic regularizer (restated below), for the common settings mentioned in Table 1:

w_t ∈ argmax_{w∈W_n} Σ_{s=1}^{t−1} E_w ℓ(h_s(x), y) − η_{t−1} Σ_i w_i log(w_i).

W_n = {P̂_data} (Empirical Risk Minimization): the update is trivially w_t = P̂_data.

W_n = Δ_n (Worst Case Margin): w_t ∝ u^{t−1}, where u^{t−1}_i ∝ exp( Σ_{s=1}^{t−1} ℓ(h_s(x_i), y_i) / η_{t−1} ).

W_n = {w : w ∈ Δ_n, ∥w∥_∞ ≤ 1/(αn)} (α-CVaR): w_{t,i} = min( λ u^{t−1}_i, 1/(αn) ), with u^{t−1}_i ∝ exp( Σ_{s=1}^{t−1} ℓ(h_s(x_i), y_i) / η_{t−1} ) and λ chosen such that Σ_i w_{t,i} = 1. Algorithm 3 describes a projection procedure to find such a λ in O(n log n) time: set y_i ∝ exp( Σ_{s=1}^{t−1} ℓ(h_s(x_i), y_i)/η ) and v = 1/(αn); if y_i/∥y∥_1 ≤ v for all i, return w_i = y_i/∥y∥_1; otherwise, sort the y_i in decreasing order as y_(1) ≥ y_(2) ≥ ..., and for candidate thresholds c_m = v/y_(m) compute S_m = Σ_i [ v · 1{c_m y_(i) ≥ v} + c_m y_(i) · 1{c_m y_(i) < v} ]; the procedure selects the candidate for which the capped weights sum to one and returns w_i = min(c_m y_i, v).

W_n = {w : D(w∥P̂_data) ≤ ρ_n} (DRO): for general f-divergences there do not exist closed-form updates for w_t; however, the updates can still be computed numerically using FW-like updates.

W_n = {P̂_data(G_1), P̂_data(G_2), ..., P̂_data(G_K)} (Group DRO): w_t ∝ u^{t−1}, where, for i ∈ G_k, u^{t−1}_i ∝ exp( Σ_{s=1}^{t−1} (1/s_k) Σ_{j∈G_k} ℓ(h_s(x_j), y_j) / η_{t−1} ), with s_k = |G_k|.

F.3 Proof of Corollary 8

Proof. Note that Reg is 1-strongly concave w.r.t. ∥·∥_1 and ∥·∥_2, and a conservative upper bound is D²_R ≤ log(n) for all W_n ⊆ Δ_n. Thus, we can take appropriate values of η to get:

L(Q_T) ≤ min_Q L(Q) + O(√(log n / T)).

Thus, we have ϵ ≤ O(√(log n / T)) from Proposition 15.
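As a rough illustration of the closed-form updates of Section F.2 (our own sketch under the stated entropic regularizer, not the paper's released implementation; the bisection below is a simple stand-in for the O(n log n) sorting procedure of Algorithm 3), the following computes the worst-case-margin and α-CVaR weights from cumulative per-sample losses.

```python
import numpy as np

def margin_weights(cum_loss, eta):
    """Entropic-FTRL weights over the full simplex (worst-case margin): w_i ∝ exp(cum_loss_i / eta)."""
    w = np.exp((cum_loss - cum_loss.max()) / eta)   # shift for numerical stability
    return w / w.sum()

def cvar_weights(cum_loss, eta, alpha, iters=100):
    """Same exponential weights projected onto {w in simplex : w_i <= 1/(alpha*n)}.
    The scaling lambda with sum_i min(lambda*u_i, cap) = 1 is found by bisection."""
    n = len(cum_loss)
    cap = 1.0 / (alpha * n)
    u = np.exp((cum_loss - cum_loss.max()) / eta)
    lo, hi = 0.0, cap * n / u.min()                 # at hi every coordinate is capped, so the sum is 1/alpha >= 1
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        if np.minimum(lam * u, cap).sum() < 1.0:
            lo = lam
        else:
            hi = lam
    w = np.minimum(0.5 * (lo + hi) * u, cap)
    return w / w.sum()

cum_loss = np.array([0.9, 0.1, 0.4, 0.8, 0.05, 0.3])   # cumulative losses of the current ensemble
print(margin_weights(cum_loss, eta=0.5))
print(cvar_weights(cum_loss, eta=0.5, alpha=0.5))       # weights capped at 1/(0.5 * 6) = 1/3
```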
G Generalization

G.1 Population Risk

We first present a proposition which gives an equivalent characterization of the RAI population risk.

Proposition 18 The following are equivalent characterizations of the population RAI risk:

1. R_W(h) = sup_{P : P ≪ P_data, dP/dP_data ∈ W} E_P[ℓ(h(x), y)].
2. R_W(h) = sup_{r : Z→R_+, ∫ r(z) dP_data(z) = 1, r ∈ W} E_{P_data}[r(z) ℓ_h(z)].

Proof. The equivalence between (1) and (2) follows by reparameterizing P in (1) as dP(z) = r(z) dP_data(z), for some r(z) ≥ 0.

Now suppose the uncertainty set W is as defined in Equation (5). The next proposition uses duality to derive an equivalent characterization of the RAI risk in this setting.

Proposition 19 Suppose the uncertainty set W is as specified in Equation (5). Then for any hypothesis h, the population RAI risk can be equivalently written as

R_W(h) = inf_{λ≥0, τ} E_{P_data}[G*_λ(ℓ_h(z) − τ)] + Σ_{i=1}^m λ_i c_i + τ,   (8)

where G*_λ is the Fenchel conjugate of G_λ(t) = Σ_{i=1}^m λ_i g_i(t).

Proof. We rely on duality to prove the proposition. First observe that the population RAI risk can be rewritten as

R_W(h) = sup_{r : Z→R_+, ∫ r(z) dP_data(z) = 1, r ∈ W} E_{P_data}[r(z) ℓ_h(z)]
(a)= sup_{r : Z→R_+} inf_{λ≥0, τ} E_{P_data}[r(z) ℓ_h(z)] + τ (1 − ∫ r(z) dP_data(z)) + Σ_{i=1}^m λ_i (c_i − ∫ g_i(r(z)) dP_data(z)).

Since the above objective is concave in r and linear in (λ, τ), we rely on Lagrangian duality to rewrite it as R_W(h) = inf_{λ≥0, τ} sup_{r : Z→R_+} L(r, λ, τ), where L(r, λ, τ) is defined as:

L(r, λ, τ) = E_{P_data}[ r(z)(ℓ_h(z) − τ) − Σ_{i=1}^m λ_i g_i(r(z)) ] + Σ_{i=1}^m λ_i c_i + τ.

Recall the interchangeability theorem: inf_{r∈H} ∫ F(r(z), z) p(z) dz = ∫ inf_{t∈R} F(t, z) p(z) dz, so long as the space H is decomposable. Since in our case we are working with the set L_1(Z, P) of P-integrable functions r : Z → R, which is decomposable, we can apply the interchangeability theorem to get:

sup_{r : Z→R_+} E_{P_data}[ r(z)(ℓ_h(z) − τ) − Σ_{i=1}^m λ_i g_i(r(z)) ] = E_{P_data}[ sup_{t≥0} { t(ℓ_h(z) − τ) − Σ_{i=1}^m λ_i g_i(t) } ] = E_{P_data}[ G*_λ(ℓ_h(z) − τ) ],

where G_λ(t) = Σ_{i=1}^m λ_i g_i(t) and G*_λ is its Fenchel conjugate, so that:

R_W(h) = inf_{λ≥0, τ} Σ_{i=1}^m λ_i c_i + τ + E_{P_data}[ G*_λ(ℓ_h(z) − τ) ].

We have the following properties of the Fenchel conjugate G*_λ, which follow from the properties of Fenchel conjugates described in Rockafellar [1970].

Lemma 20 Consider the setting of Proposition 19. The Fenchel conjugate G*_λ is convex, differentiable, and an increasing function satisfying dG*_λ(x)/dx ≥ 0 for all x ∈ R.

Proof of Proposition 9. For the sake of clarity, we first restate Proposition 9 below.

Proposition 21 (Bayes optimal classifier) Consider the problem of binary classification where Y = {−1, +1}. Suppose ℓ(h(x), y) = ϕ(yh(x)) for some ϕ : R → [0, ∞) which is either the 0/1 loss, or a convex loss function that is differentiable at 0 with ϕ′(0) < 0. Suppose the uncertainty set W is as specified in Equation (5). Moreover, suppose {g_i}_{i=1,...,m} are convex and differentiable functions. Then, the vanilla Bayes optimal classifier is also a RAI-Bayes optimal classifier.

Proof. Following Proposition 19, it is easy to see that the RAI-Bayes optimal classifier is the minimizer of the following problem:

inf_h inf_{λ≥0, τ} Σ_{i=1}^m λ_i c_i + τ + E_{P_data}[ G*_λ(ϕ(yh(x)) − τ) ],

where the minimization over h is over the set of all classifiers. For any fixed (λ, τ), we now show that the classifier h that minimizes the above optimization problem is a vanilla Bayes optimal classifier. First note that the above optimization problem, for a fixed (λ, τ), can be rewritten as

inf_h E_{P_data}[ G*_λ(ϕ(yh(x)) − τ) ].

Using the interchangeability theorem, we can further rewrite this as

E_{P^x_data}[ inf_{u∈{−1,+1}} E_{P_data(·|x)}[ G*_λ(ϕ(uy) − τ) ] ].
Here, P^x_data is the marginal distribution of P_data over x, and P_data(·|x) is the distribution of y conditioned on x. Now suppose ϕ is the 0/1 loss:

ϕ(x) = 0 if x > 0, and 1 otherwise.

Recall that G*_λ is an increasing function (see Lemma 20), so G*_λ(−τ) ≤ G*_λ(1 − τ). Using this, it is easy to see that for any x, the following is a minimizer of inf_{u∈{−1,1}} E_{P_data(·|x)}[G*_λ(ϕ(uy) − τ)]:

u = 1 if P_data(y = 1|x) ≥ 1/2, and u = −1 otherwise.

This shows that the vanilla Bayes optimal classifier is a minimizer of the population RAI risk.

Now suppose ϕ : R → [0, ∞) is convex and differentiable at 0 with ϕ′(0) < 0. Moreover, suppose h : X → R is a real-valued classifier. In this case, the RAI-Bayes optimal classifier is a minimizer of the following objective:

inf_{u∈R} E_{P_data(·|x)}[ G*_λ(ϕ(uy) − τ) ].

Let ι(x) = G*_λ(ϕ(x) − τ). It is easy to see that ι(x) is convex: ι′(x) = (G*_λ)′(ϕ(x) − τ) ϕ′(x) is an increasing function, which follows from the fact that G*_λ is convex with non-negative gradients. Moreover, ι′(0) = (G*_λ)′(ϕ(0) − τ) ϕ′(0) ≤ 0. Then, Bartlett et al. [2006], Tewari and Bartlett [2007] show that for any x, the following u is a minimizer of the inner optimization problem: u > 0 if P_data(y = 1|x) ≥ 1/2, and u < 0 otherwise. This shows that the vanilla Bayes optimal classifier is a minimizer of the population RAI risk.

G.2 Generalization Guarantees

G.2.1 Proof of Proposition 11

Proposition 11 directly follows from Proposition 19.

G.2.2 Proof of Theorem 12

We first present a key concentration result used in the proof.

Lemma 22 (Hoeffding bound [Wainwright, 2019]) Suppose that the random variables {X_i}_{i=1}^n are independent with means µ_i and bounded between [a, b]. Then for any t ≥ 0, we have

P( |(1/n) Σ_{i=1}^n (X_i − µ_i)| ≥ t ) ≤ 2 exp( −2nt² / (b − a)² ).

We now proceed to the proof of the theorem. Following Proposition 11, we know that the population and empirical RAI risks of a classifier h can be written as

R_W(h) = inf_{λ≥0, τ} Σ_{i=1}^m λ_i c_i + τ + E_{P_data}[G*_λ(ℓ_h(z) − τ)],
R̂_{W_n}(h) = inf_{λ≥0, τ} Σ_{i=1}^m λ_i c_i + τ + E_{P̂_data}[G*_λ(ℓ_h(z) − τ)].

Our goal is to bound the following quantity for any given h:

|R_W(h) − R̂_{W_n}(h)| ≤ sup_{(λ,τ)∈E, λ : min_i λ_i ≥ Λ} | E_{P_data}[G*_λ(ℓ_h(z) − τ)] − E_{P̂_data}[G*_λ(ℓ_h(z) − τ)] |.

The rest of the proof focuses on bounding the RHS of the above equation. The overall idea is to first provide a high-probability bound on the RHS for any given (λ, τ), and then take a union bound over all feasible (λ, τ)'s by constructing an appropriate ϵ-net.

Fixed λ, τ. Observe that G*_λ(ℓ_h(z) − τ) is bounded and satisfies G*_λ(−τ) ≤ G*_λ(ℓ_h(z) − τ) ≤ G*_λ(B − τ). This follows from the fact that G*_λ is an increasing function (see Lemma 20), and that ℓ_h is bounded between 0 and B. From the Hoeffding bound we know that for any fixed h ∈ H, the following holds with probability at least 1 − 2e^{−t}:

| E_{P_data}[G*_λ(ℓ_h(z) − τ)] − E_{P̂_data}[G*_λ(ℓ_h(z) − τ)] | ≤ Ḡ √(t/n).

Union bound over λ, τ. Define the set E′ as

E′ := {(λ, τ) : λ ≥ 0, min_i λ_i ≥ Λ} ∩ E.   (9)

Let N(E′, ϵ, ∥·∥_2) be an ϵ-net over E′ in the ℓ_2 norm. It is well known that there exists such a set whose size is upper bounded by [Wainwright, 2019]

|N(E′, ϵ, ∥·∥_2)| ≤ O( ((Λ + T)/ϵ)^{m+1} ).

For any (λ, τ) ∈ E′, let (λ_ϵ, τ_ϵ) be an element of N(E′, ϵ, ∥·∥_2) that is ϵ-close to (λ, τ). Now consider the following:

sup_{(λ,τ)∈E′} | E_{P_data}[G*_λ(ℓ_h(z) − τ)] − E_{P̂_data}[G*_λ(ℓ_h(z) − τ)] |
≤ sup_{(λ,τ)∈N(E′,ϵ,∥·∥_2)} | E_{P_data}[G*_λ(ℓ_h(z) − τ)] − E_{P̂_data}[G*_λ(ℓ_h(z) − τ)] |
+ sup_{(λ,τ)∈E′} | E_{P_data}[G*_λ(ℓ_h(z) − τ)] − E_{P_data}[G*_{λ_ϵ}(ℓ_h(z) − τ_ϵ)] |
+ sup_{(λ,τ)∈E′} | E_{P̂_data}[G*_λ(ℓ_h(z) − τ)] − E_{P̂_data}[G*_{λ_ϵ}(ℓ_h(z) − τ_ϵ)] |.

Since G*_λ is L-Lipschitz, the last two terms on the RHS above can be upper bounded by Lϵ.
Substituting this in the above equation, we get

sup_{(λ,τ)∈E′} | E_{P_data}[G*_λ(ℓ_h(z) − τ)] − E_{P̂_data}[G*_λ(ℓ_h(z) − τ)] |
≤ sup_{(λ,τ)∈N(E′,ϵ,∥·∥_2)} | E_{P_data}[G*_λ(ℓ_h(z) − τ)] − E_{P̂_data}[G*_λ(ℓ_h(z) − τ)] | + 2Lϵ
≤(a) Ḡ √(t/n) + 2Lϵ,

where (a) follows from Equation (9) and holds with probability at least 1 − O( ((Λ + T)/ϵ)^{m+1} ) e^{−t}. Choosing ϵ = Ḡ √(t/n), we get the desired result.

G.3 Proof of Corollary 13

The proof follows from a standard covering number argument. For any h ∈ H, let h_ϵ be the point in the ϵ-net that is closest to h. Then we have

sup_{h∈H} |R_W(h) − R̂_{W_n}(h)| ≤ sup_{h∈N(H,ϵ_n,L_∞(Z))} |R_W(h) − R̂_{W_n}(h)| + sup_{h∈H} |R_W(h) − R_W(h_ϵ)| + sup_{h∈H} |R̂_{W_n}(h) − R̂_{W_n}(h_ϵ)|.

Observe that the last two terms above are bounded by ϵ_n. Also observe that the first term on the RHS can be upper bounded by 10 n^{−1/2} Ḡ √(t + m log(nL)) with probability at least 1 − 2 N(H, ϵ_n, L_∞(Z)) e^{−t}. Combining these two and substituting the value of ϵ_n gives us the required result.

H Algorithms: Further Discussion

H.1 Equivalence Conditions for RAI Algorithms (1 and 2)

Proposition 23 Assume that we set α_t = 1/t and perform the coordinate descent update in Algorithm 2 (FW), i.e. G_t = argmin_{h∈H} ⟨h, ∇_Q L_η(Q_{t−1})⟩; then the update is equivalent to the update given by Algorithm 1 with η_t = ηt.

Proof. From Equation 3, and using the fact that L_η(Q) can be written as a Fenchel conjugate, we know that

∇_Q L_η(Q_{t−1}) = argmax_{w∈W_n} E_{h∼Q_{t−1}} E_w ℓ(h(x), y) + η Reg(w)
= argmax_{w∈W_n} Σ_{s=1}^{t−1} E_w ℓ(h_s(x), y) + ηt Reg(w), since α_t = 1/t.

This matches Equation 2 with η_{t−1} = ηt. Moreover,

G_t = argmin_{h∈H} E_{w_t} ℓ(h(x), y), where w_t = ∇_Q L_η(Q_{t−1}).

This corresponds to the update for h_t in Algorithm 1. Thus, we have our equivalence.

H.2 Weak Learning Conditions

For the well-known scenario of binary classification with the zero-one loss, we recover the quasi-AdaBoost weak learning condition:

Proposition 24 Consider the scenario of binary classification with ℓ the zero-one loss. If the H-player only plays an approximate best response strategy, i.e. h_t satisfies E_{w_t} ℓ(h_t(x), y) ≤ 1/2 − γ for some γ > 0, then R̂_{W_n}(h_det;Q_T) = 0 for T > T_0, for some large enough T_0.

Proof. Since the D-player uses a regret-optimal strategy, we have that:

(1/T) Σ_{t=1}^T E_{w_t} ℓ(h_t(x), y) ≥ max_{w∈W_n} (1/T) Σ_{t=1}^T E_w ℓ(h_t(x), y) − ϵ_T,

while from the approximate-BR condition we have that:

(1/T) Σ_{t=1}^T E_{w_t} ℓ(h_t(x), y) ≤ 1/2 − γ,

so that we have:

max_{w∈W_n} (1/T) Σ_{t=1}^T E_w ℓ(h_t(x), y) ≤ 1/2 − γ + ϵ_T,

so that for T > T_0 large enough that ϵ_T < γ/2, we have:

max_{w∈W_n} (1/T) Σ_{t=1}^T E_w ℓ(h_t(x), y) < 1/2 − γ/2.

As Q_T assigns mass 1/T to each of {h_t}_{t=1}^T, we have:

R̂_{W_n}(h_rand;Q_T) = max_{w∈W_n} E_{Q_T} E_w ℓ(h(x), y) < 1/2 − γ/2 ⟹ R̂_{W_n}(h_det;Q_T) = 0 (from Proposition 2).

Hence, we see that h_det;Q_T incurs zero error. For the general setting, we have a slightly stronger weak learning condition, which follows from the analysis of the Frank-Wolfe update [Freund and Grigas, 2014, Jaggi, 2013].

Proposition 25 Consider Algorithm 2 with the FW update and Reg(·) a 1-strongly concave regularizer w.r.t. ∥·∥_1. If G_t satisfies ⟨G_t, ∇_Q L_η(Q_{t−1})⟩ ≤ min_Q ⟨Q, ∇_Q L_η(Q_{t−1})⟩ + δ_t, where {δ_t} is the sequence of approximation errors with δ_t ≥ 0, then:

1. If δ_t ≤ ϵ/(η(t+2)), i.e. decaying errors, then L_η(Q_T) ≤ min_Q L_η(Q) + 2(1+ϵ)/(η(T+2)).
2. If δ_t ≤ ϵ, i.e. constant errors, then L_η(Q_T) ≤ min_Q L_η(Q) + 2/(η(T+2)) + ϵ.

Proof. Note that we are trying to minimize the objective L_η(Q) w.r.t. Q by the FW update. Using properties of Fenchel conjugates, it is well known that L_η(Q) is (1/η)-smooth w.r.t. ∥·∥_1. Also, the diameter of the simplex Δ_H w.r.t. ∥·∥_1 is 1.

1. By [Jaggi, 2013] (Lemma 7, Theorem 1), we have C_f ≤ 1/η, and thus:

L_η(Q_T) − min_Q L_η(Q) ≤ 2(1+ϵ)/(η(T+2)).

2.
By [Freund and Grigas, 2014] (Theorem 5.1), in the case of approximation errors the FW/optimality gap converges as before, along with a convex combination of the errors at each time step; i.e., if we are able to solve the linear optimization problem within a constant error, then these errors do not accumulate. Moreover, the convex combination can be bounded by the maximum possible error, and we get

L_η(Q_T) − min_Q L_η(Q) ≤ 2/(η(T+2)) + ϵ.

I Experiments

RAI games constitute an optimization paradigm that goes beyond traditional approaches such as distributionally robust optimization, fairness, and worst-case performance. We have seen that for specific uncertainty sets W, RAI games optimize well-established robust optimization objectives. As such, the purpose of our experiments is to demonstrate the practicality and generality of our proposed strategies, rather than to establish state-of-the-art performance over baselines. Given the large number of possible W, we do not attempt an exhaustive empirical analysis. Instead, we underscore the plug-and-play nature of RAI games.

I.1 Setup

Subpopulation Shift. A prevalent scenario in machine learning involves subpopulation shift, necessitating a model that performs effectively on the data distribution of each subpopulation (or domain). We explore the following variations of this setting:

Domain Oblivious (DO). Recent work [Hashimoto et al., 2018, Lahoti et al., 2020, Zhai et al., 2021a] studies the domain-oblivious setting, where the training algorithm lacks knowledge of the domain definition. In this case, approaches like α-CVaR and χ²-DRO aim to maximize performance over a general notion of the worst-off subpopulation.

Domain Aware (DA). Several prior works [Sagawa et al., 2019] have investigated the domain-aware setting, in which all domain definitions and memberships are known during training.

Partially Domain-Aware (PDA). More realistically, real-world applications usually involve multiple domain definitions, some of which may be known during training while others remain unknown. The model must then perform well on instances from all domains, regardless of whether their definition is known. This setting is challenging, as it necessitates the model to learn both domain-invariant and domain-specific features and to generalize well to new instances from unknown domains.

I.2 Further Details for Section 7

Base Learners. We use linear classifiers, WRN-28-1, and WRN-28-5 [Zagoruyko and Komodakis, 2016] as base classifiers for COMPAS, CIFAR-10, and CIFAR-100, respectively. To get a sense of performance improvements, we also benchmark performance with larger models, namely a three-hidden-layer neural network for COMPAS and WRN-34-10 for CIFAR-10/100.

Proposed Methods. This paper introduces two categories of algorithms. We elect not to present results for Algorithm 1, which we observe to have performance similar to Algorithm 2. Conversely, we provide in-depth experimental analyses for both updates of Algorithm 2, which warrant special attention due to their relation to AdaBoost. In this section, we refer to the FW and Gen-AdaBoost updates as RAI-FW and RAI-GA, respectively. Our implemented versions incorporate a few alterations: 1. We track the un-regularized objective value from Equation 1 on the validation set. If it increases at any round t, we increase the regularization factor η by a fixed multiple (specifically, 2). We notice that this leads to better generalization performance on the test set. 2.
The same un-regularized objective w.r.t. the normalized Q_t is also used to perform a line search for the step size α. For the FW update, our search space is a ball around 1/t at round t, while for the GA update, we search within the range (0, 1).

Training. We use SGD with momentum 0.9 for optimization. We first warm up the model with a predefined number of ERM epochs (3 for COMPAS and 20 for CIFAR-10/100), followed by a maximum of T = 5 base models trained from the warm-up model with sample weights provided by our algorithms. Each base model is trained for 500 iterations on COMPAS and 2000 iterations on CIFAR-10/100. Each experiment is run three times with different random seeds. For evaluation, we report the averaged expected and worst-case test loss from Equation 1.

Datasets. We conduct our experiments on three real-world datasets. COMPAS [Angwin et al., 2016] pertains to recidivism prediction, with the target being whether an individual will re-offend within two years; this dataset is extensively used in fairness research. We randomly sample 70% of the instances for the training data (with a fixed random seed), and the remainder is used for validation/testing. CIFAR-10 and CIFAR-100 are widely used image datasets. For CIFAR-10, we consider two settings: the original set and an imbalanced split [Jin et al., 2021, Qi et al., 2021]. In the imbalanced split, we make worst-case performance more challenging by randomly sampling each category at a different ratio. To be precise, we sample the i-th class with sampling ratio ρ_i, where ρ = {0.804, 0.543, 0.997, 0.593, 0.390, 0.285, 0.959, 0.806, 0.967, 0.660}. For these datasets, we use the standard training and testing splits, reserving 10% of the training samples as validation data.

Hyperparameters. For COMPAS, we warm up for 3 epochs and then train every base classifier for 500 iterations. For CIFAR-10 and CIFAR-100, we warm up the models for 20 epochs and train base classifiers for 2000 iterations. The mini-batch size is set to 128. It should be noted that the primary aim of our experiments is not hyperparameter tuning. The experiments in this paper are designed to demonstrate use cases and compare different algorithms. Hence, while we maintain consistency of hyperparameters across all experiments, we do not extensively tune them for optimal performance.

I.3 Interesting Observation: Boosting Robust Learners

Table 4: Mean and worst-case expected loss for RAI-FW combined with robust optimization algorithms.

Algorithm | COMPAS (Average / Worst Group) | CIFAR-10 Imbalanced (Average / Worst Class) | CIFAR-10 (Average / Worst Class) | CIFAR-100 (Average / Worst Class)
SGD (χ²) | 32.0 / 33.7 | 13.3 / 31.7 | 11.3 / 24.7 | 27.4 / 65.9
RAI-FW + SGD (χ²) | 30.9 / 32.2 | 13.6 / 31.0 | 11.2 / 23.8 | 27.6 / 63.8
Online GDRO | 31.7 / 32.2 | 13.1 / 26.6 | 11.2 / 21.7 | 27.3 / 57.0
RAI-FW + Online GDRO | 31.6 / 33.3 | 12.9 / 24.4 | 11.4 / 19.5 | 27.8 / 51.2

Boosting Robust Base Learners. We conclude our results with one interesting observation. Until now, we have been comparing our ensembles with deterministic models; given the inherent differences between the two, we acknowledge that making a fair comparison is challenging. However, we find that our setup can "boost" not only ERM but also other robust base learners, i.e., if we use these robust optimization methods to find our base learners under analogous RAI constraints, we are able to further enhance the robust performance of these algorithms. The results are shown in Table 4.
We hypothesize that individually robust base learners help the ensemble generalize well, allowing our approach to optimize further through ensembles.

I.4 Synthetic Datasets

In this section, we use synthetic datasets to illustrate how our RAI algorithms converge, and how different constraints on W translate into performance across various responsible metrics. We use the following distributions to construct the datasets, and use class labels as the group labels.

Dataset-I: P(X|Y=0) = N((0, 0), I), P(X|Y=1) = (1/3) N((−3, 1), I) + (1/3) N((3, 0), I) + (1/3) N((0, 3), I), with P(Y=0) = 0.7 and P(Y=1) = 0.3.

Dataset-II: P(X|Y=0) = (5/12) N((−2, −2), 0.5I) + (2/12) N((−2, 2), 0.5I) + (5/12) N((2, 2), 0.5I), P(X|Y=1) = (2/5) N((−3, 0), 0.3I) + (3/5) N((3, 0), 0.3I), with P(Y=0) = 0.7 and P(Y=1) = 0.3.

We sample 1000 points each for training and testing from both distributions. Note that these datasets deliberately exhibit: 1. class imbalance (particularly in Dataset-I); 2. multiple minority sub-populations (within and between classes); 3. varying noise levels in the sub-populations (predominantly in Dataset-II). Such characteristics are frequently encountered in real-world scenarios and demand responsible classifiers.

Models. For base learners, we use linear classifiers for Dataset-I and neural network classifiers with a single hidden layer of size 4 and ReLU activations for Dataset-II. We find that base learners can be models of varying complexity.

Hyperparameters. Due to the limited size of the datasets, we forgo the warm-up stage. At every round, we run 1000 iterations with a mini-batch size of 32. We run α-LPBoost with the default value η = 1. For α-CVaR experiments, we take α = 0.7 across all experiments. For lower values of α, we observe similar results and comparisons, albeit with a substantial reduction in average metrics. Consequently, we opt for conservative values to standardize average performance across all models and subsequently compare worst-case performance in responsible settings.

I.4.1 Results and Discussion

Domain Oblivious (DO). To begin, we run ERM, AdaBoost, α-LPBoost, and RAI games on Dataset-I. For RAI-GA and RAI-FW, we use the α-CVaR uncertainty set as W. Given the class imbalance, Y=0 and Y=1 represent good candidates for subpopulations of interest. The results are reported in Figure 1. We immediately observe the following: both proposed methods, RAI-FW and RAI-GA, effectively decrease the objective value and achieve lower worst-class classification loss than both ERM and AdaBoost. They closely follow the α-LPBoost iterates; intuitively, our quasi-boosting updates resemble α-LPBoost for the CVaR objective, and that is reflected in similar objective values.

Domain Aware (DA). For this setting, we run ERM, AdaBoost, Online GDRO, RAI-GA, and RAI-FW on Dataset-II. We use Group DRO over the five Gaussian groups as the uncertainty set W. Although Dataset-II was selected due to the presence of more pronounced subpopulation behavior, we get similar results for Dataset-I as well. The results are reported in Figure 2 and Table 5.

Partially Domain-Aware (PDA). For this setting, we run our algorithms on Dataset-II. Similar to the Gaussian memberships, the class labels Y provide another, secondary definition of implicit grouping in the dataset. We report the results in Table 5. A critical observation from the DA setting is that Online GDRO and RAI (Group) both exhibit inferior performance according to the secondary class definition, i.e.,
although they optimize for the known groups (Gaussians), they fail to optimize for the unknown groups (class labels). Thus, a natural solution is to run RAI updates over the intersection of the χ² (for unknown groups) and Group (for known groups) constraints. As seen in Table 5, both RAI-GA and RAI-FW achieve a middle ground by significantly improving worst-case performance for both known and unknown groups.

Synthetic Dataset-I (Figure 1, left):
Algorithm | Average Loss | Worst Class Loss
ERM | 27.3 | 82.6
AdaBoost | 26.5 | 27.7
α-LPBoost | 23.5 | 23.7
RAI-GA | 23.9 | 24.4
RAI-FW | 23.9 | 24.2

Figure 1: Results for Dataset-I. Left: average and worst-class losses for baselines and the proposed RAI updates (table above). Right: α-CVaR objective values vs. the number of rounds T, for train and test splits, for ERM, AdaBoost, LPBoost, RAIGame-FW, and RAIGame-GA.

Figure 2: Results for Dataset-II. Left: visualization of the classifiers learned by ERM, AdaBoost, Online GDRO, RAI-GA, and RAI-FW; ensembles are de-randomized using Definition 3. RAI methods (and Online GDRO) effectively prioritize minority and noisy instances (orange). Right: test Group DRO objective values vs. the number of rounds (train values are similar and omitted to improve figure clarity). Quantitative results are reported in Table 5.

Table 5: Average, worst-group, and worst-class losses for Synthetic Dataset-II. All the algorithms listed after Online GDRO have access to group information.

Algorithm | Group 1 | Group 2 | Group 3 | Group 4 | Group 5 | Worst Group | Average | Worst Class
ERM | 0.0 | 3.1 | 15.2 | 22.9 | 1.6 | 22.9 | 5.5 | 5.7
RAI-GA (χ²) | 3.1 | 5.6 | 10.1 | 13.3 | 2.5 | 13.3 | 5.3 | 5.9
RAI-FW (χ²) | 2.1 | 4.7 | 8.9 | 13.4 | 2.9 | 13.4 | 4.9 | 5.1
Online GDRO | 5.1 | 3.7 | 5.8 | 10.2 | 5.6 | 10.2 | 6.1 | 6.2
RAI-GA (Group) | 5.0 | 4.2 | 7.4 | 10.3 | 3.2 | 10.3 | 5.4 | 5.7
RAI-FW (Group) | 10.0 | 4.7 | 6.5 | 9.5 | 5.5 | 10.0 | 6.9 | 7.5
RAI-GA (χ² ∩ Group) | 3.4 | 4.0 | 7.3 | 10.3 | 3.4 | 10.3 | 5.1 | 5.5
RAI-FW (χ² ∩ Group) | 4.7 | 4.9 | 9.2 | 10.6 | 2.6 | 10.6 | 5.2 | 6.1
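For completeness, worst-group and worst-class numbers of the kind reported in Tables 4 and 5 can be computed from per-sample losses as in the small sketch below (our own illustrative code with synthetic inputs, not the paper's evaluation script).

```python
import numpy as np

def per_group_loss(sample_loss, group_ids):
    """Average loss within each group; the max over groups is the worst-group loss."""
    return {g: float(sample_loss[group_ids == g].mean()) for g in np.unique(group_ids)}

rng = np.random.default_rng(0)
sample_loss = rng.random(1000)                  # stand-in per-sample 0/1 or surrogate losses
gaussian_group = rng.integers(0, 5, 1000)       # known grouping (e.g. the five Gaussians)
class_label = rng.integers(0, 2, 1000)          # secondary grouping (class labels)

worst_group = max(per_group_loss(sample_loss, gaussian_group).values())
worst_class = max(per_group_loss(sample_loss, class_label).values())
print(sample_loss.mean(), worst_group, worst_class)
```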