# Differentially Private Worst-group Risk Minimization

Xinyu Zhou (1), Raef Bassily (2)

(1) Department of Computer Science & Engineering, The Ohio State University. (2) Department of Computer Science & Engineering and the Translational Data Analytics Institute (TDAI), The Ohio State University. Correspondence to: Xinyu Zhou.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

We initiate a systematic study of worst-group risk minimization under $(\epsilon, \delta)$-differential privacy (DP). The goal is to privately find a model that approximately minimizes the maximal risk across $p$ sub-populations (groups) with different distributions, where each group distribution is accessed via a sample oracle. We first present a new algorithm that achieves an excess worst-group population risk of $\tilde{O}\big(\sqrt{p/K} + \frac{p\sqrt{d}}{K\epsilon}\big)$, where $K$ is the total number of samples drawn from all groups and $d$ is the problem dimension. Our rate is nearly optimal when each distribution is observed via a fixed-size dataset of size $K/p$. Our result is based on a new stability-based analysis of the generalization error. In particular, we show that $\Delta$-uniform argument stability implies $\tilde{O}(\Delta + 1/\sqrt{n})$ generalization error w.r.t. the worst-group risk, where $n$ is the number of samples drawn from each sample oracle. Next, we propose an algorithmic framework for worst-group population risk minimization using any DP online convex optimization algorithm as a subroutine. Hence, we give another excess risk bound of $\tilde{O}\Big(\sqrt{\frac{\sqrt{d}}{K\epsilon}} + \sqrt{\frac{p}{K\epsilon^{2}}} + \sqrt{\frac{p}{K}}\Big)$. Assuming the typical setting of $\epsilon = \Theta(1)$, this bound is more favorable than our first bound in a certain range of $p$ as a function of $K$ and $d$. Finally, we study differentially private worst-group empirical risk minimization in the offline setting, where each group distribution is observed by a fixed-size dataset. We present a new algorithm with nearly optimal excess risk of $\tilde{O}\big(\frac{p\sqrt{d}}{K\epsilon}\big)$.

1. Introduction

Multi-distribution learning has gained increasing attention due to its close connection to robustness and fairness in machine learning. Consider $p$ groups (or sub-populations), where each group is associated with some unknown data distribution. In the multi-distribution learning paradigm, the learner tries to find a model that minimizes the maximal population risk across all groups. Let $\mathcal{D}_i$ denote the distribution of group $i \in [p]$ and let $\ell: \mathcal{W} \times \mathcal{Z} \to \mathbb{R}$ be a loss function over the parameter space $\mathcal{W}$ and data domain $\mathcal{Z}$. Our objective of minimizing the worst-group population risk can be formulated as the minimax stochastic optimization problem

$$\min_{w \in \mathcal{W}} \max_{i \in [p]} \; \big\{L_{\mathcal{D}_i}(w) := \mathbb{E}_{z \sim \mathcal{D}_i}\,\ell(w, z)\big\} \qquad (1)$$

When each group distribution $\mathcal{D}_i$ is observed by a dataset $S_i$, we also define the worst-group empirical risk minimization problem as

$$\min_{w \in \mathcal{W}} \max_{i \in [p]} \; \Big\{L_{S_i}(w) := \frac{1}{|S_i|}\sum_{z \in S_i}\ell(w, z)\Big\} \qquad (2)$$

The worst-group risk minimization problems (1) and (2) have wide applications in a variety of learning scenarios. In the regime of robust learning, the objective represents a class of problems known as group distributionally robust optimization (Group DRO) (Soma et al., 2022; Zhang et al., 2023; Sagawa et al., 2019). This formulation also captures the mismatch between distributions from different domains (or sub-populations).
In this type of scenario, instead of assuming a fixed target distribution, this formulation can be used to find a model that works well for any possible target distribution formed as a mixture of the distributions from different domains/sub-populations. From the perspective of learning with fairness, each group may represent a protected class or a demographic group. Minimizing the worst-group risk leads to the notions of good-intent fairness (Mohri et al., 2019) or min-max group fairness (Abernethy et al., 2022; Diana et al., 2021; Papadaki et al., 2022). By making sure the worst-off group performs as well as possible, the objective prevents the learner from overfitting to certain groups at the cost of others. Moreover, the objectives are also connected to other learning applications such as collaborative learning (Blum et al., 2017) and agnostic federated learning (Mohri et al., 2019), where the output model is optimized to perform well across multiple distributions.

Meanwhile, machine learning algorithms typically depend on large volumes of data, which may pose serious privacy risks, particularly in privacy-sensitive applications like healthcare and finance. Therefore, it is important that we provide a strong and provable privacy guarantee for such algorithms to ensure that sensitive information remains private during the learning process. Although there has been significant progress in understanding worst-group risk minimization in the non-private setting (Soma et al., 2022; Haghtalab et al., 2022; Zhang et al., 2023), this problem has not been formally studied in the context of learning with provable privacy guarantees. Motivated by this, in this paper, we initiate a formal study of worst-group risk minimization under $(\epsilon, \delta)$-differential privacy (DP). We consider the setting where the loss function is convex and Lipschitz over a compact parameter space $\mathcal{W} \subseteq \mathbb{R}^d$. For the sample access model, each group distribution is accessed via a sample oracle, and the learner is able to sample new data points from any group distribution via its sample oracle as needed during the learning process.

1.1. Contribution

As mentioned earlier, we let $p$ denote the number of groups, $K$ the total number of samples drawn from all groups, and $d$ the problem dimension. We first provide a new algorithm, adapted from the Phased ERM approach in (Feldman et al., 2020), with excess worst-group population risk of $\tilde{O}\big(\sqrt{p/K} + \frac{p\sqrt{d}}{K\epsilon}\big)$. The first term in this bound matches the optimal non-private bound (Soma et al., 2022), and the second term represents the cost of privacy. Our rate is optimal up to logarithmic factors in the offline setting where each distribution is observed via a fixed-size dataset of size $K/p$. The utility guarantee of our algorithm relies on a new stability-based argument to establish the generalization error bound. Roughly speaking, we show that $\Delta$-uniform argument stability (see Definition 1) with respect to the parameter $w$ implies an $\tilde{O}(\Delta + 1/\sqrt{n})$ generalization error, where $n$ is the number of samples drawn from each sample oracle. This is distinct from existing generalization results based on uniform convergence (Mohri et al., 2019; Abernethy et al., 2022), which lead to suboptimal rates for general convex losses. On the other hand, our result also circumvents a known bottleneck in deriving generalization guarantees from argument uniform stability in stochastic saddle-point problems.
In particular, in the $\ell_2/\ell_2$ setting, the best known generalization guarantee for a $\Delta$-uniform argument stable algorithm is a generalization error bound obtained via the analysis of the strong duality gap (Ozdaglar et al., 2022). Our algorithm is akin to the Phased ERM approach of (Feldman et al., 2020), which was introduced to solve private stochastic convex optimization. We repurpose this approach for private stochastic minimax optimization. In particular, we define a sequence of regularized minimax problems, which are solved iteratively using any generic non-private solver. Our version of this algorithm requires different tuning of parameters across iterations and, more crucially, a new analysis that utilizes the uniform argument stability of the solutions of the regularized minimax problems. As a side product, our stability result naturally leads to a new non-private regularization-based method that matches the non-private lower bound in (Soma et al., 2022) up to logarithmic factors. This may be of independent interest because existing non-private methods with (nearly) optimal rates rely on a single-pass first-order method rather than regularization.

Next, we give a framework to minimize the worst-group population risk using black-box access to any DP online convex optimization (DP-OCO) algorithm as a subroutine. Our framework leverages the idea in (Soma et al., 2022; Haghtalab et al., 2022) of casting the worst-group risk minimization objective as a zero-sum game and decomposing the excess worst-group risk into the expected regret of both players. By instantiating our framework with the DP-FTRL algorithm from (Kairouz et al., 2021), we obtain another bound on the excess worst-group population risk of $\tilde{O}\Big(\sqrt{\frac{\sqrt{d}}{K\epsilon}} + \sqrt{\frac{p}{K\epsilon^{2}}} + \sqrt{\frac{p}{K}}\Big)$. Assuming the typical setting for the privacy parameter $\epsilon$, namely $\epsilon = \Theta(1)$, this bound is more favorable than our bound based on the Phased ERM approach in a certain range of $p$ as a function of $K$ and $d$, particularly when $p \gtrsim \max\{\sqrt{d}, K/d\}$.

Finally, we investigate the worst-group empirical risk minimization problem (2) in the offline setting, where each distribution $\mathcal{D}_i$ is observed by a fixed-size dataset $S_i$. We present a new algorithm based on a private version of the multiplicative group reweighting scheme (Abernethy et al., 2022; Diana et al., 2021). We show that our algorithm achieves a nearly optimal excess worst-group empirical risk of $\tilde{O}\big(\frac{p\sqrt{d}}{K\epsilon}\big)$.

1.2. Related Work

Worst-group risk minimization is closely related to the group DRO problem. In (Sagawa et al., 2019), the authors consider solving the empirical objective in the offline setting and propose an online algorithm based on the stochastic mirror descent (SMD) method from (Nemirovski et al., 2009). The convergence rate achieved in (Sagawa et al., 2019) is suboptimal because of the high variance of the gradient estimators. On the other hand, several works (Haghtalab et al., 2022; Soma et al., 2022; Zhang et al., 2023) have studied the group DRO problem in the stochastic oracle setting and achieve an excess worst-group population risk of $O\big(\sqrt{p/K}\big)$. The basic idea behind the methods used in these works is to cast the objective as a zero-sum game, based on the observation that the optimality gap depends on the expected regret of the max and min players. The authors in (Soma et al., 2022) demonstrate the tightness of this rate by proving a matching lower bound.
The worst-group risk minimization problem is also extensively studied under the notion of min-max group fairness. Reference (Diana et al., 2021) studies learning with min-max group fairness and proposes a multiplicative-reweighting-based method. However, their method assumes access to a weighted empirical risk minimization oracle, which may not be realistic in practice. Meanwhile, reference (Abernethy et al., 2022) studies a similar problem and presents more efficient gradient-based methods that achieve min-max group fairness based on the idea of active sampling. The problem of min-max group fairness is also studied in (Mohri et al., 2019) under the federated learning setting. However, none of these works provides privacy guarantees.

Our work is also related to the line of work on the differentially private stochastic saddle point (DP-SSP) problem. References (Yang et al., 2022; Zhang et al., 2022) present algorithms with the optimal rate for the weak duality gap, while the optimal rate for the strong duality gap is achieved in (Bassily et al., 2023). However, these works only consider objectives with $\ell_2/\ell_2$ geometry; directly applying results from existing DP-SSP works would lead to a suboptimal convergence rate due to the dependence on $p$ of the $\ell_2$ Lipschitz constant with respect to the model parameters.

2. Preliminaries

We consider $p$ groups, where each group $i \in [p]$ corresponds to a distribution $\mathcal{D}_i$. Given a parameter space $\mathcal{W} \subseteq \mathbb{R}^d$ and a loss function $\ell$, we let $\ell(w, z)$ be the loss incurred by a weight vector $w$ on a data point $z$. We assume the $\ell_2$ diameter of $\mathcal{W}$, $\|\mathcal{W}\|_2$, is bounded by $M$. Throughout this paper, we assume that the loss function $\ell$ is convex, $L$-Lipschitz, and bounded in $[0, B]$ unless stated otherwise. The population loss of distribution $\mathcal{D}_i$ is written as $L_{\mathcal{D}_i}(w) = \mathbb{E}_{z\sim\mathcal{D}_i}\ell(w, z)$. When a group distribution $\mathcal{D}_i$ is observed by a dataset $S_i$ with i.i.d. samples drawn from $\mathcal{D}_i$, we denote the empirical loss evaluated on $S_i$ as $L_{S_i}(w) = \frac{1}{|S_i|}\sum_{z\in S_i}\ell(w, z)$.

Sample oracle: A sample oracle $C_i$ w.r.t. distribution $\mathcal{D}_i$, $i\in[p]$, returns an i.i.d. sample $z$ drawn from $\mathcal{D}_i$. We denote the collection of sample oracles from all groups as $C = \{C_1, \ldots, C_p\}$.

$\alpha$-saddle point: We say $(\bar{x}, \bar{y})$ is an $\alpha$-saddle point of a minimax problem $\min_{x\in\mathcal{X}}\max_{y\in\mathcal{Y}}\phi(x, y)$ if $\phi(\bar{x}, y) - \phi(x, \bar{y}) \le \alpha$ for any $x\in\mathcal{X}$ and $y\in\mathcal{Y}$. When $\alpha = 0$, we simply call $(\bar{x}, \bar{y})$ a saddle point of $\phi$.

Worst-group population risk minimization: Given $p$ groups, each with a distribution $\mathcal{D}_i$, we define the expected (population) worst-group risk of a model $w$ as the maximal population loss of $w$ across all group distributions, written as $R(w) = \max_{i\in[p]} L_{\mathcal{D}_i}(w)$. Our objective of minimizing the worst-group risk is therefore written as

$$\min_{w\in\mathcal{W}}\max_{i\in[p]} L_{\mathcal{D}_i}(w) \qquad (3)$$

Denote by $\Delta_p = \{\lambda\in[0,1]^p : \|\lambda\|_1 = 1\}$ the probability simplex over $p$ dimensions, and let $\phi(w, \lambda) = \sum_{i=1}^p \lambda_i L_{\mathcal{D}_i}(w)$. Then the objective in (3) can be equivalently written as

$$\min_{w\in\mathcal{W}}\max_{\lambda\in\Delta_p} \phi(w, \lambda) \qquad (4)$$

The equivalence is based on the fact that for any $w\in\mathcal{W}$, $\phi(w, \lambda)$ is maximized by a $\lambda$ with all coordinates equal to zero except the one corresponding to the largest $L_{\mathcal{D}_i}(w)$. We also define the excess worst-group population risk of $w\in\mathcal{W}$ as

$$\mathcal{E}\big(w; \{\mathcal{D}_i\}_{i=1}^p\big) \triangleq \max_{i\in[p]} L_{\mathcal{D}_i}(w) - \min_{\bar{w}\in\mathcal{W}}\max_{i\in[p]} L_{\mathcal{D}_i}(\bar{w})$$

Differential Privacy (Dwork et al., 2006): A mechanism $\mathcal{M}$ is $(\epsilon, \delta)$-DP if for any two neighboring datasets $S$ and $S'$ that differ in a single data point and any output set $O$, we have

$$\mathbb{P}\big(\mathcal{M}(S)\in O\big) \le e^{\epsilon}\,\mathbb{P}\big(\mathcal{M}(S')\in O\big) + \delta$$

We focus on record-level privacy throughout the paper. We say that a pair of sequences of sampled data points $\{z_1, \ldots, z_T\}$ and $\{z'_1, \ldots, z'_T\}$ are neighbors if, for some $j\in[T]$, $z_i = z'_i$ for all $i\in[T]\setminus\{j\}$ and $z_j \neq z'_j$.
Stability: Our analysis depends on the notion of uniform argument stability, defined as follows.

Definition 1. Consider a randomized algorithm $\mathcal{A}$ that takes a collection of datasets $\vec{S}$ as input and outputs $(\mathcal{A}_w(\vec{S}), \mathcal{A}_\lambda(\vec{S}))$ as the solution to a minimax problem $\min_{w\in\mathcal{W}}\max_{\lambda}\phi(w, \lambda)$. Then $\mathcal{A}$ is said to attain $\Delta$-uniform argument stability with respect to $w$ if for any pair of adjacent collections $\vec{S}$ and $\vec{S}'$ that differ in a single data point, we have $\mathbb{E}_{\mathcal{A}}\big[\|\mathcal{A}_w(\vec{S}) - \mathcal{A}_w(\vec{S}')\|_2\big] \le \Delta$.

3. Minimax Phased ERM

In this section, we propose an algorithm attaining an excess worst-group population risk of $\tilde{O}\big(\sqrt{p/K} + \frac{p\sqrt{d}}{K\epsilon}\big)$, where $p$ and $K$ denote the number of groups and the total number of samples drawn from all groups, respectively. Our algorithm involves iteratively solving a sequence of regularized minimax objectives. In iteration $t$, given the previous iterate $w_{t-1}$ and a dataset collection $\{S_i^t\}_{i=1}^p$, where $S_i^t$ is the dataset formed by the data points drawn from the sample oracle $C_i$ at the current iteration, we find an approximate saddle point $(\tilde{w}_t, \tilde{\lambda}_t)$ of the regularized objective

$$\sum_{i=1}^p \lambda_i L_{S_i^t}(w) + \frac{\mu_w^t}{2}\|w - w_{t-1}\|_2^2 - \mu_\lambda^t \sum_{j=1}^p \lambda_j\log\lambda_j$$

and output $\tilde{w}_t$. We can show that $\tilde{w}_t$ has low sensitivity w.r.t. the collection of samples drawn from all groups, and hence we can use the standard Gaussian mechanism to ensure that the computation of $w_t$ is differentially private.

Proving the utility guarantee of our algorithm involves a new stability-based argument. First, given a dataset collection $\vec{S} = \{S_1, \ldots, S_p\}$, where each $S_i \sim \mathcal{D}_i^n$ is a dataset formed by $n$ samples drawn from the sample oracle $C_i$, we show that an algorithm that outputs an approximate saddle point $(\bar{w}, \bar{\lambda})$ of the regularized objective $F(w, \lambda) = \sum_{i=1}^p\lambda_i L_{S_i}(w) + \frac{\mu_w}{2}\|w - w'\|_2^2 - \mu_\lambda\sum_{j=1}^p\lambda_j\log\lambda_j$ has uniform argument stability of $O\big(\frac{L}{n\mu_w} + \frac{B}{n\sqrt{\mu_w\mu_\lambda}}\big)$ with respect to $w$. Then we prove that $\Delta$-uniform argument stability of $\bar{w}$ leads to an $\tilde{O}\big(\Delta + \frac{B}{\sqrt{n}}\big)$ generalization error bound. Our stability-based generalization argument is distinct from existing generalization analyses based on uniform convergence (Mohri et al., 2019; Abernethy et al., 2022), which lead to sub-optimal excess worst-group risk for general convex losses even in the non-private regime. On the other hand, our result circumvents a known bottleneck in the relationship between stability and generalization in stochastic minimax optimization (also known as stochastic saddle point (SSP) problems). In particular, in the $\ell_2/\ell_2$ setting of the SSP problem, the best known generalization guarantee using a $\Delta$-argument stable algorithm is a generalization error bound obtained via the analysis of the strong duality gap (Ozdaglar et al., 2022).

3.1. Algorithm description

Our algorithm relies on a non-private subroutine to solve a regularized minimax problem. More concretely, let $\vec{S} = \{S_1, \ldots, S_p\}$ be a collection of datasets, let $\mu_w, \mu_\lambda > 0$, and let $w'\in\mathcal{W}$. Let $\mathcal{A}_{\mathrm{emp}}(\vec{S}, \mu_w, \mu_\lambda, w', \alpha)$ be an empirical minimax solver that computes an $\alpha$-saddle point $(\bar{w}, \bar{\lambda})$ of

$$\min_{w\in\mathcal{W}}\max_{\lambda\in\Delta_p} F(w, \lambda) := \sum_{i=1}^p \lambda_i L_{S_i}(w) + \frac{\mu_w}{2}\|w - w'\|_2^2 - \mu_\lambda\sum_{j=1}^p\lambda_j\log\lambda_j \qquad (5)$$

and outputs $\bar{w}$. The formal description of our algorithm is given in Algorithm 1. Our algorithm is inspired by the Phased ERM approach of (Feldman et al., 2020), which was used to attain the optimal rate for the simpler problem of differentially private stochastic convex optimization. We repurpose the Phased ERM approach for the stochastic minimax problem.
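For intuition, the following is a minimal NumPy sketch, not the authors' implementation, of the regularized objective in (5) and a simple solver in the spirit of the SC-SC scheme in Appendix A.1: the entropy-regularized max player admits a closed-form (softmax) best response, and the $w$-player takes projected gradient steps. The function names, the toy quadratic loss, and the ball constraint standing in for $\mathcal{W}$ are illustrative assumptions.

```python
import numpy as np

def group_losses(w, datasets, loss_grad):
    """Per-group empirical losses L_{S_i}(w) and their gradients."""
    losses, grads = [], []
    for S in datasets:
        l, g = loss_grad(w, S)          # mean loss and mean gradient over S
        losses.append(l); grads.append(g)
    return np.array(losses), np.array(grads)

def regularized_saddle_point(datasets, loss_grad, w_prev, mu_w, mu_lam,
                             n_iters=500, radius=1.0):
    """Approximate saddle point of
       F(w, lam) = sum_i lam_i L_{S_i}(w) + (mu_w/2)||w - w_prev||^2
                   - mu_lam * sum_j lam_j log lam_j,
       using the entropic best response in lam and projected gradient steps in w."""
    w = w_prev.copy()
    for t in range(1, n_iters + 1):
        losses, grads = group_losses(w, datasets, loss_grad)
        # Best response of the entropy-regularized max player: a softmax.
        lam = np.exp((losses - losses.max()) / mu_lam)
        lam /= lam.sum()
        # Gradient step on the mu_w-strongly-convex w-objective.
        g = lam @ grads + mu_w * (w - w_prev)
        w = w - g / (mu_w * t)
        # Projection onto the ball of radius `radius` (our compact W).
        nrm = np.linalg.norm(w)
        if nrm > radius:
            w = w * (radius / nrm)
    return w, lam

# Toy usage with a squared loss; purely illustrative.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, p, n = 5, 3, 100
    datasets = [rng.normal(loc=i, size=(n, d)) for i in range(p)]
    def loss_grad(w, S):
        diffs = w[None, :] - S
        return 0.5 * np.mean(np.sum(diffs ** 2, axis=1)), np.mean(diffs, axis=0)
    w, lam = regularized_saddle_point(datasets, loss_grad, np.zeros(d), 0.1, 0.1)
    print(w, lam)
```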
Our algorithm involves several modifications to this approach that enable us to attain strong guarantees for the worst-group population risk minimization problem. Crucially, we provide a new analysis for our algorithm that utilizes the uniform argument stability of the solutions of the regularized minimax problems (see Lemma 4).

In Algorithm 1, a sequence of regularized empirical minimax problems is defined and solved iteratively using the non-private minimax solver $\mathcal{A}_{\mathrm{emp}}$ (defined earlier in this section). The total number of such problems (the number of rounds in the algorithm), $T$, is logarithmic in $K/p$. In round $t\in[T]$, we first sample a dataset of size $n_t$ from each distribution to construct a new dataset collection $\vec{S}^t$. The center of the regularization term is chosen to be $w' = w_{t-1}$, i.e., the previous iterate. The regularization parameters in each round are carefully chosen to ensure convergence at a nearly optimal rate: $\mu_\lambda$ is fixed across all rounds, whereas $\mu_w$ is doubled with every round.

Algorithm 1 (Minimax Phased ERM)
Input: Sample oracles $\{C_1, \ldots, C_p\}$, step size $\eta$, $D = \max\{L, B\}$, total number of samples drawn from all groups $K$, non-private empirical solver $\mathcal{A}_{\mathrm{emp}}$.
Initialize $w_0$ to an arbitrary parameter vector in $\mathcal{W}$. Set $n = K/p$ and $T = \log_2(n)$.
For $t = 1, \ldots, T$:
- Let $n_t = n/T$, $\eta_t = \eta\,2^{-t}$, $\mu_w^t = 1/(\eta_t n_t)$, $\mu_\lambda^t = 1/(\eta n)$.
- For each $i\in[p]$, sample $S_i^t \sim \mathcal{D}_i^{n_t}$ using the sample oracle $C_i$. Denote $\vec{S}^t = \{S_1^t, \ldots, S_p^t\}$.
- Let $\alpha_t = \frac{L^2}{8n_t^2\mu_w^t} + \frac{B^2}{8n_t^2\mu_\lambda^t}$.
- Let $\tilde{w}_t = \mathcal{A}_{\mathrm{emp}}(\vec{S}^t, \mu_w^t, \mu_\lambda^t, w_{t-1}, \alpha_t)$.
- Set $w_t = \tilde{w}_t + \xi_t$, where $\xi_t \sim \mathcal{N}(0, \sigma_t^2 I_d)$ and $\sigma_t = \frac{6D\sqrt{2\log_2(n)\log(1/\delta)\,\eta\eta_t}}{\epsilon}$.
Output: $w_T$.

Our main result is stated below.

Theorem 2. Algorithm 1 is $(\epsilon, \delta)$-differentially private. Moreover, for a suitable choice of the step size $\eta$,
$$\mathbb{E}\Big[\max_{i\in[p]} L_{\mathcal{D}_i}(w_T)\Big] - \min_{w\in\mathcal{W}}\max_{i\in[p]} L_{\mathcal{D}_i}(w) = O\Big(\log^{3/4}(K)\,\sqrt{\frac{p}{K}} + \log^{5/2}(K)\,\frac{p\sqrt{d\log(1/\delta)}}{K\epsilon}\Big)$$
where $w_T$ is the output of Algorithm 1 and the expectation is taken over the sampled data points and the algorithm's internal randomness.

One can show that the rate in Theorem 2 is nearly optimal in the offline setting considered in Section 5, where each distribution is observed by a fixed-size dataset of size $K/p$: a lower bound instance of $\Omega\big(\frac{p\sqrt{d}}{K\epsilon}\big)$ can be constructed in this case. More details are given in Appendix D. Also, note that when $d = O(K/p)$, the rate in Theorem 2 matches the non-private lower bound $\Omega\big(\sqrt{p/K}\big)$ up to logarithmic factors. In Appendix A.1, we give an instantiation of $\mathcal{A}_{\mathrm{emp}}$, which is an iterative algorithm for solving the regularized minimax problem in (5) with a convergence rate of $O(1/N)$, where $N$ is the number of iterations.

The privacy and utility guarantees in Theorem 2 rely on a stability-based result for the saddle point of the regularized objective (5), which we present in the following lemma.

Lemma 3. Consider two neighboring dataset collections $\vec{S} = \{S_1, \ldots, S_p\}\in\mathcal{Z}^{n\times p}$ and $\vec{S}' = \{S'_1, \ldots, S'_p\}\in\mathcal{Z}^{n\times p}$ that differ in a single data point. Let $(\bar{w}, \bar{\lambda})$ and $(\bar{w}', \bar{\lambda}')$ be $\alpha$-saddle points of $F(w, \lambda)$ in (5) when the dataset collections are $\vec{S}$ and $\vec{S}'$, respectively. If $\alpha \le \frac{L^2}{8n^2\mu_w} + \frac{B^2}{8n^2\mu_\lambda}$, then
$$\|\bar{w} - \bar{w}'\|_2 \le \frac{3}{n}\Big(\frac{L}{\mu_w} + \frac{B}{\sqrt{\mu_w\mu_\lambda}}\Big)$$

The proof of Lemma 3 follows from Lemma 2 in (Zhang et al., 2021), together with the observation that $F(w, \lambda)$ can be equivalently written in a finite-sum form composed with the regularization terms. We give full details in Appendix A.3.
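To make the phase schedule concrete, here is a minimal Python sketch (under assumptions, not the authors' code) of the outer loop of Algorithm 1. The callable `solve_reg_minimax` stands in for $\mathcal{A}_{\mathrm{emp}}$ (e.g., the sketch above), the oracles are assumed to be callables returning fresh i.i.d. samples, and the noise calibration follows Algorithm 1 with the constants treated as illustrative.

```python
import numpy as np

def minimax_phased_erm(oracles, solve_reg_minimax, K, d, eps, delta, eta, D,
                       rng=None):
    """Sketch of the Minimax Phased ERM outer loop: each phase draws fresh
    samples, solves the regularized minimax problem centered at the previous
    iterate, and privatizes the solution with Gaussian noise."""
    rng = rng or np.random.default_rng(0)
    p = len(oracles)
    n = K // p                               # per-group sample budget
    T = max(1, int(np.log2(n)))              # number of phases
    n_t = max(1, n // T)                     # fresh samples per group per phase
    w = np.zeros(d)                          # w_0: arbitrary point in W
    for t in range(1, T + 1):
        eta_t = eta * 2.0 ** (-t)
        mu_w = 1.0 / (eta_t * n_t)           # doubles every phase
        mu_lam = 1.0 / (eta * n)             # fixed across phases
        datasets = [oracle(n_t) for oracle in oracles]
        w_tilde = solve_reg_minimax(datasets, w, mu_w, mu_lam)
        # Gaussian noise calibrated to the sensitivity bound from Lemma 3.
        sigma_t = 6.0 * D * np.sqrt(2.0 * np.log2(n) * np.log(1.0 / delta)
                                    * eta * eta_t) / eps
        w = w_tilde + rng.normal(scale=sigma_t, size=d)
    return w
```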
Now we are ready to present our main technical lemma, which shows that an approximate saddle point of the regularized objective $F(w, \lambda)$ in (5) gives a good solution to the worst-group population risk minimization problem. We sketch the proof of this lemma and defer the full proof to Appendix A.4.

Lemma 4. Let $(\bar{w}, \bar{\lambda})$ be an $\alpha$-saddle point of $F(w, \lambda)$ in (5) with $\alpha \le \frac{L^2}{8n^2\mu_w} + \frac{B^2}{8n^2\mu_\lambda}$. For any $w\in\mathcal{W}$,
$$\mathbb{E}\Big[\max_{i\in[p]} L_{\mathcal{D}_i}(\bar{w})\Big] - \max_{i\in[p]} L_{\mathcal{D}_i}(w) = O\Big(\mu_w\|w - w'\|_2^2 + \mu_\lambda\log p + \Big(\frac{L^2}{n\mu_w} + \frac{LB}{n\sqrt{\mu_w\mu_\lambda}}\Big)\log(n)\log(np) + B\sqrt{\frac{\log(np)}{n}}\Big)$$
where the expectation is taken over the randomness of the datasets $\vec{S}$.

Proof (sketch). For simplicity, let $\alpha = 0$, so that $(\bar{w}, \bar{\lambda})$ is the exact saddle point of $F(w, \lambda)$. By the definition of a saddle point, one can show that for any $w\in\mathcal{W}$,
$$\max_{i\in[p]} L_{S_i}(\bar{w}) - \hat{\phi}(w, \bar{\lambda}) \le \frac{\mu_w}{2}\|w - w'\|_2^2 + \mu_\lambda\log p$$
where $\hat{\phi}(w, \lambda) = \sum_{i=1}^p\lambda_i L_{S_i}(w)$. By Lemma 3 and Lipschitzness, for any given $i\in[p]$, the computation of $\bar{w}$ has uniform stability $\gamma = \frac{3}{n}\big(\frac{L^2}{\mu_w} + \frac{LB}{\sqrt{\mu_w\mu_\lambda}}\big)$ with respect to the replacement of a data point in $S_i$. By Theorem 1.1 in (Feldman & Vondrak, 2019), with high probability over the sampling of $\vec{S}$ (suppressing logarithmic factors),
$$|L_{S_i}(\bar{w}) - L_{\mathcal{D}_i}(\bar{w})| = \tilde{O}\Big(\gamma + \frac{B}{\sqrt{n}}\Big)$$
A union bound across all $i\in[p]$ gives the same bound simultaneously for all groups with high probability. In particular, with high probability,
$$\Big|\max_{i\in[p]} L_{\mathcal{D}_i}(\bar{w}) - \max_{i\in[p]} L_{S_i}(\bar{w})\Big| = \tilde{O}\Big(\gamma + \frac{B}{\sqrt{n}}\Big)$$
Meanwhile, since $w$ is fixed, with high probability $|\phi(w, \bar{\lambda}) - \hat{\phi}(w, \bar{\lambda})| = \tilde{O}\big(\frac{B}{\sqrt{n}}\big)$. Putting everything together, with high probability,
$$\max_{i\in[p]} L_{\mathcal{D}_i}(\bar{w}) - \max_{i\in[p]} L_{\mathcal{D}_i}(w) \le \max_{i\in[p]} L_{\mathcal{D}_i}(\bar{w}) - \phi(w, \bar{\lambda}) \le \max_{i\in[p]} L_{S_i}(\bar{w}) - \hat{\phi}(w, \bar{\lambda}) + \tilde{O}\Big(\gamma + \frac{B}{\sqrt{n}}\Big) = \tilde{O}\Big(\mu_w\|w - w'\|_2^2 + \mu_\lambda\log p + \gamma + \frac{B}{\sqrt{n}}\Big)$$
Replacing $\gamma$ with $\frac{3}{n}\big(\frac{L^2}{\mu_w} + \frac{LB}{\sqrt{\mu_w\mu_\lambda}}\big)$ gives the desired result.

Remark: As a side product of Lemma 4, consider sampling a new dataset collection $\vec{S} = \{S_1, \ldots, S_p\}$ with $S_i\sim\mathcal{D}_i^{K/p}$. By choosing the regularization parameters $\mu_w$ and $\mu_\lambda$ appropriately and letting $w'$ be an arbitrary parameter in $\mathcal{W}$, solving the regularized objective (5) with $\vec{S}$ gives an excess worst-group population risk of $\tilde{O}\big(\sqrt{p/K}\big)$, which matches the non-private lower bound in (Soma et al., 2022) up to logarithmic factors. Prior methods (nearly) matching the non-private lower bound rely on making a single pass over the sampled data points to directly optimize the population objective; our regularization-based method, however, does not require such a one-pass structure.

Now, we are ready to give a proof sketch of Theorem 2. The full proof is deferred to Appendix A.5.

Proof (sketch of Theorem 2). Privacy: At iteration $t$, by the stability argument in Lemma 3, we have
$$\|\tilde{w}_t - \tilde{w}'_t\|_2 \le \frac{3L}{n_t\mu_w^t} + \frac{3B}{n_t\sqrt{\mu_w^t\mu_\lambda^t}} = 3L\eta_t + 3B\sqrt{\frac{n\,\eta\eta_t}{n_t}} \le 6D\sqrt{(\log_2 n)\,\eta\eta_t}$$
where $\tilde{w}_t$ and $\tilde{w}'_t$ are the outputs computed from neighboring dataset collections $\vec{S}^t$ and $\vec{S}'^t$ that differ in one data point. Since the sampled minibatches $\vec{S}^t$ are disjoint across iterations, the privacy of Algorithm 1 follows from the privacy guarantee of the Gaussian mechanism and parallel composition.

Utility: Recall that $R(w) = \max_{i\in[p]} L_{\mathcal{D}_i}(w)$. By Lemma 4 (suppressing logarithmic factors),
$$\mathbb{E}[R(\tilde{w}_t) - R(\tilde{w}_{t-1})] = \tilde{O}\Big(\frac{\mathbb{E}[\|\xi_{t-1}\|_2^2]}{n_t\eta_t} + \frac{1}{\eta n} + D^2\sqrt{\eta\eta_t} + \frac{B}{\sqrt{n_t}}\Big)$$
Also, we have
$$\mathbb{E}\|\xi_t\|_2^2 = d\sigma_t^2 = 72\,d\,D^2\log_2(n)\log(1/\delta)\,2^{-t}\eta^2/\epsilon^2$$
Therefore, as long as $\eta \le \frac{M\epsilon}{D\sqrt{72\,d\log_2(n)\log(1/\delta)}}$, we will have $\mathbb{E}\|\xi_t\|_2^2 \le 2^{-t}M^2$ and $\mathbb{E}\|\xi_t\|_2 \le 2^{-t/2}M$. Let $\tilde{w}_0 = w^*$ and $\xi_0 = w_0 - w^*$, so that $\|\xi_0\|_2 \le M$. Hence,
$$\mathbb{E}[R(w_T)] - R(w^*) \le \sum_{t=1}^T\mathbb{E}[R(\tilde{w}_t) - R(\tilde{w}_{t-1})] + \mathbb{E}[R(w_T) - R(\tilde{w}_T)] = \tilde{O}\Big(\sum_{t=1}^T\Big(\frac{\mathbb{E}[\|\xi_{t-1}\|_2^2]}{n_t\eta_t} + \frac{1}{\eta n} + D^2\sqrt{\eta\eta_t} + \frac{B}{\sqrt{n}}\Big)\Big) + L\,\mathbb{E}[\|\xi_T\|_2]$$
The last inequality holds since $R(\cdot)$ is $L$-Lipschitz and $n_t = n/\log_2(n)$. Moreover,
$$\sum_{t=1}^T\frac{\mathbb{E}[\|\xi_{t-1}\|_2^2]}{n_t\eta_t} \le \sum_{t=1}^T\frac{2^{-(t-1)}M^2\log_2(n)}{n\,2^{-t}\eta} = \sum_{t=1}^T\frac{2(\log_2 n)M^2}{\eta n} \le \frac{2(\log_2 n)^2M^2}{\eta n}$$
and $\mathbb{E}[\|\xi_T\|_2] \le M\,2^{-\log_2(n)/2} = \frac{M}{\sqrt{n}}$.
Putting these together,
$$\mathbb{E}[R(w_T)] - R(w^*) = \tilde{O}\Big(\frac{M^2}{\eta n} + D^2\eta + \frac{B + LM}{\sqrt{n}}\Big)$$
Replacing $n$ with $K/p$, we get the desired result.

4. Worst-group Risk Minimization using a DP-OCO Algorithm

In this section, we give a framework to directly minimize the expected worst-group loss in (4) using any DP online convex optimization (OCO) algorithm as a subroutine. Our framework leverages the idea from (Haghtalab et al., 2022; Soma et al., 2022; Zhang et al., 2023) of casting the objective as a two-player zero-sum game. One can show that the excess expected worst-group risk is bounded by the sum of the expected regrets of the min-player (the $w$-player) and the max-player (the $\lambda$-player) using stochastic gradient estimates. Therefore, given any DP-OCO algorithm $\mathcal{Q}$, our framework proceeds by letting the $w$-player run $\mathcal{Q}$ with an estimate of the loss function $\phi(w, \lambda_t)$. Meanwhile, we instantiate the $\lambda$-player as EXP3 (Auer et al., 2002), an adversarial multi-armed bandit algorithm, and feed it a privatized estimator of $\nabla_\lambda\phi(w_t, \lambda_t)$.

An online convex optimization algorithm interacts with an adversary for $T$ rounds. In each round, the algorithm picks a vector $w_t$ and then the adversary chooses a convex loss function $\ell_t$. The performance of an OCO algorithm $\mathcal{Q}$ is therefore measured by the regret, defined as
$$\mathrm{Regret}(T) = \sum_{t=1}^T\ell_t(w_t) - \min_{w\in\mathcal{W}}\sum_{t=1}^T\ell_t(w)$$

Definition 5. An online algorithm $\mathcal{M}$ is $(\epsilon, \delta)$-DP if for any two sequences of loss functions $S = (\ell_1, \ldots, \ell_T)$ and $S' = (\ell'_1, \ldots, \ell'_T)$ that differ in one loss function, and any output set $O$, we have $\mathbb{P}(\mathcal{M}(S)\in O) \le e^{\epsilon}\mathbb{P}(\mathcal{M}(S')\in O) + \delta$.

Suppose we have a DP-OCO algorithm $\mathcal{Q}$ with regret $r_{\mathcal{Q}}(T)$ that observes $\ell_t = \ell(\cdot, x_t)$ for some data point $x_t$ in each round. The formal description of our framework is given in Algorithm 2. We give the privacy and utility guarantees of Algorithm 2 in the following theorems.

Algorithm 2 (Worst-group risk minimization using a DP-OCO algorithm)
Input: Sample oracles $\{C_1, \ldots, C_p\}$, DP-OCO algorithm $\mathcal{Q}$, learning rate $\eta$, privacy parameters $\epsilon, \delta$.
Initialize $w_1\in\mathcal{W}$ and $\lambda_1 = (1/p, \ldots, 1/p)\in[0,1]^p$. Set $U = B + \frac{2B}{\epsilon}\log(T)$.
For $t = 1, \ldots, T - 1$:
- Sample $i_t\sim\lambda_t$.
- Sample $x_t^-, x_t^+\sim\mathcal{D}_{i_t}$ using the sample oracle $C_{i_t}$.
- Update $w_{t+1} = \mathcal{Q}(w_{1:t}, x_t^-)$.
- Compute $\tilde{\ell}_t = U - \ell(w_t, x_t^+) + y_t$, where $y_t\sim\mathrm{Lap}(B/\epsilon)$.
- Update $\lambda_{t+1, i_t} = \lambda_{t, i_t}\exp\big(-\eta\,\tilde{\ell}_t/\lambda_{t, i_t}\big)$.
- Normalize $\lambda_{t+1} = \lambda_{t+1}/\|\lambda_{t+1}\|_1$.
Output: $\bar{w}_T = \frac{1}{T}\sum_{t=1}^T w_t$, $\bar{\lambda}_T = \frac{1}{T}\sum_{t=1}^T\lambda_t$.

Theorem 6 (Privacy Guarantee). Algorithm 2 is $(\epsilon, \delta)$-differentially private.

Proof. The sampled data can be divided into two disjoint sets $S = \{S^-, S^+\}$, where $S^- = \{x_1^-, \ldots, x_T^-\}$ and $S^+ = \{x_1^+, \ldots, x_T^+\}$. We proceed inductively. When $t = 1$, it is easy to see that the generation of $(w_1, \lambda_1)$ is $(\epsilon, \delta)$-DP. Now assume that the generation of $\{(w_i, \lambda_i)\}_{i=1}^t$ is $(\epsilon, \delta)$-DP. At iteration $t + 1$, observe that the computation of $w_{t+1}$ depends only on $S^-$ and $\{w_i\}_{i=1}^t$; hence $w_{t+1}$ can be generated under the $(\epsilon, \delta)$-DP constraint due to the privacy guarantee of $\mathcal{Q}$. Furthermore, since $\ell$ is uniformly bounded by $B$, the sensitivity of $\ell(w_t, \cdot)$ w.r.t. replacing one data point in the input sequence is bounded by $B$. Hence, by the properties of the Laplace mechanism, $\tilde{\ell}_t$ is generated in an $(\epsilon, 0)$-DP manner. Given that $\lambda_{t+1}$ depends only on $\tilde{\ell}_t$ and $\lambda_t$, and since DP is closed under post-processing, computing $\lambda_{t+1}$ is $(\epsilon, \delta)$-DP. Since in iteration $t$ two different data points (namely, $x_t^-$ and $x_t^+$) are used to generate $w_{t+1}$ and $\lambda_{t+1}$, by parallel composition (and the induction hypothesis), generating $(w_{t+1}, \lambda_{t+1})$ is $(\epsilon, \delta)$-DP.
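The following minimal Python sketch, offered under assumptions rather than as the authors' implementation, illustrates the reduction in Algorithm 2: the $w$-player is an arbitrary DP-OCO method exposed as the black-box callable `dp_oco_step`, and the $\lambda$-player is an EXP3-style update on Laplace-privatized, importance-weighted losses. The callables, the sign convention for the complementary loss $U - \ell$, and the averaging order are illustrative.

```python
import numpy as np

def worst_group_via_dp_oco(oracles, loss, dp_oco_step, w_init, T, eps, B, rng=None):
    """Sketch of Algorithm 2's reduction: a black-box DP-OCO w-player and an
    EXP3-style lambda-player fed Laplace-privatized losses."""
    rng = rng or np.random.default_rng(0)
    p = len(oracles)
    lam = np.full(p, 1.0 / p)
    U = B + (2.0 * B / eps) * np.log(T)          # keeps privatized losses in [0, 2U] w.h.p.
    eta = np.sqrt(np.log(p) / (p * T * U ** 2))  # EXP3 learning rate as in Theorem 7
    w = np.array(w_init, dtype=float)
    w_avg = np.zeros_like(w)
    lam_avg = np.zeros(p)
    for t in range(1, T):
        i_t = rng.choice(p, p=lam)
        x_minus = oracles[i_t]()                 # data point fed to the DP-OCO player
        x_plus = oracles[i_t]()                  # fresh data point for the lambda-player
        # Privatized "complementary" loss U - l + Laplace noise; shrinking the
        # weight where it is large pushes mass toward high-loss groups.
        ell = U - loss(w, x_plus) + rng.laplace(scale=B / eps)
        lam_new = lam.copy()
        lam_new[i_t] *= np.exp(-eta * ell / lam[i_t])
        lam = lam_new / lam_new.sum()
        w_avg += (w - w_avg) / t                 # running averages of the iterates
        lam_avg += (lam - lam_avg) / t
        w = dp_oco_step(x_minus)                 # black-box DP-OCO update -> w_{t+1}
    return w_avg, lam_avg
```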
Theorem 7 (Utility Guarantee). Suppose the regret of $\mathcal{Q}$ is denoted by $r_{\mathcal{Q}}(\cdot)$. By setting $\eta = \sqrt{\frac{\ln(p)}{pTU^2}}$, we have
$$\mathbb{E}\Big[\max_{i\in[p]} L_{\mathcal{D}_i}(\bar{w}_T)\Big] - \min_{w\in\mathcal{W}}\max_{i\in[p]} L_{\mathcal{D}_i}(w) = \frac{r_{\mathcal{Q}}(K/2)}{K/2} + O\Big(\Big(B + \frac{B\log(K)}{\epsilon}\Big)\sqrt{\frac{p\log p}{K}}\Big)$$
where $\bar{w}_T$ is the output of Algorithm 2 and the expectation is over the sampling of the data points and the algorithm's randomness.

The proof of Theorem 7 can be found in Appendix B.1. In particular, by instantiating the DP-OCO algorithm $\mathcal{Q}$ with the DP-FTRL algorithm (Kairouz et al., 2021), we have the following corollary.

Corollary 8. Let $\mathcal{Q}$ in Algorithm 2 be the DP-FTRL algorithm from (Kairouz et al., 2021). Plugging the regret bound of DP-FTRL into Theorem 7 gives
$$\mathbb{E}\Big[\max_{i\in[p]} L_{\mathcal{D}_i}(\bar{w}_T)\Big] - \min_{w\in\mathcal{W}}\max_{i\in[p]} L_{\mathcal{D}_i}(w) = O\Big(\sqrt{\frac{\sqrt{d}\,\log^2(1/\delta)}{\epsilon K}}\,\log(K) + \Big(B + \frac{B\log(K)}{\epsilon}\Big)\sqrt{\frac{p\log p}{K}}\Big)$$
where the expectation is over the sampling of the data points and the algorithm's randomness.

Remark: Corollary 8 demonstrates an excess expected worst-group risk of $\tilde{O}\Big(\sqrt{\frac{\sqrt{d}}{\epsilon K}} + \sqrt{\frac{p}{K\epsilon^2}} + \sqrt{\frac{p}{K}}\Big)$. In the common case of $\epsilon = \Theta(1)$, the rate in Corollary 8 matches the lower bound of $\Omega\big(\sqrt{p/K}\big)$ shown in (Soma et al., 2022) up to logarithmic factors in the regime $d = O(p^2)$.

5. Private Worst-group Empirical Risk Minimization

In this section, we consider the offline setting where each distribution $\mathcal{D}_i$ is observed by a fixed-size dataset $S_i$ with i.i.d. samples drawn from $\mathcal{D}_i$. In the offline setting, the learner takes the dataset collection $\vec{S} = \{S_1, \ldots, S_p\}$ as input instead of querying the sample oracles directly. This setting captures scenarios where each group is represented by a dataset whose size is fixed beforehand. Without loss of generality, we assume all datasets $S_i$ have the same size, namely $|S_i| = n$ for all $i\in[p]$; note that $n = K/p$ in this case. We first define the worst-group empirical risk minimization problem as follows.

Worst-group empirical risk minimization: Given a dataset collection $\vec{S} = \{S_1, \ldots, S_p\}\in\mathcal{Z}^{n\times p}$, we denote $\hat{\phi}(w, \lambda) = \sum_{i=1}^p\lambda_i L_{S_i}(w)$. The worst-group empirical loss can be expressed as $\max_{i\in[p]} L_{S_i}(w) = \max_{\lambda\in\Delta_p}\hat{\phi}(w, \lambda)$. The empirical objective is hence written as
$$\min_{w\in\mathcal{W}}\max_{\lambda\in\Delta_p}\hat{\phi}(w, \lambda) \qquad (6)$$
We define the excess worst-group empirical risk of $w\in\mathcal{W}$ as $\hat{\mathcal{E}}\big(w; \{S_i\}_{i=1}^p\big) \triangleq \max_{i\in[p]} L_{S_i}(w) - \min_{\bar{w}\in\mathcal{W}}\max_{i\in[p]} L_{S_i}(\bar{w})$. Next, we present a method, described in Algorithm 3, to solve the empirical objective (6) under the differential privacy constraint. Algorithm 3 is based on a private version of the multiplicative reweighting scheme (Abernethy et al., 2022; Diana et al., 2021) and attains nearly optimal excess worst-group empirical risk.

5.1. Private Multiplicative Group Reweighting

Here, we provide the details of our algorithm, described formally in Algorithm 3. In this algorithm, we maintain a sampling weight vector $\lambda\in[0, 1]^p$ over the datasets $\{S_i\}_{i=1}^p$. In each iteration, we first sample a dataset according to $\lambda$. Next, we sample a mini-batch from this dataset and update the model parameters using noisy SGD. We then update the group weight vector $\lambda$ using the multiplicative weights approach, based on privatized loss values computed over the datasets at the current model parameters.

Algorithm 3 (Noisy SGD with Multiplicative Group Reweighting, Noisy-SGD-MGR)
Input: Collection of datasets $\vec{S} = \{S_1, \ldots, S_p\}\in\mathcal{Z}^{n\times p}$, mini-batch size $m$, number of iterations $T$, learning rates $\eta_w$ and $\eta_\lambda$, privacy parameters $\epsilon, \delta$, noise scales $\sigma^2, \tau$.
Initialize $w_1\in\mathcal{W}$ and $\lambda_1 = (1/p, \ldots, 1/p)\in[0,1]^p$.
For $t = 1, \ldots, T - 1$:
- Sample $i_t\sim\lambda_t$.
- Sample $B_t = \{z_1, \ldots, z_m\}$ from $S_{i_t}$ uniformly with replacement.
- Update the model: $w_{t+1} = \mathrm{Proj}_{\mathcal{W}}\Big(w_t - \eta_w\Big(\frac{1}{m}\sum_{z\in B_t}\nabla\ell(w_t, z) + G_t\Big)\Big)$, where $G_t\sim\mathcal{N}(0, \sigma^2 I_d)$.
- Compute $\tilde{L}_t = \big[L_{S_i}(w_t) + y_{i,t}\big]_{i=1}^p$, where $y_{i,t}\overset{iid}{\sim}\mathrm{Lap}(\tau)$.
- Update the weights $\lambda^i_{t+1} = \lambda^i_t\exp\big(\eta_\lambda\tilde{L}^i_t\big)$ for all $i\in[p]$.
- Normalize $\lambda_{t+1} = \lambda_{t+1}/\|\lambda_{t+1}\|_1$.
Output: $\bar{w}_T = \frac{1}{T}\sum_{t=1}^T w_t$, $\bar{\lambda}_T = \frac{1}{T}\sum_{t=1}^T\lambda_t$.

We present the privacy and utility guarantees of Algorithm 3 in the following theorems.

Theorem 9 (Privacy guarantee). In Algorithm 3, let $\tau = \frac{cBp\sqrt{T\log(1/\delta)}}{K\epsilon}$ and $\sigma^2 = \frac{cTp^2L^2\log(T/\delta)\log(1/\delta)}{K^2\epsilon^2}$ for some universal constant $c$. Then Algorithm 3 is $(\epsilon, \delta)$-DP.
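For concreteness, the following is a minimal NumPy sketch, not the authors' implementation, of a single iteration of the reweighting scheme in Algorithm 3. The loss and gradient callables, the representation of each dataset as a NumPy array of examples, and the projection onto a ball of radius `radius` standing in for $\mathcal{W}$ are illustrative assumptions.

```python
import numpy as np

def noisy_sgd_mgr_step(w, lam, datasets, loss, grad, eta_w, eta_lam,
                       sigma, tau, m, radius=1.0, rng=None):
    """One iteration of private multiplicative group reweighting: a noisy
    minibatch SGD step on a group sampled from lam, then a multiplicative
    weights update of lam on Laplace-privatized group losses."""
    rng = rng or np.random.default_rng()
    p, d = len(datasets), w.shape[0]
    # Sample a group according to lam and a minibatch (with replacement) from it.
    i_t = rng.choice(p, p=lam)
    S = datasets[i_t]
    batch = S[rng.integers(0, len(S), size=m)]
    g = np.mean([grad(w, z) for z in batch], axis=0) + rng.normal(scale=sigma, size=d)
    w_new = w - eta_w * g
    nrm = np.linalg.norm(w_new)                 # projection onto the ball W
    if nrm > radius:
        w_new *= radius / nrm
    # Laplace-privatized per-group empirical losses, evaluated at the current w.
    noisy_losses = np.array([np.mean([loss(w, z) for z in S_i]) for S_i in datasets])
    noisy_losses += rng.laplace(scale=tau, size=p)
    # Multiplicative weights: a larger (privatized) loss yields a larger group weight.
    lam_new = lam * np.exp(eta_lam * noisy_losses)
    lam_new /= lam_new.sum()
    return w_new, lam_new
```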
Theorem 10 (Convergence guarantee). There exist settings of $T = O\Big(\frac{(ML + B\log(p))K^2\epsilon^2}{GBdp^2\log(1/\delta)}\Big)$, $\eta_w = O\Big(\sqrt{\frac{M^2}{T(G^2 + d\sigma^2)}}\Big)$, and $\eta_\lambda = O\Big(\sqrt{\frac{\log p}{TU^2}}\Big)$, where $U = B + O\big(\tau\log(KT)\big)$ with $\tau$ as in Theorem 9, such that
$$\mathbb{E}\Big[\max_{i\in[p]} L_{S_i}(\bar{w}_T)\Big] - \min_{w\in\mathcal{W}}\max_{i\in[p]} L_{S_i}(w) = O\Big(\frac{p\sqrt{d\log(1/\delta)}}{K\epsilon}\Big)\cdot\mathrm{polylog}(K, p, 1/\delta)$$
where the expectation is over the algorithm's randomness.

The convergence rate in Theorem 10 is nearly optimal up to logarithmic factors. A lower bound instance can be created that reduces the original problem to a DP-ERM problem with a single dataset, which leads to a lower bound of $\Omega\big(\frac{\sqrt{d}}{n\epsilon}\big) = \Omega\big(\frac{p\sqrt{d}}{K\epsilon}\big)$ (Bassily et al., 2014). A more detailed argument for the lower bound instance can be found in Appendix D. Reference (Abernethy et al., 2022) proposes a gradient method based on an active group selection scheme. One can also design a private algorithm based on this scheme; however, it achieves a suboptimal rate. We defer the details to Appendix C.2.

Our results for the offline setting can be readily extended to the case where the dataset sizes are non-uniform. In particular, when different groups are associated with datasets of different sizes, let $n_{\min}$ denote the minimum dataset size across groups. For the worst-group population risk, we can simply replace $n$ with $n_{\min}$ in Algorithm 1 and sample without replacement from the dataset of each group. This gives an excess worst-group population risk of $\tilde{O}\big(\frac{\sqrt{d}}{n_{\min}\epsilon} + \frac{1}{\sqrt{n_{\min}}}\big)$. For the worst-group empirical risk, we can let the added noise scale with $n_{\min}$ instead of $n$ in Algorithm 3, which gives an excess worst-group empirical risk of $\tilde{O}\big(\frac{\sqrt{d}}{n_{\min}\epsilon}\big)$. Both rates can be shown to be nearly optimal by constructing a matching lower bound instance (similar to our construction in Appendix D) in which the group with the minimum dataset size has a higher population or empirical risk than the other groups.

6. Conclusion

We presented differentially private algorithms for solving the worst-group risk minimization problem. In particular, we gave two upper bounds on the excess worst-group population risk, one of which is tight in the offline setting, and we established the optimal rate for the excess worst-group empirical risk. Establishing the optimal rate for the worst-group population risk in general is left as an interesting open problem for future work.

Acknowledgements

This work is supported by NSF Award 2112471 and NSF CAREER Award 2144532.

Impact Statement

This work aims at advancing the theoretical understanding of privacy-preserving machine learning, which is an area of clear and well-established societal impacts. There is nothing about the societal impact of this work which we feel must be specifically highlighted here.

References

Abernethy, J. D., Awasthi, P., Kleindessner, M., Morgenstern, J., Russell, C., and Zhang, J. Active sampling for min-max fairness.
In International Conference on Machine Learning, volume 162, 2022.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48-77, 2002.

Bassily, R., Smith, A., and Thakurta, A. Private empirical risk minimization: Efficient algorithms and tight error bounds. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pp. 464-473. IEEE, 2014.

Bassily, R., Feldman, V., Talwar, K., and Guha Thakurta, A. Private stochastic convex optimization with optimal rates. Advances in Neural Information Processing Systems, 32, 2019.

Bassily, R., Guzmán, C., and Menart, M. Differentially private algorithms for the stochastic saddle point problem with optimal rates for the strong gap. arXiv preprint arXiv:2302.12909, 2023.

Blum, A., Haghtalab, N., Procaccia, A. D., and Qiao, M. Collaborative PAC learning. Advances in Neural Information Processing Systems, 30, 2017.

Diana, E., Gill, W., Kearns, M., Kenthapadi, K., and Roth, A. Minimax group fairness: Algorithms and experiments. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pp. 66-76, 2021.

Dwork, C., McSherry, F., Nissim, K., and Smith, A. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7, 2006, Proceedings 3, pp. 265-284. Springer, 2006.

Feldman, V. and Vondrak, J. High probability generalization bounds for uniformly stable algorithms with nearly optimal rate. In Conference on Learning Theory, pp. 1270-1279. PMLR, 2019.

Feldman, V., Koren, T., and Talwar, K. Private stochastic convex optimization: optimal rates in linear time. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pp. 439-449, 2020.

Haghtalab, N., Jordan, M., and Zhao, E. On-demand sampling: Learning optimally from multiple distributions. Advances in Neural Information Processing Systems, 35:406-419, 2022.

Kairouz, P., McMahan, B., Song, S., Thakkar, O., Thakurta, A., and Xu, Z. Practical and private (deep) learning without sampling or shuffling. In International Conference on Machine Learning, pp. 5213-5225. PMLR, 2021.

Mohri, M., Sivek, G., and Suresh, A. T. Agnostic federated learning. In International Conference on Machine Learning, pp. 4615-4625. PMLR, 2019.

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574-1609, 2009.

Orabona, F. A modern introduction to online learning. arXiv preprint arXiv:1912.13213, 2019.

Ozdaglar, A., Pattathil, S., Zhang, J., and Zhang, K. What is a good metric to study generalization of minimax learners? Advances in Neural Information Processing Systems, 35:38190-38203, 2022.

Papadaki, A., Martinez, N., Bertran, M., Sapiro, G., and Rodrigues, M. Minimax demographic group fairness in federated learning. In 2022 ACM Conference on Fairness, Accountability, and Transparency, pp. 142-159, 2022.

Sagawa, S., Koh, P. W., Hashimoto, T. B., and Liang, P. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. arXiv preprint arXiv:1911.08731, 2019.

Soma, T., Gatmiry, K., and Jegelka, S. Optimal algorithms for group distributionally robust optimization and beyond. arXiv preprint arXiv:2212.13669, 2022.

Yang, Z., Hu, S., Lei, Y., Varshney, K. R., Lyu, S., and Ying, Y.
Differentially private SGDA for minimax problems. In Uncertainty in Artificial Intelligence, pp. 2192-2202. PMLR, 2022.

Zhang, J., Hong, M., Wang, M., and Zhang, S. Generalization bounds for stochastic saddle point problems. In International Conference on Artificial Intelligence and Statistics, pp. 568-576. PMLR, 2021.

Zhang, L., Thekumparampil, K. K., Oh, S., and He, N. Bring your own algorithm for optimal differentially private stochastic minimax optimization. Advances in Neural Information Processing Systems, 35:35174-35187, 2022.

Zhang, L., Zhao, P., Yang, T., and Zhou, Z.-H. Stochastic approximation approaches to group distributionally robust optimization. arXiv preprint arXiv:2302.09267, 2023.

A. Missing Proofs in Section 3

A.1. Non-private algorithm for the regularized objective

Here, we provide a non-private algorithm for solving the regularized minimax problem in (5), written as
$$\min_{w\in\mathcal{W}}\max_{\lambda\in\Delta_p} F(w, \lambda) := \sum_{i=1}^p\lambda_i L_{S_i}(w) + \frac{\mu_w}{2}\|w - w'\|_2^2 - \mu_\lambda\sum_{j=1}^p\lambda_j\log\lambda_j$$
with a convergence rate of $O(1/N)$, where $N$ is the number of iterations.

Algorithm 4 (Minimax Optimization for an SC-SC Objective)
Input: Collection of datasets $\vec{S} = \{S_1, \ldots, S_p\}$, regularization parameters $\mu_w, \mu_\lambda > 0$.
Initialize $w_1\in\mathcal{W}$.
For $t = 1, \ldots, N - 1$:
- Compute $\tilde{\lambda}_t = \big(\exp(L_{S_1}(w_t)/\mu_\lambda), \ldots, \exp(L_{S_p}(w_t)/\mu_\lambda)\big)$ and let $\lambda_t = \tilde{\lambda}_t/\|\tilde{\lambda}_t\|_1$.
- Compute $\nabla_t = \sum_{i=1}^p\lambda^i_t\nabla L_{S_i}(w_t) + \mu_w(w_t - w')$.
- Update $w_{t+1} = \mathrm{Proj}_{\mathcal{W}}(w_t - \eta_t\nabla_t)$.
Output: $\bar{w}_N = \frac{1}{N}\sum_{t=1}^N w_t$ and $\bar{\lambda}_N = \frac{1}{N}\sum_{t=1}^N\lambda_t$.

We now provide the convergence guarantee for Algorithm 4.

Theorem 11. Let $\eta_t = \frac{1}{\mu_w t}$. Then
$$\max_{\lambda\in\Delta_p} F(\bar{w}_N, \lambda) - \min_{w\in\mathcal{W}} F(w, \bar{\lambda}_N) = O\Big(\frac{(L + \mu_w M)^2\ln N}{\mu_w N}\Big)$$

Proof. An important observation is that $\lambda_t$ is the best response of the max player for $F(w_t, \cdot)$, that is,
$$\lambda_t = \arg\max_{\lambda\in\Delta_p} F(w_t, \lambda)$$
Then by Corollary 11.16 in (Orabona, 2019), we have
$$\max_{\lambda\in\Delta_p} F(\bar{w}_N, \lambda) - \min_{w\in\mathcal{W}} F(w, \bar{\lambda}_N) \le \frac{\mathrm{Regret}^w_N}{N} \qquad (7)$$
where $\mathrm{Regret}^w_N = \sum_{t=1}^N l_t(w_t) - \min_{w\in\mathcal{W}}\sum_{t=1}^N l_t(w)$ and $l_t(w) = F(w, \lambda_t)$. Since $w$ is updated using online gradient descent, $l_t(w)$ is $\mu_w$-strongly convex, and $\|\nabla_t\|_2 \le L + \mu_w M$, Corollary 4.9 in (Orabona, 2019) gives
$$\mathrm{Regret}^w_N = O\Big(\frac{(L + \mu_w M)^2}{\mu_w}(1 + \ln N)\Big) \qquad (8)$$
Combining equations (7) and (8), we get the desired result.

A.2. Auxiliary Lemma

Lemma 12. Let $(\hat{w}, \hat{\lambda})$ be the exact saddle point of (5) and $(\bar{w}, \bar{\lambda})$ be an $\alpha$-saddle point of (5). Then we have
$$\|\hat{w} - \bar{w}\|_2 \le \sqrt{\frac{2\alpha}{\mu_w}}$$

Proof. Since $(\bar{w}, \bar{\lambda})$ is an $\alpha$-saddle point of $F(w, \lambda)$,
$$F(\bar{w}, \hat{\lambda}) - F(\hat{w}, \bar{\lambda}) \le \alpha \;\Longrightarrow\; F(\bar{w}, \hat{\lambda}) - F(\hat{w}, \hat{\lambda}) + F(\hat{w}, \hat{\lambda}) - F(\hat{w}, \bar{\lambda}) \le \alpha$$
Since $(\hat{w}, \hat{\lambda})$ is the saddle point, $F(\hat{w}, \hat{\lambda}) - F(\hat{w}, \bar{\lambda}) \ge 0$. Therefore,
$$F(\bar{w}, \hat{\lambda}) - F(\hat{w}, \hat{\lambda}) \le \alpha$$
Also, $\hat{w} = \arg\min_{w\in\mathcal{W}} F(w, \hat{\lambda})$ and $F(\cdot, \hat{\lambda})$ is $\mu_w$-strongly convex, so
$$\frac{\mu_w}{2}\|\hat{w} - \bar{w}\|_2^2 \le F(\bar{w}, \hat{\lambda}) - F(\hat{w}, \hat{\lambda}) \le \alpha \;\Longrightarrow\; \|\hat{w} - \bar{w}\|_2 \le \sqrt{\frac{2\alpha}{\mu_w}}$$

A.3. Proof of Lemma 3

Proof. Note that the empirical objective $\hat{\phi}(w, \lambda) = \sum_{i=1}^p\lambda_i L_{S_i}(w)$ can also be written in a finite-sum form. For any $i\in[p]$, let $z^j_i$ denote the $j$-th data point of dataset $S_i$ and define the batched node $\xi^j = (z^j_1, \ldots, z^j_p)$. Hence $\vec{S}$ can be expressed as $\vec{S} = \{\xi^1, \ldots, \xi^n\}$. We also define a new loss function $f(w, \lambda, \xi)$, where $\xi = (z_1, \ldots, z_p)$, as
$$f(w, \lambda, \xi) = \sum_{i=1}^p\lambda_i\ell(w, z_i)$$
Then it is easy to show that
$$\hat{\phi}(w, \lambda) = \frac{1}{n}\sum_{j=1}^n f(w, \lambda, \xi^j) \qquad (9)$$
Suppose $\vec{S} = \{\xi^1, \ldots, \xi^j, \ldots, \xi^n\}$ and $\vec{S}' = \{\xi^1, \ldots, \xi'^j, \ldots, \xi^n\}$, where $\xi^j$ and $\xi'^j$ differ in a single data point (the data point of one group) for some $j\in[n]$. By equation (9), we have
$$\hat{\phi}(w, \lambda) = \frac{1}{n}\sum_{\xi\in\vec{S}} f(w, \lambda, \xi), \qquad \hat{\phi}'(w, \lambda) = \frac{1}{n}\sum_{\xi\in\vec{S}'} f(w, \lambda, \xi)$$
We also let $\Psi(w, \lambda) = \frac{\mu_w}{2}\|w - w'\|_2^2 - \mu_\lambda\sum_{j=1}^p\lambda_j\log\lambda_j$.
Let $(\hat{w}, \hat{\lambda})$ and $(\hat{w}', \hat{\lambda}')$ be the exact saddle points of $\hat{\phi}(w, \lambda) + \Psi(w, \lambda)$ and $\hat{\phi}'(w, \lambda) + \Psi(w, \lambda)$, respectively. It is easy to show that $\|\nabla_w f(w, \lambda, \xi)\|_2 \le L$ and $\|\nabla_\lambda f(w, \lambda, \xi)\|_\infty \le B$ for any $\xi$. Also, $\hat{\phi}(w, \lambda) + \Psi(w, \lambda)$ is $\mu_w$-strongly convex w.r.t. $\|\cdot\|_2$ and $\mu_\lambda$-strongly concave w.r.t. $\|\cdot\|_1$. Then, using Lemma 2 from (Zhang et al., 2021), we obtain
$$\mu_w\|\hat{w} - \hat{w}'\|_2^2 + \mu_\lambda\|\hat{\lambda} - \hat{\lambda}'\|_1^2 \le \frac{2}{n}\Big(L\|\hat{w} - \hat{w}'\|_2 + B\|\hat{\lambda} - \hat{\lambda}'\|_1\Big)$$
Furthermore, this implies
$$\|\hat{w} - \hat{w}'\|_2 \le \frac{2}{n}\Big(\frac{L}{\mu_w} + \frac{B}{\sqrt{\mu_w\mu_\lambda}}\Big)$$
Meanwhile, by Lemma 12 and the assumption $\alpha \le \frac{L^2}{8n^2\mu_w} + \frac{B^2}{8n^2\mu_\lambda}$, we also have
$$\|\bar{w} - \hat{w}\|_2 \le \sqrt{\frac{2\alpha}{\mu_w}} \le \frac{1}{2n}\Big(\frac{L}{\mu_w} + \frac{B}{\sqrt{\mu_w\mu_\lambda}}\Big), \qquad \|\bar{w}' - \hat{w}'\|_2 \le \sqrt{\frac{2\alpha}{\mu_w}} \le \frac{1}{2n}\Big(\frac{L}{\mu_w} + \frac{B}{\sqrt{\mu_w\mu_\lambda}}\Big)$$
Putting everything together via the triangle inequality,
$$\|\bar{w} - \bar{w}'\|_2 \le \frac{3}{n}\Big(\frac{L}{\mu_w} + \frac{B}{\sqrt{\mu_w\mu_\lambda}}\Big)$$

A.4. Proof of Lemma 4

Proof. Let $R(w) = \max_{i\in[p]} L_{\mathcal{D}_i}(w)$ and $\hat{\phi}(w, \lambda) = \sum_{i\in[p]}\lambda_i L_{S_i}(w)$. Denote by $(\hat{w}, \hat{\lambda})$ the exact saddle point of $F(w, \lambda)$. By the definition of a saddle point, we have for any $w\in\mathcal{W}$ and $\lambda\in\Delta_p$,
$$\hat{\phi}(\hat{w}, \lambda) + \frac{\mu_w}{2}\|\hat{w} - w'\|_2^2 - \mu_\lambda\sum_{j=1}^p\lambda_j\log\lambda_j \le \hat{\phi}(w, \hat{\lambda}) + \frac{\mu_w}{2}\|w - w'\|_2^2 - \mu_\lambda\sum_{j=1}^p\hat{\lambda}_j\log\hat{\lambda}_j$$
which implies
$$\hat{\phi}(\hat{w}, \lambda) - \hat{\phi}(w, \hat{\lambda}) \le \frac{\mu_w}{2}\|w - w'\|_2^2 - \frac{\mu_w}{2}\|\hat{w} - w'\|_2^2 - \mu_\lambda\sum_{j=1}^p\hat{\lambda}_j\log\hat{\lambda}_j + \mu_\lambda\sum_{j=1}^p\lambda_j\log\lambda_j \le \frac{\mu_w}{2}\|w - w'\|_2^2 + \mu_\lambda\log p$$
In particular, for any $w\in\mathcal{W}$,
$$\max_{i\in[p]} L_{S_i}(\hat{w}) - \hat{\phi}(w, \hat{\lambda}) \le \frac{\mu_w}{2}\|w - w'\|_2^2 + \mu_\lambda\log p$$
By Lemma 3, for any fixed $i\in[p]$, the computation of $\hat{w}$ has uniform stability $\gamma = \frac{2L^2}{n\mu_w} + \frac{2LB}{n\sqrt{\mu_w\mu_\lambda}}$ with respect to the change of a data point in dataset $S_i$. Therefore, by fixing the randomness of $\vec{S}_{-i} = \{S_1, \ldots, S_{i-1}, S_{i+1}, \ldots, S_p\}$ and applying Theorem 1.1 from (Feldman & Vondrak, 2019), we have with probability at least $1 - \beta$,
$$|L_{S_i}(\hat{w}) - L_{\mathcal{D}_i}(\hat{w})| \le c\Big(\gamma\log(n)\log(n/\beta) + B\sqrt{\frac{\log(1/\beta)}{n}}\Big)$$
Then we can release the randomness of $\vec{S}_{-i}$ and obtain the same bound with probability at least $1 - \beta$ over the sampling of $\vec{S}$. Using a union bound across all $i\in[p]$, with probability at least $1 - \beta$, for all $i\in[p]$,
$$|L_{S_i}(\hat{w}) - L_{\mathcal{D}_i}(\hat{w})| = O\Big(\Big(\frac{L^2}{n\mu_w} + \frac{LB}{n\sqrt{\mu_w\mu_\lambda}}\Big)\log(n)\log(np/\beta) + B\sqrt{\frac{\log(np/\beta)}{n}}\Big)$$
Therefore, with probability at least $1 - \beta$,
$$\Big|\max_{i\in[p]} L_{\mathcal{D}_i}(\hat{w}) - \max_{i\in[p]} L_{S_i}(\hat{w})\Big| = O\Big(\Big(\frac{L^2}{n\mu_w} + \frac{LB}{n\sqrt{\mu_w\mu_\lambda}}\Big)\log(n)\log(np/\beta) + B\sqrt{\frac{\log(np/\beta)}{n}}\Big)$$
Also, since $w$ is independent of the datasets, with probability at least $1 - \beta$,
$$|L_{S_i}(w) - L_{\mathcal{D}_i}(w)| = O\Big(B\sqrt{\frac{\log(p/\beta)}{n}}\Big)\ \text{for all } i\in[p], \qquad\text{hence}\qquad |\phi(w, \hat{\lambda}) - \hat{\phi}(w, \hat{\lambda})| = O\Big(B\sqrt{\frac{\log(p/\beta)}{n}}\Big)$$
Putting everything together, with probability at least $1 - 2\beta$,
$$R(\hat{w}) - R(w) = \max_{i\in[p]} L_{\mathcal{D}_i}(\hat{w}) - \max_{i\in[p]} L_{\mathcal{D}_i}(w) \le \max_{i\in[p]} L_{\mathcal{D}_i}(\hat{w}) - \phi(w, \hat{\lambda}) \le \max_{i\in[p]} L_{S_i}(\hat{w}) - \hat{\phi}(w, \hat{\lambda}) + O\Big(\Big(\frac{L^2}{n\mu_w} + \frac{LB}{n\sqrt{\mu_w\mu_\lambda}}\Big)\log(n)\log(np/\beta) + B\sqrt{\frac{\log(np/\beta)}{n}}\Big)$$
$$\le \frac{\mu_w}{2}\|w - w'\|_2^2 + \mu_\lambda\log p + O\Big(\Big(\frac{L^2}{n\mu_w} + \frac{LB}{n\sqrt{\mu_w\mu_\lambda}}\Big)\log(n)\log(np/\beta) + B\sqrt{\frac{\log(np/\beta)}{n}}\Big)$$
Choosing $\beta = 1/n$ and taking the expectation over the dataset collection $\vec{S}$, we obtain
$$\mathbb{E}[R(\hat{w})] - R(w) \le \frac{\mu_w}{2}\|w - w'\|_2^2 + \mu_\lambda\log p + O\Big(\Big(\frac{L^2}{n\mu_w} + \frac{LB}{n\sqrt{\mu_w\mu_\lambda}}\Big)\log(n)\log(np) + B\sqrt{\frac{\log(np)}{n}}\Big) \qquad (10)$$
By Lemma 12, we have
$$\|\bar{w} - \hat{w}\|_2 \le \sqrt{\frac{2\alpha}{\mu_w}} \le \frac{1}{2n}\Big(\frac{L}{\mu_w} + \frac{B}{\sqrt{\mu_w\mu_\lambda}}\Big)$$
Given that $R(\cdot)$ is $L$-Lipschitz,
$$|R(\bar{w}) - R(\hat{w})| \le L\|\bar{w} - \hat{w}\|_2 = O\Big(\frac{L^2}{n\mu_w} + \frac{LB}{n\sqrt{\mu_w\mu_\lambda}}\Big) \qquad (11)$$
Combining equations (10) and (11), we have
$$\mathbb{E}[R(\bar{w})] - R(w) \le \frac{\mu_w}{2}\|w - w'\|_2^2 + \mu_\lambda\log p + O\Big(\Big(\frac{L^2}{n\mu_w} + \frac{LB}{n\sqrt{\mu_w\mu_\lambda}}\Big)\log(n)\log(np) + B\sqrt{\frac{\log(np)}{n}}\Big)$$

A.5. Proof of Theorem 2

Proof. Privacy: At iteration $t$, by the stability argument in Lemma 3, we have
$$\|\tilde{w}_t - \tilde{w}'_t\|_2 \le \frac{3L}{n_t\mu_w^t} + \frac{3B}{n_t\sqrt{\mu_w^t\mu_\lambda^t}} = 3L\eta_t + 3B\sqrt{\frac{n\,\eta\eta_t}{n_t}} \le 6D\sqrt{(\log_2 n)\,\eta\eta_t}$$
where $\tilde{w}_t$ and $\tilde{w}'_t$ are the outputs obtained from neighboring dataset collections $\vec{S}^t$ and $\vec{S}'^t$ that differ in one data point. Since the sampled minibatches $\vec{S}^t$ are disjoint across iterations, the privacy of Algorithm 1 follows from the privacy guarantee of the Gaussian mechanism and parallel composition.

Utility: By Lemma 4, we have
$$\mathbb{E}[R(\tilde{w}_t) - R(\tilde{w}_{t-1})] = O\Big(\frac{\mathbb{E}[\|\xi_{t-1}\|_2^2]}{n_t\eta_t} + \frac{\log p}{\eta n} + D^2\sqrt{\eta\eta_t}\log^{1.5}(n)\log(np) + B\sqrt{\frac{\log(np)}{n_t}}\Big)$$
where $R(w) = \max_{i\in[p]} L_{\mathcal{D}_i}(w)$. Also, we have
$$\mathbb{E}\|\xi_t\|_2^2 = d\sigma_t^2 = 72\,d\,D^2\log_2(n)\log(1/\delta)\,2^{-t}\eta^2/\epsilon^2$$
Therefore, as long as $\eta \le \frac{M\epsilon}{D\sqrt{72\,d\log_2(n)\log(1/\delta)}}$, we will have $\mathbb{E}\|\xi_t\|_2^2 \le 2^{-t}M^2$ and $\mathbb{E}\|\xi_t\|_2 \le 2^{-t/2}M$. Let $\tilde{w}_0 = w^*$ and $\xi_0 = w_0 - w^*$. Then $\|\xi_0\|_2 \le M$.
Hence,
$$\mathbb{E}[R(w_T)] - R(w^*) = \sum_{t=1}^T\mathbb{E}[R(\tilde{w}_t) - R(\tilde{w}_{t-1})] + \mathbb{E}[R(w_T) - R(\tilde{w}_T)]$$
$$\le O\Big(\sum_{t=1}^T\Big(\frac{\mathbb{E}[\|\xi_{t-1}\|_2^2]}{n_t\eta_t} + \frac{\log p}{\eta n} + D^2\sqrt{\eta\eta_t}\log^{1.5}(n)\log(np) + B\sqrt{\frac{\log(n)\log(pn)}{n}}\Big)\Big) + L\,\mathbb{E}[\|\xi_T\|_2]$$
The inequality holds since $R(\cdot)$ is $L$-Lipschitz and $n_t = n/\log_2(n)$. We have
$$\sum_{t=1}^T\frac{\mathbb{E}[\|\xi_{t-1}\|_2^2]}{n_t\eta_t} \le \sum_{t=1}^T\frac{2^{-(t-1)}M^2\log_2(n)}{n\,2^{-t}\eta} = \sum_{t=1}^T\frac{2(\log_2 n)M^2}{\eta n} \le \frac{2(\log_2 n)^2M^2}{\eta n}$$
and $\mathbb{E}[\|\xi_T\|_2] \le M\,2^{-\log_2(n)/2} = \frac{M}{\sqrt{n}}$. Therefore,
$$\mathbb{E}[R(w_T)] - R(w^*) = O\Big(\frac{M^2\log^2(np)}{\eta n} + D^2\eta\log^{2.5}(np) + B\sqrt{\frac{\log(n)\log(np)}{n}} + \frac{LM}{\sqrt{n}}\Big)$$
Choosing $\eta$ to balance the first two terms, subject to the constraint from the privacy analysis, gives
$$\mathbb{E}[R(w_T)] - R(w^*) = O\Big(\frac{MD\log^{11/4}(np)}{\sqrt{n}} + \frac{MD\log^{5/2}(np)\sqrt{d\log(1/\delta)}}{n\epsilon} + B\sqrt{\frac{\log(n)\log(np)}{n}}\Big)$$
Replacing $n$ with $K/p$ gives the desired result.

B. Missing Proofs in Section 4

B.1. Proof of Theorem 7

Proof. It is easy to see that the objective in (3) is equivalent to
$$\min_{w\in\mathcal{W}}\max_{\lambda\in\Delta_p}\sum_{i\in[p]}\lambda_i L_{\mathcal{D}_i}(w)$$
where $\Delta_p = \{\lambda\in[0,1]^p : \|\lambda\|_1 = 1\}$ is the probability simplex over the $p$ groups. Recall $\phi(w, \lambda) = \sum_{i\in[p]}\lambda_i L_{\mathcal{D}_i}(w)$. Then we have
$$\max_{\lambda\in\Delta_p}\phi(\bar{w}_T, \lambda) - \min_{w\in\mathcal{W}}\max_{\lambda\in\Delta_p}\phi(w, \lambda) = \max_{\lambda\in\Delta_p}\phi(\bar{w}_T, \lambda) - \max_{\lambda\in\Delta_p}\min_{w\in\mathcal{W}}\phi(w, \lambda)$$
$$\le \frac{1}{T}\max_{w\in\mathcal{W},\,\lambda\in\Delta_p}\sum_{t=1}^T\big(\phi(w_t, \lambda) - \phi(w, \lambda_t)\big) = \frac{1}{T}\max_{w\in\mathcal{W},\,\lambda\in\Delta_p}\sum_{t=1}^T\big(\phi(w_t, \lambda) - \phi(w_t, \lambda_t) + \phi(w_t, \lambda_t) - \phi(w, \lambda_t)\big)$$
$$\le \frac{1}{T}\Big(\underbrace{\sum_{t=1}^T\phi(w_t, \lambda_t) - \min_{w\in\mathcal{W}}\sum_{t=1}^T\phi(w, \lambda_t)}_{A} + \underbrace{\sum_{t=1}^T\big(-\phi(w_t, \lambda_t)\big) - \min_{\lambda\in\Delta_p}\sum_{t=1}^T\big(-\phi(w_t, \lambda)\big)}_{B}\Big)$$
We address term $A$ first. Fixing the randomness of $w_t$ and $\lambda_t$ at some iteration $t$, we have $\mathbb{E}_{i_t\sim\lambda_t,\,x_t^-\sim\mathcal{D}_{i_t}}[\ell(w_t, x_t^-)] = \phi(w_t, \lambda_t)$. Extending the same argument to all $t\in[T]$, we obtain for any $w$,
$$\mathbb{E}\Big[\sum_{t=1}^T\big(\phi(w_t, \lambda_t) - \phi(w, \lambda_t)\big)\Big] = \mathbb{E}\Big[\sum_{t=1}^T\big(\ell(w_t, x_t^-) - \ell(w, x_t^-)\big)\Big]$$
By the regret guarantee of $\mathcal{Q}$, for any sequence $x_1^-, \ldots, x_T^-$,
$$\mathbb{E}\Big[\sum_{t=1}^T\big(\ell(w_t, x_t^-) - \ell(w, x_t^-)\big)\Big] \le r_{\mathcal{Q}}(T)$$
where the expectation is taken over the randomness of the algorithm. Therefore,
$$\mathbb{E}[A] \le r_{\mathcal{Q}}(T) \qquad (12)$$
We then address term $B$. We first define $\ell'(w, x)$ as the difference between the fixed constant $U$ and the original loss, that is, $\ell'(w, x) = U - \ell(w, x)$. Similarly, the vector of the $p$ population risks under the new loss $\ell'$ is written as $g(w) = [L'_{\mathcal{D}_1}(w), \ldots, L'_{\mathcal{D}_p}(w)]$. The term $B$ can therefore be equivalently written as
$$B = \sum_{t=1}^T\langle g(w_t), \lambda_t\rangle - \min_{\lambda\in\Delta_p}\sum_{t=1}^T\langle g(w_t), \lambda\rangle$$
Notice that at each iteration $t$, $\nabla_\lambda\langle g(w_t), \lambda\rangle = g(w_t)$. We construct an unbiased estimator $g'(w_t)$ of $g(w_t)$ as
$$g'(w_t)_i = \begin{cases}\tilde{\ell}_t/\lambda_t^i & i = i_t\\ 0 & \text{otherwise}\end{cases}$$
By letting $U = B + \frac{2B}{\epsilon}\log(T)$, we have with probability at least $1 - \frac{1}{T}$ that $\max_{t\in[T]}|y_t| \le \frac{2B}{\epsilon}\log(T)$; we denote this event by $E$. Conditioned on $E$, we have $\tilde{\ell}_t\in[0, 2U]$ for every $t\in[T]$. Observing that, for any fixed $w_t$, $x_t^+\sim\mathcal{D}_{i_t}$, we obtain $\mathbb{E}[g'(w_t)\,|\,E] = g(w_t)$. We now bound term $B$:
$$\mathbb{E}\Big[\sum_{t=1}^T\langle g(w_t), \lambda_t\rangle - \min_{\lambda}\sum_{t=1}^T\langle g(w_t), \lambda\rangle\,\Big|\,E\Big] \le \mathbb{E}\Big[\sum_{t=1}^T\langle g'(w_t), \lambda_t\rangle - \min_{\lambda}\sum_{t=1}^T\langle g'(w_t), \lambda\rangle\,\Big|\,E\Big]$$
Since the losses fed to EXP3 are bounded by $2U$ conditioned on event $E$, setting $\eta = \sqrt{\frac{\ln(p)}{pTU^2}}$ and invoking the regret bound of EXP3 yields
$$\mathbb{E}\Big[\sum_{t=1}^T\langle g(w_t), \lambda_t\rangle - \min_{\lambda}\sum_{t=1}^T\langle g(w_t), \lambda\rangle\,\Big|\,E\Big] = O\big(U\sqrt{Tp\ln p}\big)$$
Taking the expectation over $E$,
$$\mathbb{E}[B] = \mathbb{P}(E)\,\mathbb{E}[B\,|\,E] + (1 - \mathbb{P}(E))\,\mathbb{E}[B\,|\,E^c] = O\big(U\sqrt{Tp\ln p}\big)$$
Finally, plugging in the value of $U$,
$$\mathbb{E}[B] = O\Big(\Big(B + \frac{B\log(T)}{\epsilon}\Big)\sqrt{Tp\ln p}\Big) \qquad (13)$$
We combine equations (12) and (13) (dividing by $T$) to get the desired result.

C. Missing Proofs from Section 5

C.1. Proof of Theorems 9 and 10

Proof (Proof of Theorem 9). Consider two neighboring dataset collections $\vec{S} = \{S_1, \ldots, S_p\}$ and $\vec{S}' = \{S'_1, \ldots, S'_p\}$ that differ in one data point of the $j$-th dataset for some $j\in[p]$. In iteration $t$, if $i_t \neq j$, then the output distribution of $w_{t+1}$ is the same under $\vec{S}$ and $\vec{S}'$. Otherwise, generating $w_{t+1}$ satisfies $\big(\frac{\epsilon}{c\sqrt{T\log(1/\delta)}}, \frac{\delta}{2T}\big)$-DP by the properties of the Gaussian mechanism and privacy amplification by subsampling. Therefore, overall, generating $w_{t+1}$ is $\big(\frac{\epsilon}{c\sqrt{T\log(1/\delta)}}, \frac{\delta}{2T}\big)$-DP. Meanwhile, generating $\lambda_{t+1}$ is $\big(\frac{\epsilon}{c\sqrt{T\log(1/\delta)}}, 0\big)$-DP by the properties of the Laplace mechanism.
Using basic composition, generating the pair $(w_{t+1}, \lambda_{t+1})$ is $\big(\frac{2\epsilon}{c\sqrt{T\log(1/\delta)}}, \frac{\delta}{2T}\big)$-DP. Hence, by strong composition over $t\in[T]$, the algorithm is $(\epsilon, \delta)$-DP.

Proof (Proof of Theorem 10). Recall that $n = K/p$. We have
$$\max_{\lambda\in\Delta_p}\hat{\phi}(\bar{w}_T, \lambda) - \min_{w\in\mathcal{W}}\max_{\lambda\in\Delta_p}\hat{\phi}(w, \lambda) \le \max_{\lambda\in\Delta_p}\hat{\phi}(\bar{w}_T, \lambda) - \min_{w\in\mathcal{W}}\hat{\phi}(w, \bar{\lambda}_T)$$
$$\le \frac{1}{T}\max_{\lambda\in\Delta_p,\,w\in\mathcal{W}}\sum_{t=1}^T\big(\hat{\phi}(w_t, \lambda) - \hat{\phi}(w, \lambda_t)\big) = \frac{1}{T}\max_{\lambda\in\Delta_p,\,w\in\mathcal{W}}\sum_{t=1}^T\big(\hat{\phi}(w_t, \lambda) - \hat{\phi}(w_t, \lambda_t) + \hat{\phi}(w_t, \lambda_t) - \hat{\phi}(w, \lambda_t)\big)$$
$$\le \frac{1}{T}\Big(\underbrace{\sum_{t=1}^T\hat{\phi}(w_t, \lambda_t) - \min_{w\in\mathcal{W}}\sum_{t=1}^T\hat{\phi}(w, \lambda_t)}_{A} + \underbrace{\sum_{t=1}^T\big(-\hat{\phi}(w_t, \lambda_t)\big) - \min_{\lambda\in\Delta_p}\sum_{t=1}^T\big(-\hat{\phi}(w_t, \lambda)\big)}_{B}\Big)$$
Next, we bound the terms $A$ and $B$ separately. We address term $A$ first. For any $w\in\mathcal{W}$, by convexity,
$$\mathbb{E}\Big[\sum_{t=1}^T\big(\hat{\phi}(w_t, \lambda_t) - \hat{\phi}(w, \lambda_t)\big)\Big] \le \mathbb{E}\Big[\sum_{t=1}^T\langle\nabla\hat{\phi}(w_t, \lambda_t), w_t - w\rangle\Big]$$
For any iteration $t$, fixing the randomness up to $w_t$ and $\lambda_t$ and letting $\nabla_t = \frac{1}{m}\sum_{z\in B_t}\nabla\ell(w_t, z) + G_t$ be the gradient estimate used in the model update of Algorithm 3, we have $\mathbb{E}[\nabla_t] = \nabla\hat{\phi}(w_t, \lambda_t)$. Extending the same argument to all $t\in[T]$ and using the standard analysis of projected SGD, we obtain
$$\mathbb{E}\Big[\sum_{t=1}^T\langle\nabla\hat{\phi}(w_t, \lambda_t), w_t - w\rangle\Big] = \mathbb{E}\Big[\sum_{t=1}^T\langle\nabla_t, w_t - w\rangle\Big] \le \frac{M^2}{2\eta_w} + \frac{\eta_w}{2}\sum_{t=1}^T\mathbb{E}\big[\|\nabla_t\|_2^2\big]$$
Meanwhile, we let $U = B + 8\tau\log(pnT/2)$ and denote $L'_t = [U - L_{S_i}(w_t) + y_{i,t}]_{i=1}^p$. By a union bound, we have $\|L'_t\|_\infty \le 2U$ for all $t\in[T]$ with probability at least $1 - \frac{1}{n}$; we denote this event by $E$. Conditioned on event $E$, our update rule for $\lambda$ can be seen as applying the Hedge algorithm with $L'_t$ as the loss input, and term $B$ is its corresponding regret. Therefore, by the regret bound of Hedge,
$$\mathbb{E}[B\,|\,E] = O\big(U\sqrt{T\log(p)}\big)$$
Taking the expectation over $E$,
$$\mathbb{E}[B] = \mathbb{P}(E)\,\mathbb{E}[B\,|\,E] + \mathbb{P}(E^c)\,\mathbb{E}[B\,|\,E^c] = O\big(U\sqrt{T\log(p)}\big)$$
By combining the bounds on $\mathbb{E}[A]$ and $\mathbb{E}[B]$ and plugging in the values of $T$, $\eta_w$, and $\eta_\lambda$, we obtain the desired result.

C.2. Private Active Group Selection

Here we present another algorithm, built upon the active group selection scheme and described in Algorithm 5. Algorithm 5 differs from Algorithm 3 in the group selection method: instead of maintaining a weight vector to sample the dataset, we select the dataset with the (approximately) highest loss in each iteration. To maintain privacy, we resort to the report-noisy-max mechanism for the group selection. After such an (approximately) worst-off group is selected, we sample a mini-batch from it and use noisy SGD to update the model parameters.

Algorithm 5 (Noisy SGD with active group selection, Noisy-SGD-AGS)
Input: Collection of datasets $\vec{S} = \{S_1, \ldots, S_p\}\in\mathcal{Z}^{n\times p}$, mini-batch size $m$, number of iterations $T$, learning rate $\eta$.
For $t = 1, \ldots, T - 1$:
- Compute $i_t = \arg\max_{i\in[p]}\{L_{S_i}(w_t) + y_{i,t}\}$, where $y_{i,t}\overset{iid}{\sim}\mathrm{Lap}(\tau)$.
- Sample $B_t = \{z_1, \ldots, z_m\}$ from $S_{i_t}$ uniformly with replacement.
- Update the model: $w_{t+1} = \mathrm{Proj}_{\mathcal{W}}\Big(w_t - \eta\Big(\frac{1}{m}\sum_{z\in B_t}\nabla\ell(w_t, z) + G_t\Big)\Big)$, where $G_t\sim\mathcal{N}(0, \sigma^2 I_d)$.
Output: $\bar{w}_T = \frac{1}{T}\sum_{t=1}^T w_t$.

Theorem 13 (Privacy guarantee). Algorithm 5 is $(\epsilon, \delta)$-DP with $\tau = \frac{cBp\sqrt{T\log(1/\delta)}}{K\epsilon}$ and $\sigma^2 = \frac{cTL^2p^2\log(K/\delta)\log(1/\delta)}{K^2\epsilon^2}$ for some universal constant $c$.

Theorem 14 (Convergence rate). With probability at least $1 - \kappa$, by letting $T = \frac{MLK\epsilon}{16Bp\sqrt{\log(1/\delta)}}$ and $\eta = \frac{M}{\sqrt{T(G^2 + d\sigma^2)}}$, we have
$$\mathbb{E}\Big[\max_{i\in[p]} L_{S_i}(\bar{w}_T)\Big] - \min_{w\in\mathcal{W}}\max_{i\in[p]} L_{S_i}(w) = O\Big(\sqrt{\frac{MLBp}{K\epsilon}} + \frac{MLp\sqrt{d}}{K\epsilon}\Big)\cdot\mathrm{polylog}(p, \delta, \kappa, n)$$

Proof. By a union bound, with probability at least $1 - \kappa$, $|y_{i,t}| \le \tau\log(pT/(2\kappa))$ for all $i\in[p]$ and $t\in[T]$. We denote this event by $A$ and condition on $A$.
By denoting $R(\kappa, T) = 2\tau\log(pT/(2\kappa))$, we have, conditioned on event $A$,
$$\mathbb{E}\Big[\max_{i\in[p]} L_{S_i}(\bar{w}_T)\Big] \le \frac{1}{T}\sum_{t=1}^T\mathbb{E}\Big[\max_{i\in[p]} L_{S_i}(w_t)\Big] \quad\text{(convexity of } \max_i L_{S_i}\text{)}$$
$$\le \frac{1}{T}\sum_{t=1}^T\mathbb{E}\big[L_{S_{i_t}}(w_t)\big] + R(\kappa, T) \quad\text{(event A)}$$
For any fixed $w^*$, by the convexity of the loss function,
$$\sum_{t=1}^T\big(L_{S_{i_t}}(w_t) - L_{S_{i_t}}(w^*)\big) \le \sum_{t=1}^T\langle\nabla L_{S_{i_t}}(w_t), w_t - w^*\rangle$$
Denote $\nabla_t = \frac{1}{m}\sum_{z\in B_t}\nabla\ell(w_t, z) + G_t$, where $B_t$ is the mini-batch uniformly sampled from $S_{i_t}$ and $G_t\sim\mathcal{N}(0, \sigma^2 I_d)$. Fixing the randomness of $w_t$ and $i_t$, it is easy to see that $\mathbb{E}[\nabla_t] = \nabla L_{S_{i_t}}(w_t)$, where the randomness comes from the data-point sampling and the added Gaussian noise. Releasing the randomness of $w_t$ and $i_t$ and extending the same analysis to all $t\in[T]$, we obtain
$$\mathbb{E}\Big[\sum_{t=1}^T\big(L_{S_{i_t}}(w_t) - L_{S_{i_t}}(w^*)\big)\Big] \le \mathbb{E}\Big[\sum_{t=1}^T\langle\nabla_t, w_t - w^*\rangle\Big] \le \frac{M^2}{2\eta} + \frac{\eta}{2}\sum_{t=1}^T\mathbb{E}\big[\|\nabla_t\|_2^2\big]$$
Therefore, conditioned on event $A$,
$$\mathbb{E}\Big[\max_{i\in[p]} L_{S_i}(\bar{w}_T)\Big] \le \frac{1}{T}\sum_{t=1}^T\mathbb{E}\big[L_{S_{i_t}}(w^*)\big] + \frac{M^2}{2\eta T} + \frac{\eta(G^2 + d\sigma^2)}{2} + R(\kappa, T) \le \max_{i\in[p]} L_{S_i}(w^*) + \frac{M^2}{2\eta T} + \frac{\eta(G^2 + d\sigma^2)}{2} + R(\kappa, T)$$
Plugging in the values of $\eta$, $\sigma^2$, and $T$, and replacing $n$ with $K/p$, we get the desired result.

D. Lower Bounds for the Offline Setting

Consider an $L$-Lipschitz convex loss function $\ell(w, z)$ bounded in $[0, B]$ for any $w\in\mathcal{W}$ and $z\in\mathcal{Z}$. We create a new loss function $\ell'(w, (z, y)) = \ell(w, z) + y$, where $y\in[0, B]$. It is easy to see that $\ell'(\cdot, (z, y))$ is also convex and $L$-Lipschitz.

Empirical case: We let $y = B$ for all data points in $S_1 = \{(z_1, y_1), \ldots, (z_n, y_n)\}$ and $y = 0$ for all data points in $S_i$ with $i\neq 1$. Therefore, for any $w\in\mathcal{W}$ and $i\in[p]$,
$$L'_{S_1}(w) \ge L'_{S_i}(w)$$
where $L'_{S_i}(w) = \frac{1}{n}\sum_{(z, y)\in S_i}\ell'(w, (z, y))$. Then we have
$$\min_{w\in\mathcal{W}}\max_{i\in[p]} L'_{S_i}(w) = \min_{w\in\mathcal{W}} L'_{S_1}(w)$$
The original worst-group empirical risk minimization problem is thus reduced to a single-distribution DP-ERM problem, which is known to have a lower bound of $\Omega\big(\frac{\sqrt{d}}{n\epsilon}\big)$ for Lipschitz convex losses (Bassily et al., 2014). Therefore, the excess worst-group empirical risk is also lower bounded by $\Omega\big(\frac{\sqrt{d}}{n\epsilon}\big)$. Since $n = K/p$, this lower bound can also be written as $\Omega\big(\frac{p\sqrt{d}}{K\epsilon}\big)$.

Population case: We let $y = B$ when $(z, y)\sim\mathcal{D}_1$ and $y = 0$ when $(z, y)\sim\mathcal{D}_i$ with $i\neq 1$; the distribution of $z$ is arbitrary. Therefore, for any $w\in\mathcal{W}$ and $i\in[p]$,
$$L'_{\mathcal{D}_1}(w) \ge L'_{\mathcal{D}_i}(w) \qquad\text{and}\qquad \min_{w\in\mathcal{W}}\max_{i\in[p]} L'_{\mathcal{D}_i}(w) = \min_{w\in\mathcal{W}} L'_{\mathcal{D}_1}(w)$$
Similar to the empirical case, we can reduce the worst-group population risk minimization problem to a single-distribution DP-SCO problem, which is lower bounded by $\Omega\big(\frac{\sqrt{d}}{n\epsilon} + \frac{1}{\sqrt{n}}\big)$ (Bassily et al., 2019). Therefore, the excess worst-group population risk is also lower bounded by $\Omega\big(\frac{\sqrt{d}}{n\epsilon} + \frac{1}{\sqrt{n}}\big)$. Since $n = K/p$, this lower bound can also be written as $\Omega\big(\frac{p\sqrt{d}}{K\epsilon} + \sqrt{\frac{p}{K}}\big)$.
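As a quick numerical illustration of this reduction (not part of the paper's argument), the following small Python snippet uses an illustrative bounded toy loss and checks that, under the shifted loss $\ell'$, group 1 attains the maximum empirical risk for every tested $w$, so the worst-group problem indeed collapses to a single-group problem.

```python
import numpy as np

# Toy check: with l'(w,(z,y)) = l(w,z) + y and y = B only for group 1,
# group 1 is the worst group for every w (since l is bounded in [0, B]).
rng = np.random.default_rng(0)
B, p, n, d = 1.0, 4, 50, 3
base_loss = lambda w, z: min(B, 0.5 * np.sum((w - z) ** 2))   # bounded in [0, B]
groups = [rng.normal(size=(n, d)) for _ in range(p)]
shifts = [B] + [0.0] * (p - 1)                                 # y = B for group 1 only

for _ in range(100):
    w = rng.normal(size=d)
    risks = [np.mean([base_loss(w, z) for z in S]) + y for S, y in zip(groups, shifts)]
    assert np.argmax(risks) == 0        # group 1 always attains the maximum
print("group 1 is the worst group for all tested w")
```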