# The Bayesian Stability Zoo

Shay Moran
Department of Mathematics & Department of Computer Science, Technion - Israel Institute of Technology
smoran@technion.ac.il

Hilla Schefler
Department of Mathematics, Technion - Israel Institute of Technology
hillas@campus.technion.ac.il

Jonathan Shafer
Computer Science Division, UC Berkeley
shaferjo@berkeley.edu

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Abstract

We show that many definitions of stability found in the learning theory literature are equivalent to one another. We distinguish between two families of definitions of stability: distribution-dependent and distribution-independent Bayesian stability. Within each family, we establish equivalences between various definitions, encompassing approximate differential privacy, pure differential privacy, replicability, global stability, perfect generalization, TV stability, mutual information stability, KL-divergence stability, and Rényi-divergence stability. Along the way, we prove boosting results that enable the amplification of the stability of a learning rule. This work is a step towards a more systematic taxonomy of stability notions in learning theory, which can promote clarity and an improved understanding of an array of stability concepts that have emerged in recent years.

1 Introduction

Algorithmic stability is a major theme in learning theory, where seminal results have firmly established its close relationship with generalization. Recent research has further highlighted the intricate interplay between stability and additional properties of interest beyond statistical generalization. These properties encompass privacy [DMNS06], fairness [HKRR18], replicability [BGH+23, ILPS22], adaptive data analysis [DFH+15b, DFH+15a], and mistake bounds in online learning [ALMM19, BLM20]. This progress has come with a proliferation of formal definitions of stability, including pure and approximate Differential Privacy [DMNS06, DKM+06], Perfect Generalization [CLN+16], Global Stability [BLM20], KL-Stability [McA99], TV-Stability [KKMV23], f-Divergence Stability [EGI20], Rényi Divergence Stability [EGI20], and Mutual Information Stability [XR17, BMN+18], as well as related combinatorial quantities such as the Littlestone dimension [Lit87] and the clique dimension [AMSY23]. It is natural to wonder to what extent these various and sundry notions of stability actually differ from one another. The type of equivalence we consider between definitions of stability is as follows.

Definition A and Definition B are weakly equivalent if for every hypothesis class H the following holds:
H has a PAC learning rule that is stable according to Definition A ⟺ H has a PAC learning rule that is stable according to Definition B.

This type of equivalence is weak because it does not imply that a learning rule satisfying one definition also satisfies the other. Recent results show that many stability notions appearing in the literature are in fact weakly equivalent. The work of [BGH+23] has shown sample-efficient reductions between approximate differential privacy, replicability, and perfect generalization. Combined with the work of [ABL+22, ILPS22, KKMV23, MM22], a rich web of equivalences is being uncovered between approximate differential privacy and other definitions of algorithmic stability (see Fig. 1). In this paper we extend the study of equivalences between notions of stability, and make it more systematic.
Our starting point is the following observation: many of the definitions mentioned above belong to a broad family of definitions of stability, which we informally call Bayesian definitions of stability. Definitions in this family roughly take the following form: a learning rule A is considered stable if the quantity d(A(S), P) is small enough, where:

- d is a measure of dissimilarity between distributions;
- P is a specific prior distribution over hypotheses;
- A(S) is the posterior distribution, i.e., the distribution over hypotheses generated by the learning rule A when applied to the input sample S.

Namely, a Bayesian definition of stability is parameterized by a choice of d, a choice of P, and a specification of how small the dissimilarity is required to be.¹

¹ An example of an application in the context of generalization is the classic PAC-Bayes theorem. The theorem assures that for every population distribution and any given prior P, the difference between the population error of an algorithm A and the empirical error of A is bounded by O(√((KL(A(S) ‖ P) + log m)/m)), where m is the size of the input sample S, and the KL divergence is the measure of dissimilarity between the prior P and the posterior A(S). See e.g. Theorem 3.2.

Remark 1.1. To understand our choice of the name Bayesian stability, recall that the terms prior and posterior come from Bayesian statistics. In Bayesian statistics the analyst has some prior distribution over possible hypotheses before conducting the analysis, and chooses a posterior distribution over hypotheses when the analysis is complete. Bayesian stability is defined in terms of the dissimilarity between these two distributions.

A central insight of this paper is that there exists a meaningful distinction between two types of Bayesian definitions, based on whether the choice of the prior P depends on the population distribution D:

Distribution-independent (DI) stability. These are Bayesian definitions of stability in which P is some fixed prior that depends only on the class H and the learning rule A, and does not depend on the population distribution D. Namely, they take the form:
∃ prior P ∀ population D ∀m ∈ N : d(A(S), P) is small, where S ∼ D^m.

Distribution-dependent (DD) stability. Here, the prior may depend also on D, so each population distribution D might have a different prior. Namely:
∀ population D ∃ prior P_D ∀m ∈ N : d(A(S), P_D) is small.

A substantial body of literature has investigated the interconnections among distribution-dependent definitions. In Theorem 1.4, we provide a comprehensive summary of the established equivalences. A natural question arises as to whether a similar web of equivalences exists for distribution-independent definitions. Our principal contribution is to affirm that, indeed, such a network exists. Identifying such equivalences is a step towards creating a comprehensive taxonomy of stability definitions.

1.1 Our Contribution

Our first main contribution is an equivalence between distribution-independent definitions of stability.

Theorem (Informal Version of Theorem 2.1). The following definitions of stability are weakly equivalent:
1. Pure Differential Privacy; (Definition 3.5)
2. Distribution-Independent KL-Stability; (Definition 3.6)
3. Distribution-Independent One-Way Pure Perfect Generalization; (Definition 3.7)
4. Distribution-Independent Dα-Stability for α ∈ (1, ∞), (Definition 3.6)
where Dα is the Rényi divergence of order α. Furthermore, a hypothesis class H has a PAC learning rule that is stable according to one of these definitions if and only if H has finite fractional clique dimension (see Appendix B.1).
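To make the Bayesian-stability template d(A(S), P) concrete, the following toy sketch (ours, not from the paper) computes the dissimilarity for two common choices of d, the KL divergence and a Rényi divergence, when the posterior and the prior are distributions over a small finite hypothesis set; the hypothesis names and probabilities are made-up placeholders. In the distribution-independent setting the prior would be fixed once and for all, whereas in the distribution-dependent setting a different prior P_D could be used for each population distribution D.

```python
# Toy illustration (not from the paper) of the Bayesian-stability quantity
# d(A(S), P): the dissimilarity between a posterior over hypotheses and a prior.
# The hypothesis set, prior, and posterior below are made up.
import math

def kl_divergence(post, prior):
    """KL(post || prior) for distributions given as dicts: hypothesis -> probability."""
    return sum(p * math.log(p / prior[h]) for h, p in post.items() if p > 0)

def renyi_divergence(post, prior, alpha):
    """D_alpha(post || prior) for alpha in (1, infinity); alpha -> 1 recovers KL."""
    s = sum(prior[h] * (post.get(h, 0.0) / prior[h]) ** alpha for h in prior)
    return math.log(s) / (alpha - 1)

prior = {"h0": 1 / 3, "h1": 1 / 3, "h2": 1 / 3}   # fixed prior P
posterior = {"h0": 0.8, "h1": 0.15, "h2": 0.05}   # posterior A(S) after seeing S

print(kl_divergence(posterior, prior))          # the KL-stability dissimilarity
print(renyi_divergence(posterior, prior, 2.0))  # a Renyi-stability dissimilarity
```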
Remark 1.2. Observe that DI KL-stability is equivalent to DI D1-stability, and DI one-way pure perfect generalization is equivalent to DI D∞-stability. Therefore, the above theorem can be viewed as stating a weak equivalence between pure differential privacy and Dα-stability for α ∈ [1, ∞].

Remark 1.3. In this paper we focus purely on the information-theoretic aspects of learning under stability constraints, and therefore we consider learning rules that are mathematical functions, and disregard considerations of computability and computational complexity.

Table 1 summarizes the distribution-independent definitions discussed in Theorem 2.1. All the definitions in each row are weakly equivalent.

Table 1: Distribution-independent Bayesian definitions of stability.

| Name | Dissimilarity | Definition |
| --- | --- | --- |
| KL-Stability | P_S[KL(A(S) ‖ P) ≤ o(m)] ≥ 1 − o(1) | 3.6 |
| Dα-Stability | P_S[Dα(A(S) ‖ P) ≤ o(m)] ≥ 1 − o(1) | 3.6 |
| Pure Perfect Generalization | P_S[∀O : A(S)(O) ≤ e^{o(m)} P(O)] ≥ 1 − o(1) | 3.7 |

One example of how the equivalence results can help build bridges between different stability notions in the literature is the connection between pure differential privacy and the PAC-Bayes theorem. Both of these are fundamental ideas that have been extensively studied. Theorem 2.1 states that a hypothesis class admits a pure differentially private PAC learner if and only if it admits a distribution-independent KL-stable PAC learner. This is an interesting and non-trivial connection between two well-studied notions. As a concrete example of this connection, recall that thresholds over the real line cannot be learned by a differentially private learner [ALMM19]. Hence, by Theorem 2.1, there does not exist a PAC learner for thresholds that is KL-stable. Another example is half-spaces with margins in R^d. Half-spaces with margins are differentially privately learnable [?], therefore there exists a PAC learner for half-spaces with margins that is KL-stable.

Our second main contribution is a boosting result for weak learners that have bounded KL-divergence with respect to a distribution-independent prior. Our result demonstrates that distribution-independent KL-stability is boostable. It is interesting to see that one can simultaneously boost both the stability and the learning parameters of an algorithm.

Theorem (Informal Version of Theorem 2.2). Let H be a hypothesis class. If there exists a weak learner A for H, and there exists a prior distribution P such that the expectation of KL(A(S) ‖ P) is bounded, then there exists a KL-stable PAC learner that admits a logarithmic divergence bound.

The proof of Theorem 2.2 relies on connections between boosting of PAC learners and online learning with expert advice.

[Figure 1 appears here. Its nodes are: DD Perfect Generalization, Approximate DP, Replicability, MI-Stability, Finite Littlestone Dimension, Finite Clique Dimension, DD TV-Stability, DD KL-Stability, Global Stability, Max Information. Its arrow key is: weakly implies; assuming countable domain; same algorithm; assuming finite domain. Its numbered edges refer to: 1. Lemma C.6; 2. Lemma C.13 ([PNG22]); 3. Lemma C.11; 4. Lemma C.8 ([LM20]); 5. Corollary 2, [ALMM19]; 6. Theorem 10, [BLM20]; 7. Theorem 17, [BLM20]; 8. Corollary 3.13, [BGH+23]; 9. Lemma 3.14, [BGH+23]; 10. Theorem 3.17, [BGH+23]; 11. Theorem 3.19, [BGH+23]; 12. Theorem 3.1, [BGH+23]; 13. Theorem 1, [KKMV23]; 14. Theorem 2, [KKMV23]; 15. Theorem 4, [KKMV23]; 16. Theorem 3, [KKMV23]; 17. Theorems 2.2 and 2.7, [AMSY23].]
Figure 1: A summary of equivalences between distribution-dependent definitions of stability (Theorem 1.4). A solid black arrow from A to B means that definition A weakly implies definition B. A dashed blue arrow from A to B means that A weakly implies B only if the domain X is countable. A dotted red arrow from A to B means that A weakly implies B only if the domain X is finite. A double brown arrow from A to B means that every learning rule that satisfies definition A also satisfies definition B.

Lastly, after conducting an extensive review of the literature, we have compiled a comprehensive network of equivalence results for distribution-dependent definitions of stability. This network is presented in Theorem 1.4, Figure 1, and Table 2.

Theorem 1.4 (Distribution-Dependent Equivalences; [ABL+22, ILPS22, MM22, PNG22, BGH+23, KKMV23]). The following definitions of stability are weakly equivalent with respect to an arbitrary hypothesis class H:
1. Approximate Differential Privacy; (Definition 3.5)
2. Distribution-Dependent KL-Stability; (Definition 3.6)
3. Mutual-Information Stability; (Definition 3.12)
4. Global Stability. (Definition 3.11)
If the domain is countable then the following are also weakly equivalent to the above:
5. Distribution-Dependent TV-Stability; (Definition 3.13)
6. Replicability. (Definition 3.8)
If the domain is finite then the following are also weakly equivalent to the above:
7. One-Way Perfect Generalization; (Definition 3.7)
8. Max Information. (Definition 3.14)
Furthermore, for any hypothesis class H, the following conditions are equivalent:
- H has a PAC learning rule that is stable according to one of the definitions 1 to 6 (and the cardinality of the domain is as described above);
- H has finite Littlestone dimension; (Definition C.3)
- H has finite clique dimension. (Definition C.5)

We emphasize that Theorem 1.4 is a summary of existing results, and is not a new result. We believe that our compilation serves as a valuable resource, and that stating these results here in a unified framework helps to convey the conceptual message of this paper. Namely, the fact that a large number of disparate results can neatly be organized based on our notions of distribution-dependent and distribution-independent definitions of stability is a valuable observation that can help researchers make sense of the stability landscape.

Table 2: Distribution-dependent Bayesian definitions of stability.

| Name | Dissimilarity | Definition | References |
| --- | --- | --- | --- |
| KL-Stability | P_S[KL(A(S) ‖ P_D) ≤ o(m)] ≥ 1 − o(1) | 3.6 | [McA99] |
| TV-Stability | E_S[TV(A(S), P_D)] ≤ o(1) | 3.13 | [KKMV23] |
| MI-Stability | E_S[KL(A(S) ‖ P_D)] ≤ o(m) | 3.12 | [XR17, BMN+18] |
| Perfect Generalization | P_S[∀O : A(S)(O) ≤ e^ε P_D(O) + δ] ≥ 1 − o(1) | 3.7 | [CLN+16] |
| Global Stability | P_{S, h∼P_D}[A(S) = h] ≥ η | 3.11 | [BLM20] |
| Replicability | P_{r∼R}[P_{S∼D^m, h_r∼P_{D,r}}[A(S; r) = h_r] ≥ η] ≥ ν | 3.10 | [BGH+23, ILPS22] |

1.2 Related Works

The literature on stability is vast. Stability has been studied in the context of optimization, statistical estimation, regularization (e.g., [Tik43] and [Phi62]), the bias-variance tradeoff, algorithmic stability (e.g., [BE02]; see bibliography in Section 13.6 of [SB14]), bagging [Bre96], online learning and optimization and bandit algorithms (e.g., [Han58]; see bibliography in Section 28.6 of [LS20]), and other topics.
There are numerous definitions of stability, including pure and approximate Differential Privacy [DMNS06, DKM+06], Perfect Generalization [CLN+16], Global Stability [BLM20], KL-Stability [McA99], TV-Stability [KKMV23], f-Divergence Stability [EGI20], Rényi Divergence Stability [EGI20], and Mutual Information Stability [XR17, BMN+18].

Our work is most directly related to the recent publication by Bun et al. [BGH+23]. They established connections and separations between replicability, approximate differential privacy, max-information, and perfect generalization for a broad class of statistical tasks. The reductions they present are sample-efficient, and nearly all are computationally efficient and apply to a general outcome space. Their results are central to the understanding of equivalences between notions of stability as laid out in the current paper. A concurrent work by Kalavasis et al. [KKMV23] showed that TV-stability, replicability, and approximate differential privacy are equivalent; this holds for general statistical tasks on countable domains, and for PAC learning on any domain. They also provide a statistical amplification and TV-stability boosting algorithm for PAC learning on countable domains. Additionally, recent works [AUZ23, HKMN23] have shown an equivalence between differential privacy and robustness for estimation tasks.

Theorem 2.2 is a boosting result. Boosting has been a central topic of study in computational learning theory since its inception in the 1990s by Schapire [?] and Freund [Fre95]. The best-known boosting algorithm is AdaBoost [FS97], which has been extensively studied. Boosting also has rich connections with other topics such as game theory, online learning, and convex optimization (see [SF12], Chapter 10 in [SB14], and Chapter 7 in [MRT18]).

2 Technical Overview

This section presents the complete versions of Theorems 1.4 and 2.2. We provide a concise overview of the key ideas and techniques employed in the proofs. All proofs appear in the appendices. Please refer to Section 3 for a complete overview of preliminaries, including all technical terms and definitions.

2.1 Equivalences between DI Bayesian Notions of Stability

The following theorem, which is one of the main results of this paper, shows the equivalence between different distribution-independent definitions. The content of Theorem 2.1 is summarized in Table 1.

Theorem 2.1 (Distribution-Independent Equivalences). Let H be a hypothesis class. The following are equivalent.
1. There exists a learning rule that PAC learns H and satisfies pure differential privacy (Definition 3.5).
2. H has finite fractional clique dimension.
3. For every α ∈ [1, ∞], there exists a learning rule that PAC learns H and satisfies distribution-independent Dα-stability (Definition 3.6).
4. For every α ∈ [1, ∞], there exists a distribution-independent Dα-stable PAC learner A for H that satisfies the following:
(i) A is interpolating almost surely. Namely, for every H-realizable distribution D, P_{S∼D^m}[L_S(A(S)) = 0] = 1.
(ii) A admits a divergence bound of f(m) = O(log m), with confidence β(m) = 0. I.e., for every H-realizable distribution D, Dα(A(S) ‖ P) ≤ O(log m) with probability 1, where S ∼ D^m and P is a prior distribution independent of D.
(iii) For every H-realizable distribution D, the expected population loss of A with respect to D satisfies E_{S∼D^m}[L_D(A(S))] ≤ O(√(log(m)/m)).
In particular, plugging α = 1 in Item (ii) implies KL-stability with divergence bound f(m) = O(log m) and confidence β(m) = 0. Plugging α = ∞ implies distribution-independent one-way ε-pure perfect generalization, with ε(m) = O(log m) and confidence β(m) = 0.
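As an illustration of the kind of learner described in Item 4 (and constructed in Section 2.1.1 below), the following is a hedged toy sketch, not the paper's exact construction: it draws hypotheses from a fixed, distribution-independent prior and returns the first one that interpolates the input sample. The function names, the trial budget, and the toy threshold class are our own placeholders.

```python
# A toy sketch (ours) of a prior-based interpolating learner: sample from a fixed
# prior until a hypothesis consistent with the input sample is found.
import random

def empirical_loss(h, sample):
    return sum(h(x) != y for x, y in sample) / len(sample)

def rejection_sampling_learner(sample, sample_from_prior, max_tries=10**6):
    """Return the first hypothesis drawn from the prior with zero empirical loss on `sample`."""
    for _ in range(max_tries):
        h = sample_from_prior()
        if empirical_loss(h, sample) == 0:
            return h
    raise RuntimeError("no consistent hypothesis found within the trial budget")

# Toy usage: a uniform prior over threshold functions on {0, ..., 9}.
thresholds = [lambda x, k=k: int(x > k) for k in range(10)]
sample_from_prior = lambda: random.choice(thresholds)
S = [(2, 0), (7, 1)]  # a realizable sample
h = rejection_sampling_learner(S, sample_from_prior)
print([h(x) for x in range(10)])
```

Because the posterior of such a learner is supported on hypotheses consistent with S, its divergence from the prior is controlled by the prior mass of the consistent set, which is what yields the logarithmic divergence bound in Item (ii).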
2.1.1 Proof Idea for Theorem 2.1

We prove the following chain of implications:
Pure DP ⟹(1) D∞-Stability ⟹(2) Dα-Stability for all α ∈ [1, ∞] ⟹(3) Pure DP.

Pure DP ⟹ D∞-Stability. The first step towards proving implication (1) is to define a suitable prior distribution P over hypotheses. The key tool we used in order to define P is the characterization of pure DP via the fractional clique dimension [AMSY23]. In a nutshell, [AMSY23] proved that (i) a class H is pure DP learnable if and only if the fractional clique dimension of H is finite; (ii) the fractional clique dimension is finite if and only if there exist a polynomial q(m) and a distribution over hypotheses P_m, such that for every realizable sample S of size m, we have
P_{h∼P_m}[L_S(h) = 0] ≥ 1/q(m). (1)
(For more details please refer to Appendix B.1.) Now, the desired prior distribution P is defined to be a mixture of all the P_m's. The next step in the proof is to define a learning rule A: (i) sample hypotheses from the prior P; (ii) return the first hypothesis h that is consistent with the input sample S (i.e., L_S(h) = 0). A is well-defined since with high probability it will stop and return a hypothesis after q(m) re-samples from P. Since the posterior A(S) is supported on {h : L_S(h) = 0}, a simple calculation which follows from Equation (1) shows that for every realizable distribution D, D∞(A(S) ‖ P) ≤ log(q(m)) almost surely, where S ∼ D^m. Finally, since for α ∈ [1, ∞] the Rényi divergence Dα(Q1 ‖ Q2) is non-decreasing in α (see Lemma A.1), we conclude that KL(A(S) ‖ P) ≤ O(log m), hence by the PAC-Bayes theorem A generalizes.

D∞-Stability ⟹ Dα-Stability for all α ∈ [1, ∞]. This implication is immediate since the Rényi divergence Dα(Q1 ‖ Q2) is non-decreasing in α.

Dα-Stability for all α ∈ [1, ∞] ⟹ Pure DP. In fact, it suffices to assume KL-stability. We prove that the promised prior P satisfies that for every realizable sample S of size m, we have P_{h∼P}[L_S(h) = 0] ≥ 1/poly(m), and conclude that H is pure DP learnable. Given a realizable sample S of size m, we uniformly sample m·log m examples from S and feed the new sample S′ to the promised KL-stable learner A. The key observation is that if KL(A(S′) ‖ P) is small, then one can lower bound the probability of an event according to P by its probability according to A(S′). The proof then follows by applying a standard concentration argument.

2.2 Stability Boosting

We prove a boosting result for weak learners with bounded KL with respect to a distribution-independent prior. We show that every learner with bounded KL that slightly beats random guessing can be amplified to a learner with logarithmic KL and expected loss of O(√(log(m)/m)).

Theorem 2.2 (Boosting Weak Learners with Bounded KL). Let X be a set, let H ⊆ {0,1}^X be a hypothesis class, and let A be a learning rule. Assume there exist k ∈ N and γ > 0 such that
∀D ∈ Realizable(H) : E_{S∼D^k}[L_D(A(S))] ≤ 1/2 − γ, (2)
and there exist P ∈ Δ({0,1}^X) and b ≥ 0 such that
∀D ∈ Realizable(H) : E_{S∼D^k}[KL(A(S) ‖ P)] ≤ b. (3)
Then, there exists an interpolating learning rule A′ that PAC learns H with logarithmic KL-stability. More explicitly, there exist a prior distribution P′ ∈ Δ({0,1}^X) and functions b′ and ε′ that depend on γ and b such that
∀D ∈ Realizable(H) ∀m ∈ N : P_{S∼D^m}[KL(A′(S) ‖ P′) ≤ b′(m) = O(log(m))] = 1, (4)
E_{S∼D^m}[L_D(A′(S))] ≤ ε′(m) = O(√(log(m)/m)). (5)
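Before the proof idea, here is a much-simplified sketch (ours) of the aggregation at the heart of the construction: run the weak learner on several small subsamples and output the majority vote of the resulting hypotheses. It omits the expert-weighted resampling and the KL check used by the actual construction (Algorithm 1 in Appendix A); `weak_learner` and the uniform subsampling are placeholders.

```python
# Simplified sketch (ours) of majority-vote aggregation of weak hypotheses.
import random

def majority_vote(hypotheses):
    """Combine binary hypotheses into their pointwise majority vote."""
    def maj(x):
        votes = sum(h(x) for h in hypotheses)
        return int(2 * votes >= len(hypotheses))
    return maj

def boosted_learner(sample, weak_learner, rounds, k):
    """Train `rounds` weak hypotheses on size-k subsamples and return their majority vote."""
    hypotheses = []
    for _ in range(rounds):
        # Placeholder: the actual construction draws S_t from the expert weights w_t.
        subsample = [random.choice(sample) for _ in range(k)]
        hypotheses.append(weak_learner(subsample))
    return majority_vote(hypotheses)
```

In the construction behind Theorem 2.2, the subsamples are drawn from the expert-weight distributions w_t maintained by an online learner over the examples of S, a round is re-run whenever KL(A(S_t) ‖ P) exceeds 2b/γ, and the number of rounds is T = O(log(m)/γ²).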
2.2.1 Proof Idea for Theorem 2.2

The strong learning rule A′ is obtained by simulating the weak learner A on O(log(m)/γ²) samples of constant size k (which are carefully sampled from the original input sample S). Then, A′ returns an aggregated hypothesis: the majority vote of the outputs of A. As it turns out, A′ satisfies logarithmic KL-stability with respect to a prior P′ that is a mixture of majority votes of the original prior P. The analysis involves a reduction to regret analysis of online learning using expert advice, and also uses properties of the KL-divergence.

3 Preliminaries

3.1 Divergences

The Rényi α-divergence is a measure of dissimilarity between distributions that generalizes many common dissimilarity measures, including the Bhattacharyya coefficient (α = 1/2), the Kullback–Leibler divergence (α = 1), the log of the expected ratio (α = 2), and the log of the maximum ratio (α = ∞).

Definition 3.1 (Rényi divergence; [Rén61, vEH14]). Let α ∈ (1, ∞). The Rényi divergence of order α of the distribution P from the distribution Q is
Dα(P ‖ Q) = (1/(α − 1)) · log E_{x∼Q}[(P(x)/Q(x))^α].
For α = 1 and α = ∞ the Rényi divergence is extended by taking a limit. In particular, the limit α → 1 gives the Kullback–Leibler divergence,
D1(P ‖ Q) = KL(P ‖ Q) = E_{x∼P}[log(P(x)/Q(x))],
and the limit α → ∞ gives
D∞(P ‖ Q) = log ess sup_P (P(x)/Q(x)),
with the conventions that 0/0 = 0 and x/0 = ∞ for x > 0.

3.2 Learning Theory

We use standard notation from statistical learning (e.g., [SB14]). Given a hypothesis h : X → {0,1}, the empirical loss of h with respect to a sample S = {(x1, y1), . . . , (xm, ym)} is defined as L_S(h) = (1/m) Σ_{i=1}^{m} 1[h(xi) ≠ yi]. A learning rule A is interpolating if for every input sample S, P_{h∼A(S)}[L_S(h) = 0] = 1. The population loss of h with respect to a population distribution D over X × {0,1} is defined as L_D(h) = P_{(x,y)∼D}[h(x) ≠ y]. A population D over labeled examples is realizable with respect to a class H if inf_{h∈H} L_D(h) = 0. We denote the set of all realizable population distributions of a class H by Realizable(H). Given a learning rule A and an input sample S of size m, the population loss of A(S) with respect to a population D is defined as E_{h∼A(S)}[L_D(h)]. A hypothesis class H is Probably Approximately Correct (PAC) learnable if there exists a learning rule A such that for all D ∈ Realizable(H) and for all m ∈ N, we have E_{S∼D^m}[L_D(A(S))] ≤ ε(m), where lim_{m→∞} ε(m) = 0.

Theorem 3.2 (PAC-Bayes Bound; [McA99, LSM01, McA03]; Theorem 31.1 in [SB14]). Let X be a set, let H ⊆ {0,1}^X, and let D ∈ Δ(X × {0,1}). For any β ∈ (0, 1) and for any P ∈ Δ(H), with probability at least 1 − β over the choice of S ∼ D^m, we have
∀Q ∈ Δ(H) : L_D(Q) ≤ L_S(Q) + √((KL(Q ‖ P) + ln(m/β)) / (2(m − 1))).
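As a quick illustration (ours, with made-up numbers) of how Theorem 3.2 is used in this paper, the sketch below evaluates the right-hand side of the bound for an interpolating learner whose KL term is O(log m), which is the regime of Theorem 2.1, Item 4; the resulting population-loss guarantee decays at the O(√(log(m)/m)) rate.

```python
# Sanity-check sketch (ours): evaluate the PAC-Bayes bound of Theorem 3.2 for
# illustrative, made-up values of the KL term, sample size, and confidence.
import math

def pac_bayes_bound(empirical_loss, kl, m, beta):
    """Right-hand side of Theorem 3.2: L_S(Q) + sqrt((KL + ln(m/beta)) / (2(m-1)))."""
    return empirical_loss + math.sqrt((kl + math.log(m / beta)) / (2 * (m - 1)))

m = 10_000
# An interpolating learner (empirical loss 0) with KL(Q || P) = 3 * log(m):
print(pac_bayes_bound(0.0, 3 * math.log(m), m, beta=1 / m))  # ~0.05, an O(sqrt(log(m)/m)) bound
```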
3.3 Definitions of Stability

Throughout the following section, let X be a set called the domain, let H ⊆ {0,1}^X be a hypothesis class, and let m ∈ N be a sample size. A randomized learning rule, or a learning rule for short, is a function A : (X × {0,1})* → Δ({0,1}^X) that takes a training sample and outputs a distribution over hypotheses. A population distribution is a distribution D ∈ Δ(X × {0,1}) over labeled domain elements, and a prior distribution is a distribution P ∈ Δ({0,1}^X) over hypotheses.

3.3.1 Differential Privacy

Differential privacy is a property of an algorithm that guarantees that the output will not reveal any meaningful amount of information about individual people that contributed data to the input (training data) used by the algorithm. See [DR14] for an introduction.

Definition 3.3. Let ε, δ ∈ R≥0, and let P and Q be two probability measures over a measurable space (Ω, F). We say that P and Q are (ε, δ)-indistinguishable, and write P ≈_{ε,δ} Q, if for every event O ∈ F,
P(O) ≤ e^ε Q(O) + δ and Q(O) ≤ e^ε P(O) + δ.

Definition 3.4 (Differential Privacy; [DR14]). Let ε, δ ∈ R≥0. A learning rule A is (ε, δ)-differentially private if for every pair of training samples S, S′ ∈ (X × {0,1})^m that differ on a single example, A(S) and A(S′) are (ε, δ)-indistinguishable. Typically, ε is chosen to be a small constant (e.g., ε ≤ 0.1) and δ is negligible (i.e., δ(m) = m^{−ω(1)}). When δ = 0 we say that A satisfies pure differential privacy.

Definition 3.5 (Private PAC Learning). H is privately learnable or DP learnable if it is PAC learnable by a learning rule A which is (ε(m), δ(m))-differentially private, where ε(m) ≤ 1 and δ(m) = m^{−ω(1)}. H is pure DP learnable if the same holds with δ(m) = 0.

3.3.2 Dα-Stability and KL-Stability

Definition 3.6 (Dα-Stability). Let α ∈ [1, ∞]. Let A be a learning rule, and let f : N → R and β : N → [0, 1] satisfy f(m) = o(m) and β(m) = o(1).
1. A is distribution-independent Dα-stable if
∃ prior P ∀ population D ∀m ∈ N : P_{S∼D^m}[Dα(A(S) ‖ P) ≤ f(m)] ≥ 1 − β(m).
2. A is distribution-dependent Dα-stable if
∀ population D ∃ prior P_D ∀m ∈ N : P_{S∼D^m}[Dα(A(S) ‖ P_D) ≤ f(m)] ≥ 1 − β(m).
The function f is called the divergence bound and β is called the confidence. The special case of α = 1 is referred to as KL-stability [McA99].

3.3.3 Perfect Generalization

Definition 3.7 (One-Way Perfect Generalization). Let A be a learning rule, and let β : N → [0, 1] satisfy β(m) = o(1).
1. Let ε : N → R satisfy ε(m) = o(m). A is ε-pure perfectly generalizing with confidence β if
∃ prior P ∀ population D ∀m ∈ N : P_{S∼D^m}[∀O : A(S)(O) ≤ e^{ε(m)} P(O)] ≥ 1 − β(m).
2. ([CLN+16]:) Let ε, δ ∈ R≥0. A is (ε, δ)-approximately perfectly generalizing with confidence β if
∀ population D ∃ prior P_D ∀m ∈ N : P_{S∼D^m}[∀O : A(S)(O) ≤ e^ε P_D(O) + δ] ≥ 1 − β(m).

3.3.4 Replicability

Definition 3.8 (Replicability; [BGH+23, ILPS22]). Let ρ ∈ R_{>0} and let R be a distribution over random strings. A learning rule A is ρ-replicable if
∀ population D, ∀m : P_{S1,S2∼D^m, r∼R}[A(S1; r) = A(S2; r)] ≥ ρ,
where r represents the random coins of A.

Remark 3.9. Note that both in [BGH+23] and in [ILPS22] the definition of ρ-replicability is slightly different. In their definition, they treat the parameter ρ as the failure probability, i.e., A is a ρ-replicable learning rule by their definition if the probability that A(S1; r) = A(S2; r) is at least 1 − ρ. There exists an alternative 2-parameter definition of replicability introduced in [ILPS22].

Definition 3.10 ((η, ν)-Replicability; [BGH+23, ILPS22]). Let η, ν ∈ R_{>0} and let R be a distribution over random strings. Coin tosses r are η-good for a learning rule A with respect to a population distribution D if there exists a canonical output h_r such that for every m, P_{S∼D^m}[A(S; r) = h_r] ≥ η. A learning rule A is (η, ν)-replicable if
∀ population D : P_{r∼R}[r is η-good] ≥ ν.

3.3.5 Global Stability

Definition 3.11 (Global Stability; [BLM20]). Let η > 0 be a global stability parameter. A learning rule A is (m, η)-globally stable with respect to a population distribution D if there exists a canonical output h such that P[A(S) = h] ≥ η, where the probability is over S ∼ D^m as well as the internal randomness of A.

3.3.6 MI-Stability

Definition 3.12 (Mutual Information Stability; [XR17, BMN+18]). A learning rule A is MI-stable if there exists f : N → N with f(m) = o(m) such that
∀ population D ∀m ∈ N : I(A(S); S) ≤ f(m),
where S ∼ D^m.

3.3.7 TV-Stability

Definition 3.13 (TV-Stability; Appendix A.3.1 in [KKMV23]).
Let A be a learning rule, and let f : N N satisfy f(m) = o(1). 1. A is distribution-independent TV-stable if prior P population D m N : ES Dm[TV(A(S), P)] f(m). 2. A is distribution-dependent TV-stable if population D prior PD m N : ES Dm[TV(A(S), PD)] f(m). 3.3.8 Max Information Definition 3.14. Let A be a learning rule, and let ε, δ R 0. A has (ε, δ)-max-information with respect to product distributions if for every event O we have P[(A(S), S) O] eεP[(A(S), S ) O] + δ where are S, S are independent samples drown i.i.d from a population distribution D. Acknowledgements SM is a Robert J. Shillman Fellow; he acknowledges support by ISF grant 1225/20, by BSF grant 2018385, by an Azrieli Faculty Fellowship, by Israel PBC-VATAT, by the Technion Center for Machine Learning and Intelligent Systems (MLIS), and by the the European Union (ERC, GENERALIZATION, 101039692). HS acknowledges support by ISF grant 1225/20, and by the the European Union (ERC, GENERALIZATION, 101039692). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them. JS was supported by DARPA (Defense Advanced Research Projects Agency) contract #HR001120C0015 and the Simons Collaboration on The Theory of Algorithmic Fairness. [ABL+22] Noga Alon, Mark Bun, Roi Livni, Maryanthe Malliaris, and Shay Moran. Private and online learnability are equivalent. Journal of the ACM, 69(4):28:1 28:34, 2022. doi:10.1145/3526074. [ALMM19] Noga Alon, Roi Livni, Maryanthe Malliaris, and Shay Moran. Private PAC learning implies finite Littlestone dimension. In Moses Charikar and Edith Cohen, editors, Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, Phoenix, AZ, USA, June 23-26, 2019, pages 852 860. ACM, 2019. doi:10.1145/3313276.3316312. [AMSY23] Noga Alon, Shay Moran, Hilla Schefler, and Amir Yehudayoff. A unified characterization of private learnability via graph theory. Co RR, abs/2304.03996, 2023, 2304.03996. doi:10.48550/ar Xiv.2304.03996. [AUZ23] Hilal Asi, Jonathan R. Ullman, and Lydia Zakynthinou. From robustness to privacy and back. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 1121 1146. PMLR, 2023. URL https://proceedings.mlr.press/v202/asi23b.html. [BE02] Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499 526, 2002. URL http://jmlr.org/papers/v2/ bousquet02a.html. [BGH+23] Mark Bun, Marco Gaboardi, Max Hopkins, Russell Impagliazzo, Rex Lei, Toniann Pitassi, Satchit Sivakumar, and Jessica Sorrell. Stability is stable: Connections between replicability, privacy, and adaptive generalization. In Barna Saha and Rocco A. Servedio, editors, Proceedings of the 55th Annual ACM Symposium on Theory of Computing, STOC 2023, Orlando, FL, USA, June 20-23, 2023, pages 520 527. ACM, 2023. doi:10.1145/3564246.3585246. [BLM20] Mark Bun, Roi Livni, and Shay Moran. An equivalence between private classification and online prediction. In Sandy Irani, editor, 61st IEEE Annual Symposium on Foundations of Computer Science, FOCS 2020, Durham, NC, USA, November 16-19, 2020, pages 389 402. IEEE, 2020. doi:10.1109/FOCS46700.2020.00044. 
[BMN+18] Raef Bassily, Shay Moran, Ido Nachum, Jonathan Shafer, and Amir Yehudayoff. Learners that use little information. In Firdaus Janoos, Mehryar Mohri, and Karthik Sridharan, editors, Algorithmic Learning Theory, ALT 2018, 7-9 April 2018, Lanzarote, Canary Islands, Spain, volume 83 of Proceedings of Machine Learning Research, pages 25 55. PMLR, 2018. URL http://proceedings.mlr.press/v83/bassily18a.html. [BPS09] Shai Ben-David, Dávid Pál, and Shai Shalev-Shwartz. Agnostic online learning. In COLT 2009 The 22nd Conference on Learning Theory, Montreal, Quebec, Canada, June 18-21, 2009, 2009. URL http://www.cs.mcgill.ca/%7Ecolt2009/papers/ 032.pdf#page=1. [Bre96] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123 140, 1996. doi:10.1007/BF00058655. [CLN+16] Rachel Cummings, Katrina Ligett, Kobbi Nissim, Aaron Roth, and Zhiwei Steven Wu. Adaptive learning with robust generalization guarantees. In Vitaly Feldman, Alexander Rakhlin, and Ohad Shamir, editors, Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York, USA, June 23-26, 2016, volume 49 of JMLR Workshop and Conference Proceedings, pages 772 814. JMLR.org, 2016. URL http://proceedings.mlr.press/v49/cummings16.html. [DFH+15a] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. The reusable holdout: Preserving validity in adaptive data analysis. Science, 349(6248):636 638, 2015. [DFH+15b] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Leon Roth. Preserving statistical validity in adaptive data analysis. In Rocco A. Servedio and Ronitt Rubinfeld, editors, Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, STOC 2015, Portland, OR, USA, June 14-17, 2015, pages 117 126. ACM, 2015. doi:10.1145/2746539.2746580. [DKM+06] Cynthia Dwork, Krishnaram Kenthapadi, Frank Mc Sherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In Serge Vaudenay, editor, Advances in Cryptology EUROCRYPT 2006, 25th Annual International Conference on the Theory and Applications of Cryptographic Techniques, St. Petersburg, Russia, May 28 - June 1, 2006, Proceedings, volume 4004 of Lecture Notes in Computer Science, pages 486 503. Springer, 2006. doi:10.1007/11761679_29. [DMNS06] Cynthia Dwork, Frank Mc Sherry, Kobbi Nissim, and Adam D. Smith. Calibrating noise to sensitivity in private data analysis. In Shai Halevi and Tal Rabin, editors, Theory of Cryptography, Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7, 2006, Proceedings, volume 3876 of Lecture Notes in Computer Science, pages 265 284. Springer, 2006. doi:10.1007/11681878_14. [DR14] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211 407, 2014. doi:10.1561/0400000042. [EGI20] Amedeo Roberto Esposito, Michael Gastpar, and Ibrahim Issa. Robust generalization via f-mutual information. In IEEE International Symposium on Information Theory, ISIT 2020, Los Angeles, CA, USA, June 21-26, 2020, pages 2723 2728. IEEE, 2020. doi:10.1109/ISIT44484.2020.9174117. [Fre95] Yoav Freund. Boosting a weak learning algorithm by majority. Information and Computation, 121(2):256 285, 1995. doi:10.1006/INCO.1995.1136. [FS97] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119 139, 1997. 
doi:10.1006/JCSS.1997.1504. [Han58] James Hannan. Approximation to Bayes risk in repeated play. In Contributions to the Theory of Games (AM-39), Volume III, pages 97 140, Princeton, 1958. Princeton University Press. doi:doi:10.1515/9781400882151-006. [HKMN23] Samuel B. Hopkins, Gautam Kamath, Mahbod Majid, and Shyam Narayanan. Robustness implies privacy in statistical estimation. In Barna Saha and Rocco A. Servedio, editors, Proceedings of the 55th Annual ACM Symposium on Theory of Computing, STOC 2023, Orlando, FL, USA, June 20-23, 2023, pages 497 506. ACM, 2023. doi:10.1145/3564246.3585115. [HKRR18] Úrsula Hébert-Johnson, Michael P. Kim, Omer Reingold, and Guy N. Rothblum. Multicalibration: Calibration for the (computationally-identifiable) masses. In Jennifer G. Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 1944 1953. PMLR, 2018. URL http://proceedings.mlr.press/v80/hebert-johnson18a.html. [ILPS22] Russell Impagliazzo, Rex Lei, Toniann Pitassi, and Jessica Sorrell. Reproducibility in learning. In Stefano Leonardi and Anupam Gupta, editors, STOC 22: 54th Annual ACM SIGACT Symposium on Theory of Computing, Rome, Italy, June 20 - 24, 2022, pages 818 831. ACM, 2022. doi:10.1145/3519935.3519973. [KKMV23] Alkis Kalavasis, Amin Karbasi, Shay Moran, and Grigoris Velegkas. Statistical indistinguishability of learning algorithms. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 15586 15622. PMLR, 2023. URL https://proceedings.mlr.press/v202/ kalavasis23a.html. [Lit87] Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linearthreshold algorithm. Machine Learning, 2(4):285 318, 1987. doi:10.1007/BF00116827. [LM20] Roi Livni and Shay Moran. A limitation of the PAC-Bayes framework. In Hugo Larochelle, Marc Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/ hash/ec79d4bed810ed64267d169b0d37373e-Abstract.html. [LS20] Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020. doi:10.1017/9781108571401. [LSM01] John Langford, Matthias W. Seeger, and Nimrod Megiddo. An improved predictive accuracy bound for averaging classifiers. In Carla E. Brodley and Andrea Pohoreckyj Danyluk, editors, Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), Williams College, Williamstown, MA, USA, June 28 - July 1, 2001, pages 290 297. Morgan Kaufmann, 2001. [Mc A99] David A. Mc Allester. Some PAC-Bayesian theorems. Machine Learning, 37(3):355 363, 1999. doi:10.1023/A:1007618624809. [Mc A03] David A. Mc Allester. Simplified PAC-Bayesian margin bounds. In Bernhard Schölkopf and Manfred K. Warmuth, editors, Computational Learning Theory and Kernel Machines, 16th Annual Conference on Computational Learning Theory and 7th Kernel Workshop, COLT/Kernel 2003, Washington, DC, USA, August 24-27, 2003, Proceedings, volume 2777 of Lecture Notes in Computer Science, pages 203 215. Springer, 2003. 
doi:10.1007/978-3-540-45167-9_16. [MM22] Maryanthe Malliaris and Shay Moran. The unstable formula theorem revisited. Co RR, abs/2212.05050, 2022, 2212.05050. doi:10.48550/ar Xiv.2212.05050. [MRT18] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. Adaptive computation and machine learning. MIT Press, second edition, 2018. URL https://mitpress.mit.edu/9780262039406. [Phi62] David L. Phillips. A technique for the numerical solution of certain integral equations of the first kind. Journal of the ACM, 9(1):84 97, 1962. doi:10.1145/321105.321114. [PNG22] Aditya Pradeep, Ido Nachum, and Michael Gastpar. Finite Littlestone dimension implies finite information complexity. In IEEE International Symposium on Information Theory, ISIT 2022, Espoo, Finland, June 26 - July 1, 2022, pages 3055 3060. IEEE, 2022. doi:10.1109/ISIT50566.2022.9834457. [PW] Yury Polyanskiy and Yihong Wu. Information Theory: From Coding to Learning. Cambridge University Press. URL https://people.lids.mit.edu/yp/homepage/ data/itbook-export.pdf. Unpublished manuscript, 2023. [Rén61] Alfréd Rényi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, volume 4, pages 547 562. University of California Press, 1961. [SB14] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014. doi:https://doi.org/10.1017/CBO9781107298019. [SF12] Robert E. Schapire and Yoav Freund. Boosting: Foundations and Algorithms. MIT Press, 2012. doi:https://doi.org/10.7551/mitpress/8291.001.0001. [She90] Saharon Shelah. Classification Theory: And the Number of Non-Isomorphic Models, volume 92 of Studies in Logic and the Foundations of Mathematics. North-Holland, second edition, 1990. [Tik43] A. N. Tikhonov. On the stability of inverse problems. Proceedings of the USSR Academy of Sciences, 39:195 198, 1943. URL https://api.semanticscholar. org/Corpus ID:202866372. [v EH14] Tim van Erven and Peter Harremoës. Rényi divergence and Kullback-Leibler divergence. IEEE Trans. Inf. Theory, 60(7):3797 3820, 2014. doi:10.1109/TIT.2014.2320500. [XR17] Aolin Xu and Maxim Raginsky. Information-theoretic analysis of generalization capability of learning algorithms. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 2524 2533, 2017. URL https://proceedings.neurips.cc/paper/2017/ hash/ad71c82b22f4f65b9398f76d8be4c615-Abstract.html. A Proof for Theorem 2.2 (Stability Boosting) A.1 Information Theoretic Preliminaries Lemma A.1 (Monotonicity of Rényi divergence; Theorem 3 in [v EH14]). Let 0 α < β . Then Dα(P Q) Dβ(P Q). Furthermore, the inequality is an equality if and only if P equals the conditional Q( | A) for some event A. Lemma A.2 (Data Processing Inequality; Theorem 9 and Eq. 13 in [v EH14]). Let α [0, ]. Let X and Y be random variables, and let FY |X be the law of Y given X. Let PY , QY be the distributions of Y when X is sampled from PX, QX, respectively. Then Dα(PY QY ) Dα(PX QX). One interpretation of this is that processing an observation makes it more difficult to determine whether it came from PX or QX. Definition A.3 (Conditional KL-divergence; Definition 2.12 in [PW]). 
Given joint distributions P(x, y), Q(x, y), the KL-divergence of the marginals P(y|x), Q(y|x) is KL(P(y|x) Q(y|x)) = X y P(y|x) log P(y|x) Lemma A.4 (Chain Rule for KL-divergence; Theorem 2.13 in [PW]). Let P(x, y), Q(x, y) be joint distributions. Then, KL(P(x, y) Q(x, y)) = KL(P(x) Q(x)) + KL(P(y|x) Q(y|x)). Lemma A.5 (Conditioning increases KL-divergence; Theorem 2.14(e) in [PW]). For a distribution PX and conditional distributions PY |X, QY |X, let PY = PY |X PX and QY = QY |X PX, where denotes composition (see Section 2.4 in [PW]) Then KL(PY QY ) KL PY |X QY |X PX , with equality if and only if KL PX|Y QX|Y PY = 0. A.2 Online Learning Preliminaries Following is some basic background on the topic of online learning with expert advice. This will be useful in the proof of Theorem 2.2. Let Z = {z1, . . . , zm} be a set of experts and I be a set of instances. For any instance i I and expert z Z, following the advice of expert z on instance i provides utility u(z, i) {0, 1}. The online learning setting is a perfect-information, zero-sum game between two players, a learner and an adversary. In each round t = 1, . . . , T: 1. The learner chooses a distribution wt (Z) over the set of experts. 2. The adversary chooses an instance it I. 3. The learner gains utility ut = Ez wt[u(z, it)]. The total utility of a learner strategy L for the sequence of instances chosen by the adversary is The regret of the learner is the difference between the utility of the best expert and the learner s utility. Namely, for each z Z, let t=1 u(z, it) be the utility the learner would have gained had they chosen wt(z) = 1 (z = zj) for all t [T]. Then the regret is Regret(L, T) = max z Z U(z, T) U(L, T). There are several well-studied algorithms for online learing using expert advice that guarantee regret sublinear in T for every possible sequence of T instances. A classic example is the Multiplicative Weights algorithm (e.g., Section 21.2 in [SB14]), which enjoys the following guarantee. Theorem A.6 (Online Regret Bound). In the setting of online learning with expert advice, there exists a learner strategy L such that for any sequence of T instances selected by the adversary, Regret(L, T) p where m is the number of experts. Theorem (Theorem 2.2, Restatement). Let X be a set, let H {0, 1}X be a hypothesis class, and let A be a learning rule. Assume there exists k N and γ > 0 such that D Realizable(H) : ES Dk[LD(A(S))] 1 and there exists P {0, 1}X and b 0 such that D Realizable(H) : ES Dk[KL(A(S) P)] b. (7) Then, there exists an interpolating learning rule A that PAC learns H with logarithmic KL-stability. More explicitly, there exists a prior distribution P {0, 1}X and function b and ε that depend on γ and b such that D Realizable(H) m N : PS Dm[KL(A (S) P ) b (m) = O(log(m))] = 1, (8) ES Dm[LD(A (S))] ε (m) = O Assumptions: γ, b > 0; m, k N. S = {(x1, y1), . . . , (xm, ym)} is an H-realizable sample. OS is the online learning algorithm of Appendix A.2, using expert set S. T = 8 log(m)/γ2 + 1. A satisfies Eqs. (6) and (7) (with respect to k, b, γ). for t = 1, . . . , T: wt expert distribution chosen by OS for round t do: sample St (wt)k while KL(A(St) P) 2b/γ See Remark A.7 ft A(St) OS receives instance ft and gains utility E(x,y) wt[1(ft(x) = y)] return Maj(f1, . . . , f T ) Algorithm 1: The stability-boosted learning rule A , which uses A as a subroutine. Proof of Theorem 2.2. Let D Realizable(H) and m N. Learning rule A operates as follows. Given a sample S = {(x1, y1), . . . 
, (xm, ym)}, A simulates an online learning game, in which S is the set of experts , F = {0, 1}X is the set of instances , and the learner s utility for playing expert (x, y) on instance f F is 1(f(x) = y). Namely, in this game the learner is attempting to select an (x, y) pair that disagrees with the instance f. In this simulation, the learner executes an instance of the online learning algorithm of Appendix A.2 with expert set S. Denote this instance OS. The adversary s strategy is as follows. Recall that at each round t, OS chooses a distribution wt over S. Note that if S is realizable then so is wt. At each round t, the adversary selects an instance f F by executing A on a training set sampled from wt, as in Algorithm 1. We prove the following: 1. A interpolates, namely P[LS(A (S)) = 0] = 1. 2. A has logarithmic KL-stability, as in Eq. (8). 3. A PAC learns H as in Eq. (9). For Item 1, assume for contradiction that A does not interpolate. Seeing as A outputs Maj(f1, . . . , f T ), there exists an index i [m] such that t=1 1(ft(xi) = yi) = U(i, T), (10) where U(i, T) is the utility of always playing expert i throughout the game. Let Et denote the event that St was resampled (i.e., there were multiple iterations of the do-while loop in round t). Eq. (7) and Markov s inequality imply P[Et] = P[KL(A(St) P) 2b/γ] γ/2. (11) The utility of OS at time t is u OS t = E St (wt)k ft A(St) (x,y) wt [1(ft(x) = y)] E St (wt)k[Lwt(A(St)) | Et] + P[Et] 1 where the last inequality follows from Eqs. (6) and (11). Hence, the utility of OS throughout the game is t=1 u OS t 1 Combining Eqs. (10) and (12) and Theorem A.6 yields γ 2 T U(i, T) U(OS, T) Regret(OS, T) p which is a contradiction for our choice of T. This establishes Item 1. For Item 2, for every ℓ N let P ℓ {0, 1}X be the distribution of Maj(g1, . . . , gℓ), where (g1, . . . , gℓ) Pℓ. Let P = 1 z P ℓ=1 P ℓ/ℓ2 where z = P ℓ=1 1/ℓ2 = π2/6 is a normalization factor. For any S (X {0, 1})m, KL(A (S) P T ) = KL(Maj(f1, . . . , f T ) Maj(g1, . . . , g T )) KL((f1, . . . , f T ) (g1, . . . , g T )) (By Lemma A.2) t=1 KL((ft|f 0 return h Algorithm A terminates with probability 1, because for any realizable sample S of size m N and any t N, P[A did not terminate after t iterations] = (Ph P[LS(h) > 0])t 1 1 q(m) where the inequality follows by Eq. (15). To complete the proof, we show that A satisfies (i), (ii) and (iii) in Item 4. Item (i) is immediate from the construction of A. For Item (ii), let m N. For any sample S of size m and hypothesis h CS, QS(h) = P(h | CS) = P({h} CS) P(CS) q(m) P(h), (16) where the inequality follows from Eq. (15). Hence, D (QS P) = log ess sup QS log ess sup QS (from Eq. (16) and QS(CS) = 1) log(q(m)) = O(log(m)). Item (ii) follows from monotonicity of Dα with respect to α (Lemma A.1). In particular, KL(QS P) = O(log(m)). Item (iii) follows from the PAC-Bayes theorem (Theorem 3.2). Indeed, take β = 1 m and note that LS(QS) = 0 for all realizable S. Then for any H-realizable distribution D, KL(QS P) + 2 ln m This implies that for any H-realizable distribution D, E S Dm[LD(A(S))] 1 KL(QS P) + 2 ln m as desired. Remark B.3. The furthermore section of Lemma A.1 implies that in the foregoing proof, Dα(QS P) = Dβ(QS P) for any α, β [0, ]. B.3 DI Rényi-Stability = Finite Fractional Clique Dimension Lemma B.4. In the context of Theorem 2.1: Item 3 = Item 2. Proof of Lemma B.4. By Theorem B.1 and Eq. 
(14) it suffices to show that there exist m N and a prior P such that for every H-realizable sample S (X {0, 1})m, P h P[LS(h) = 0] > 1 By the assumption (Item 3) and Theorem 2.2, there exists an interpolating learning rule A , a prior P , and a constant C > 0 such that for every D Realizable(H), the equality PS Dm[KL(A (S) P ) C log(m)] = 1 (18) holds for all m N large enough. Fix such an m. We show that taking P = P satisfies Eq. (17) for this m. Let Q denote the distribution of A (S ) where S (U(S))m = PS , U(S) is the uniform distribution over S, and m = m ln(4m). The proof follows by noting that if KL(Q P ) is small then one can lower bounding the probability of an event according to P by its probability according to Q. To see that the KL is indeed small, let PA (S ),S and PH ,S be two joint distributions. The variable S has marginal PS in both distributions, A (S ) Q depends on S , but H P is independent of S . Then, KL(Q P ) = KL PA (S ) PH KL PA (S )|S PH |S PS (Lemma A.5) = KL PA (S )|S PH PS (H S ) = ES [KL(A (S ) P )] (Definition of conditional KL) C log(m). (By Eq. (18) and choice of m) (19) Taking k = 2C log(m), holds by Markov s inequality and the definition of the KL divergence. We are interested in the probability of the event E = h {0, 1}X : LS(h) = 0 . Because A is interpolating, Q(E) PS (U(S))m h A (S ) [S S ] 1 m 1 1 Finally, we lower bound P (E) as follows. P (E) P h P = P h P E P (h) 2 k Q(h) P h Q E P (h) 2 k Q(h) 2 k k 2 k. (De Morgan s + union bound) 4 2 k = 1 4m2C = 1 poly(m). (By Eqs. (20) and (21) and choice of k) This establishes Eq. (17), as desired. C Proof of Theorem 1.4 (DD Equivalences) C.1 Preliminaries C.1.1 Littlestone Dimension The Littlestone dimension is a combinatorial parameter which captures mistake and regret bounds in online learning [Lit87, BPS09]. Definition C.1 (Mistake Tree). A mistake tree is a binary decision tree whose nodes are labeled with instances from X and edges are labeled by 0 or 1 such that each internal node has one outgoing edge labeled 0 and one outgoing edge labeled 1. A root-to-leaf path in a mistake tree can be described as a sequence of labeled examples (x1, y1), . . . , (xd, yd). The point xi is the label of the i-th internal node in the path, and yi is the label of its outgoing edge to the next node in the path. Definition C.2 (Shattering). Let H be a hypothesis class and let T be a mistake tree. H shatters T if every root-to-leaf path in T is realizable by H. Definition C.3 (Littlestone Dimension). Let H be a hypothesis class. The Littlestone dimension of H, denoted LD(H), is the largest number d such that there exists a complete mistake tree of depth d shattered by H. If H shatters arbitrarily deep mistake trees then LD(H) = . C.1.2 Clique Dimension Definition C.4 (Clique; [AMSY23]). Let H be a hypothesis class and let m N. A clique in H of order m is a family S of realizable samples of size m such that (i) |S| = 2m; (ii) every two distinct samples S , S S contradicts, i.e., there exists a common example x X such that (x, 0) S and (x, 1) S . Definition C.5 (Clique Dimension; [AMSY23]). Let H be a hypothesis. The clique dimension of H, denoted CD(H), is the largest number m such that H contains a clique of order m. If H contains cliques of arbitrary large order then we write CD(H) = . C.2 Global Stability = Replicability Lemma C.6. Let H be a hypothesis class and let A be a (m, η)-globally stable learner for H. Then, A is an η-replicable learner for H. 
This follows immediately by noting that global stability is equivalent to 2-parameters replicability, which is qualitatively equivalent to 1-parameter replicability [ILPS22]. Lemma C.7 ([ILPS22]). For every ρ, η, ν [0, 1], 1. Every ρ-replicable algorithm is also ρ ν 1 ν , ν -replicable. 2. Every (η, ν)-replicable algorithm is also (η + 2ν 2)-replicable. Proof of Lemma C.6. By the assumption, there exists an hypothesis h such that for every population D, we have PR R[PS Dm[A(S; r) = h] η] = 1. Hence A is (η, 1)-replicable, and by Lemma C.7 it is also η-replicable. C.3 DD KL-Stability = Finite Littlestone Dimension Lemma C.8. Let H be a hypothesis class that is distribution-dependent KL-stable. Then H has finite Littlestone dimension. This lemma is an immediate result of the relation between thresholds and the Littlestone dimension, and the fact that the class of thresholds on the natural numbers does not admit any learning rule that satisfies a non-vacuous PAC-Bayes bound [LM20]. The next lemma is a corollary of Theorem 2 in [LM20]. Theorem C.9 (Corollary of Theorem 2 [LM20]). Let m N and let N N. Then, there exists n N large enough such that the following holds. For every learning rule A of the class of thresholds over [n], Hn = {1[x>k] : [n] {0, 1} | k [n]}, there exists a realizable population distribution D = DA such that for any prior distribution P, KL(A(S) P) > N or, LD(A(S)) > 1 Theorem C.10 (Littlestone dimension and thresholds [She90]). Let H be a hypothesis class. Then, 1. If LD(H) d then H contains log d thresholds. 2. If H contains d thresholds then LD(H) log d . Proof of Lemma C.8. If by contradiction the Littlestone dimension of H is unbounded, then by Theorem C.10, H contains a copy of Hn, the class of thresholds over [n], for arbitrary large n s. Hence, by Theorem C.9 H does not admit a PAC learner that is KL-stable. C.4 MI-Stability = DD KL-Stability Lemma C.11. Let H be a hypothesis class and let A be a mutual information stable learner with information bound f(m) = o(1). (I.e. for every population distribution D, I(A(S); S) f(m) where S Dm.) Then, A is a distribution-dependent KL-stable learner with KL bound g(m) = p f(m) m and confidence β(m) = p The following statement is an immediate corollary. Corollary C.12. Let H be a hypothesis class that is mutual information stable. Then H is distributiondependent KL-stable. Proof of Lemma C.11. Let D be a population distribution. Define a prior distribution PD = ES[A(S)], i.e. PD(h) = PS Dm[A(S) = h]. We will show that A is KL stable with respect to the prior PD. We use the identity I(X; Y ) = KL(PX,Y , PXPY ). Let PA(S),S be the joint distribution of the training sample S and the hypothesis selected by A when given S as an input, and let PA(S)PS be the product of the marginals. Note that PA(S)PS is equal in distribution to PA(S )PS, where S is an independent copy of S. Hence, I(A(S); S) = KL(PA(S),S, PA(S)PS) = KL(PA(S)|SPS, PA(S )PS), = KL(PS, PS) + Es PS KL(PA(S)|S=s, PA(S )|S=s) (Chain rule) = Es PS KL(PA(S)|S=s, PA(S )|S=s) = Es PS KL(PA(S)|S=s, PA(S )) . Note that PA(S ) and the prior PD are identically distributed, and PA(S)|S=s is exactly the posterior produced by A given the input sample s. By Markov s inequality, h KL(A(S) PD) p m I(A(S); S) i I(A(S); S) p m I(A(S); S) Since I(A(S); S) f(m), by Eq. (22) h KL(A(S) PD) p Note that since f(m) = o(m), indeed p f(m)/m m 0 and p f(m) m = o(m). C.5 Finite Littlestone Dimension = MI-Stability Lemma C.13. Let H be a hypothesis class with finite Littlestone dimension. 
Then H admits an information stable learner. This lemma is a direct result of Theorem 2 in [PNG22]. Definition C.14. The information complexity of a hypothesis class H is IC(H) = sup |S| inf A sup D I(A(S); S) where the supremum is over all sample sizes |S| N and the infimum is over all learning rules that PAC learn H. Theorem C.15 (Theorem 2 [PNG22]). Let H be a hypothesis class of with Littlestone dimension d. Then the information complexity of H is bounded by IC(H) 2d + log(d + 1) + 3 + 3 e ln 2. Proof of Lemma C.13. Since finite information complexity implies that H admits an information stable learner, the proof follows from Theorem C.15