On the Computational Landscape of Replicable Learning

Alkis Kalavasis (Yale University, alvertos.kalavasis@yale.edu), Amin Karbasi (Yale University, amin.karbasi@yale.edu), Grigoris Velegkas (Yale University, grigoris.velegkas@yale.edu), Felix Zhou (Yale University, felix.zhou@yale.edu)

We study computational aspects of algorithmic replicability, a notion of stability introduced by Impagliazzo, Lei, Pitassi, and Sorrell [2022]. Motivated by a recent line of work that established strong statistical connections between replicability and other notions of learnability such as online learning, private learning, and SQ learning, we aim to understand better the computational connections between replicability and these learning paradigms. Our first result shows that there is a concept class that is efficiently replicably PAC learnable, but, under standard cryptographic assumptions, no efficient online learner exists for this class. Subsequently, we design an efficient replicable learner for PAC learning parities when the marginal distribution is far from uniform, making progress on a question posed by Impagliazzo et al. [2022]. To obtain this result, we design a replicable lifting framework inspired by Blanc, Lange, Malik, and Tan [2023] that transforms, in a black-box manner, efficient replicable PAC learners under the uniform marginal distribution over the Boolean hypercube into replicable PAC learners under any marginal distribution, with sample and time complexity that depends on a certain measure of the complexity of the distribution. Finally, we show that any pure DP learner can be transformed to a replicable one in time polynomial in the accuracy and confidence parameters, but exponential in the representation dimension of the underlying hypothesis class.

1 Introduction

The replicability crisis is omnipresent in many scientific disciplines including biology, chemistry, and, importantly, AI [Baker, 2016, Pineau et al., 2019]. A recent article that appeared in Nature [Ball, 2023] explains how the reproducibility crisis witnessed in AI has a cascading effect across many other scientific areas due to its widespread applications in other fields, like medicine. Thus, a pressing task is to design a formal framework through which we can argue about the replicability of experiments in ML. Such an attempt was initiated recently by the pioneering work of Impagliazzo et al. [2022], who proposed a definition of replicability as a property of learning algorithms.

Definition 1.1 (Replicable Algorithm; Impagliazzo et al., 2022). Let $R$ be a distribution over random strings. A learning algorithm $\mathcal{A}$ is $n$-sample $\rho$-replicable under distribution $D$ if for two independent sets $S, S' \sim D^n$ it holds that $\Pr_{S, S' \sim D^n, r \sim R}[\mathcal{A}(S, r) = \mathcal{A}(S', r)] \geq 1 - \rho$. We will say that $\mathcal{A}$ is replicable if the above holds uniformly over all distributions $D$.

We emphasize that the random string $r$ is shared across the two executions, and this aspect of the definition is crucial in designing algorithms that have the same output under two different i.i.d. inputs. Indeed, both Dixon et al. [2023] and Chase et al. [2023b] demonstrated learning tasks for which there are no algorithms that satisfy this strong notion of replicability when the randomness is not shared across executions.
This shared random string models the random seed of a learning algorithm in practice, and sharing internal randomness can be easily implemented by sharing said random seed. Closer to our work, Impagliazzo et al. [2022], Ghazi et al. [2021b], Bun et al. [2023], Kalavasis et al. [2023] established strong statistical connections between replicability and other notions of algorithmic stability and learning paradigms such as differential privacy (DP), statistical queries (SQ), and online learning. Although several of these works provided various computational results, these connections are not as well understood as the statistical ones.

1.1 Our Contributions

In this work, we aim to shed further light on the aforementioned computational connections. In particular, we provide both negative and positive results. Negative results are manifested through computational separations, while positive results are presented through computational transformations, as we will see shortly. We emphasize that whenever we work within the PAC learning framework we consider the realizable setting.

Replicability & Online Learning. The results of Ghazi et al. [2021b], Bun et al. [2023], Kalavasis et al. [2023] established a statistical connection between replicability and online learning. In particular, in the context of PAC learning, these works essentially show that replicable PAC learning is statistically equivalent to online learning, since learnability in both settings is characterized by the finiteness of the Littlestone dimension of the underlying concept class. From the above, the first natural question is the following:

Q1. How does replicability computationally relate to online learning?

We show that under standard cryptographic assumptions, efficient replicability is separated from efficient online learning.

Theorem 1.2 (Informal, see Theorem 2.1). Assuming the existence of one-way functions, there is a concept class that is replicably PAC learnable in polynomial time, but is not efficiently online learnable by any no-regret algorithm.

In order to prove the above result, we provide an efficient replicable PAC learner for the concept class of One-Way Sequences $\mathrm{OWS}$ over $\{0,1\}^d$, introduced by Blum [1994]. This algorithm, which relies on a novel replicable subroutine for quantile estimation (cf. Theorem B.6), combined with the cryptographic hardness result of Bun [2020] for online learnability, gives the desired computational separation. For further details, we refer to Section 2.

Replicability & SQ. The Statistical Query (SQ) framework is an expressive model of statistical learning introduced by Kearns [1998]. This model of learning is a restricted class of algorithms that is only permitted indirect access to samples through approximations of certain carefully chosen functions of the data (statistical queries). The motivation behind the introduction of the SQ framework was to capture a large class of noise-resistant learning algorithms. As such, any SQ algorithm enjoys some inherent stability properties. Since replicability also captures some notion of algorithmic stability, it is meaningful to ask:

Q2. How does replicability computationally relate to SQ learning?

For some background about the SQ framework, we refer the reader to Appendix C.2. A manifestation of the intuitive connection between SQ and replicability can be found in one of the main results of Impagliazzo et al. [2022], which states that any efficient SQ algorithm can be efficiently made replicable.
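Both this transformation and the algorithms in this paper rely on the shared-seed semantics of Definition 1.1, which can be made operational as follows. The snippet below is a small hypothetical Monte Carlo harness (our own illustration; `alg` and `draw_sample` are placeholder callables, not objects from the paper) that estimates the probability that an algorithm's outputs disagree across two independent samples run with the same seed; for a $\rho$-replicable algorithm this rate should stay below $\rho$.

```python
def replicability_failure_rate(alg, draw_sample, n, trials=1000):
    """Estimate Pr[alg(S, r) != alg(S', r)] when the random string r is shared.

    `alg(sample, seed)` is any learning algorithm whose internal randomness
    is fully determined by `seed`; `draw_sample(n, tag)` returns n i.i.d.
    examples. Disagreements are counted only between runs that use
    independent samples but the SAME seed, exactly as in Definition 1.1.
    """
    disagreements = 0
    for t in range(trials):
        S = draw_sample(n, 2 * t)            # first i.i.d. sample
        S_prime = draw_sample(n, 2 * t + 1)  # second, independent sample
        disagreements += alg(S, seed=t) != alg(S_prime, seed=t)
    return disagreements / trials
```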
In this paper, we investigate the other direction of this connection. A particularly interesting computational separation between (distribution-specific) replicability and SQ learning was noticed by Impagliazzo et al. [2022] in the context of realizable binary classification: if we consider the uniform distribution $U$ over the Boolean hypercube $\{0,1\}^d$, then the concept class of parities is SQ-hard to learn under $U$ [Kearns, 1998] but admits an efficient PAC learner that is replicable under $U$. Indeed, with high probability over the random draw of the data, Gaussian elimination, which is the standard algorithm for PAC learning parities under $U$, gives a unique solution and, as a result, is replicable. Based on this observation, Impagliazzo et al. [2022] posed as an interesting question whether parities are efficiently learnable by a replicable learner under other marginal distributions, for which Gaussian elimination is unstable, i.e., it fails to give a unique solution (see, e.g., Proposition C.4).

Transforming Distribution-Specific Replicable Learners. Inspired by the question of Impagliazzo et al. [2022], we ask the following more general question for realizable binary classification:

Q3. For a class $\mathcal{C} \subseteq \{0,1\}^{\mathcal{X}}$ and marginal distributions $D_1, D_2$ over $\mathcal{X}$, how does replicable PAC learnability of $\mathcal{C}$ under $D_1$ and under $D_2$ relate to one another computationally?

Our main result is a general black-box transformation from a replicable PAC learner under the uniform distribution over $\mathcal{X} = \{0,1\}^d$ to a replicable PAC learner under some unknown marginal distribution $D$. The runtime of the transformation depends on the decision tree complexity of the distribution $D$, a complexity measure that comes from the recent work of Blanc et al. [2023] and is defined as follows:

Definition 1.3 (Decision Tree Complexity; Blanc et al., 2023). The decision tree complexity of a distribution $D$ over $\{0,1\}^d$ is the smallest integer $\ell$ such that its probability mass function (pmf) can be computed by a depth-$\ell$ decision tree (cf. Definition D.1).

A decision tree (cf. Definition D.1) $T : \{0,1\}^d \to \mathbb{R}$ is a binary tree whose internal nodes query a particular coordinate $x_i$, $i \in [d]$ (descending left if $x_i = 0$ and right otherwise) and whose leaves are labeled by real values. Each $x \in \{0,1\}^d$ follows a unique root-to-leaf path based on the queried coordinates of internal nodes, and its value $T(x)$ is the value stored at the leaf it reaches. For some $D$ over the Boolean hypercube, we say $D$ is computed by a decision tree $T$ of depth $\ell$ if, for any $x \in \{0,1\}^d$, it holds that $D(x) = T(x)$. As an example, note that the uniform distribution requires depth $\ell = 1$ since its pmf is constant, while a distribution whose pmf takes $2^d$ different values requires depth $\ell = d$. However, various natural (structured) distributions have tree complexity much smaller than $d$. Importantly, the running time overhead of our lifting approach scales proportionally to the decision tree complexity of $D$ and hence can be used to obtain novel efficient replicable PAC learners. Our replicable lifting framework informally states the following:

Theorem 1.4 (Informal, see Theorem 3.2). Let $\mathcal{X} = \{0,1\}^d$ and $m = \mathrm{poly}(d)$. Consider a concept class $\mathcal{C} \subseteq \{0,1\}^{\mathcal{X}}$ and assume that $\mathcal{A}$ is a PAC learner for $\mathcal{C}$ that is $m$-sample replicable under the uniform distribution and runs in time $\mathrm{poly}(m)$. Then, for any (unknown) monotone[1] distribution $D$ with decision tree complexity $\ell$, there exists an algorithm $\mathcal{A}'$ that is a PAC learner for $\mathcal{C}$, is $(m\ell)^{\ell}$-sample replicable under $D$, and runs in time $\mathrm{poly}((m\ell)^{\ell})$.
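To make Definition 1.3 concrete before turning to the theorem's implications, here is a minimal sketch (the nested-tuple encoding is our own, purely illustrative) of a pmf computed by a shallow decision tree:

```python
from typing import Union

# A tree is either a leaf (a float: the pmf value on that subcube) or an
# internal node (i, left, right) that queries coordinate i of x in {0,1}^d.
Tree = Union[float, tuple]

def pmf(tree: Tree, x: tuple) -> float:
    """Evaluate D(x) by walking the root-to-leaf path determined by x."""
    while not isinstance(tree, float):
        i, left, right = tree
        tree = left if x[i] == 0 else right
    return tree

# Depth-1 example over {0,1}^3: D puts mass 1/12 on each point with x_0 = 0
# and mass 1/6 on each point with x_0 = 1; the two subcubes have 4 points each.
tree = (0, 1/12, 1/6)
total = sum(pmf(tree, (a, b, c)) for a in (0, 1) for b in (0, 1) for c in (0, 1))
assert abs(total - 1) < 1e-9   # a valid pmf: the masses sum to 1
```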
Theorem 1.4 implies, for instance, that for $\ell = O(1)$ the blowup is polynomial in the runtime of the uniform learner, while if $\ell = O(\log d)$ the running time is quasi-polynomial in $d$. We note that our result can also be extended to general (non-monotone) distributions $D$ if we assume access to a (subcube) conditional sampling oracle. For formal details, we refer the reader to Section 3. As an application of this framework, we show how to black-box transform replicable learners for parities that work under the uniform distribution to replicable learners that work under some other unknown distribution $D$, where the sample complexity and running time of the transformation depend on the decision tree complexity of $D$. This result is hence related to the question of Impagliazzo et al. [2022], since we can design distributions $D$ over $\{0,1\}^d$ with small decision tree complexity for which Gaussian elimination fails to be replicable. We refer to Section 4 for the formal statement.

Corollary 1.5 (Informal, see Corollary 4.3 and Theorem C.5). Let $\mathcal{X} = \{0,1\}^d$ and $m = \mathrm{poly}(d)$. Then, for any (unknown) monotone distribution $D$ with decision tree complexity $\ell$, there exists a PAC learner for parities that is $(m\ell)^{\ell}$-sample replicable under $D$ and runs in time $\mathrm{poly}((m\ell)^{\ell})$. Moreover, for any $m' = m^{\Theta(1)}$, there exists some monotone distribution $D$ over $\{0,1\}^d$ with decision tree complexity $\ell = \Theta(1)$ so that Gaussian elimination, with input $m'$ labeled examples, fails to PAC learn parities replicably under $D$ with constant probability.

[1] Recall that $D$ over $\{0,1\}^d$ is monotone if whenever $x \preceq y$ in the partial order of the poset, we have $D(x) \leq D(y)$.

Replicability & Privacy. Returning to the works of Ghazi et al. [2021b], Bun et al. [2023], Kalavasis et al. [2023], it is known that replicable PAC learning is statistically equivalent to approximate differentially private PAC learning. We hence ask:

Q4. How does replicability computationally relate to private learning?

The recent work of Bun et al. [2023] provided a computational separation between approximate differential privacy and replicability. In particular, they showed that a replicable learner can be efficiently transformed into an approximately differentially private one, but the other direction is hard under standard cryptographic assumptions. To be more specific, they provided a concept class that is efficiently learnable by an approximate DP learner, but not efficiently learnable by a replicable learner, assuming the existence of one-way functions. Based on this hardness result, we ask whether we can transform a pure DP algorithm into a replicable one. We show the following result.

Theorem 1.6 (Informal; see Theorem 5.2). Let $D_{XY}$ be a distribution on $\mathcal{X} \times \{0,1\}$ that is realizable with respect to some concept class $\mathcal{C}$. Let $\mathcal{A}$ be an efficient pure DP learner for $\mathcal{C}$. Then, there is a replicable learner $\mathcal{A}'$ for $\mathcal{C}$ that runs in time polynomial with respect to the error, confidence, and replicability parameters, but exponential in the representation dimension[2] of $\mathcal{C}$.

We reiterate that this transformation is efficient with respect to the correctness, confidence, and replicability parameters $\alpha, \beta, \rho$, but it could be inefficient with respect to some parameter that captures the complexity of the underlying concept class $\mathcal{C}$ (such as the representation dimension of the class, e.g., the margin parameter in the case of large-margin halfspaces). To the best of our knowledge, the same holds for the transformation of a pure DP learner to an online learner of Gonen et al.
[2019], and it would be interesting to show that this is unavoidable.

The Computational Landscape of Stability. Our work studies several computational aspects of replicability and its connections with other stability notions such as online learning, SQ learning, and differential privacy. Combining our results with the prior works of Blum et al. [2005], Gonen et al. [2019], Ghazi et al. [2021b], Impagliazzo et al. [2022], Bun et al. [2023], Kalavasis et al. [2023] yields the current computational landscape of stability depicted below.

[Figure 1.1: a diagram with nodes Replicability, Online, SQ, $(\varepsilon, \delta)$-DP, and Pure DP, whose arrows are labeled by [Blum et al., 2005], [Gonen et al., 2019], [Impagliazzo et al., 2022], [Bun, 2020], [Bun et al., 2023], [Bun et al., 2024], [Ghazi et al., 2021b; Bun et al., 2023; Kalavasis et al., 2023], "by definition", and this work (Theorem 2.1, Corollary 1.5, Theorem 5.2).]

Figure 1.1: The computational landscape of stability. A green double arrow ($\Rightarrow$) from tail to head indicates that an efficient learner for a task in the setting of the arrow tail can be black-box transformed into an efficient learner for the same task in the setting of the arrow head. Meanwhile, an orange dashed double arrow ($\dashrightarrow$) from tail to head indicates that an efficient learner for a task in the tail setting can be black-box transformed into an efficient learner for the same task in the head setting, under some additional assumptions. Finally, a red slashed single arrow ($\nrightarrow$) from tail to head indicates that there is a learning task for which an efficient learner exists in the setting of the arrow tail but no efficient learner can exist in the setting of the arrow head, possibly under some cryptographic assumptions.

[2] The representation dimension is a combinatorial dimension, similar to VC dimension, that characterizes which classes are PAC learnable by pure DP algorithms [Beimel et al., 2013].

The computational landscape of stable learning depicted in Figure 1.1 is a byproduct of the following results connecting (i) replicability, (ii) approximate DP, (iii) pure DP, (iv) online learning, and (v) statistical queries.

Black-Box Transformations.

1. Pure DP can be efficiently[3] transformed to Online [Gonen et al., 2019].
2. Replicability can be efficiently transformed to Approximate DP [Ghazi et al., 2021b, Bun et al., 2023, Kalavasis et al., 2023].
3. SQ can be efficiently transformed to Approximate DP [Blum et al., 2005].[4]
4. SQ can be efficiently transformed to Replicable [Impagliazzo et al., 2022].
5. Pure DP can be efficiently[5] transformed to Replicable (this work, Theorem 5.2).

Caveats of Transformations. The transformations from pure $\varepsilon$-DP learners to replicable and online learners may incur computation time exponential in the representation dimension [Beimel et al., 2013] of the underlying hypothesis class. Regarding the reduction from approximate DP to replicability: the transformation provided by Bun et al. [2023] uses correlated sampling, which necessitates that the output space of the algorithm is finite. To be more precise, based on [Bun et al., 2023], there is an efficient transformation from approximate DP to perfectly generalizing algorithms. Next, the authors use correlated sampling to obtain a replicable learner as follows. Given a perfectly generalizing algorithm $\mathcal{A}$ and sample $S$, the correlated sampling strategy is applied to the distribution of outputs of $\mathcal{A}(S)$. Hence, the output space of $\mathcal{A}$ should be finite.
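To illustrate the mechanism, here is a minimal sketch (our own simplification, not the exact procedure of Bun et al. [2023]) of correlated sampling over a finite output space, implemented as rejection sampling driven entirely by a shared random tape: two executions that share the seed and hold similar output distributions accept at the same step, and hence return the same hypothesis, with high probability.

```python
import numpy as np

def correlated_sample(pmf, outputs, seed):
    """Sample from `pmf` over the finite list `outputs` by rejection
    sampling driven entirely by a shared random tape.

    Two executions that share `seed` walk through identical proposals
    (h, u); if their pmfs are close, they accept at the same step and
    hence return the exact same hypothesis with high probability.
    """
    rng = np.random.default_rng(seed)            # the shared random string r
    while True:
        h = outputs[rng.integers(len(outputs))]  # uniform proposal
        if rng.uniform() <= pmf(h):              # accept w.p. pmf(h) <= 1
            return h
```

A uniform proposal is accepted with probability $1/|\text{outputs}|$ in expectation, so the expected number of iterations grows linearly with the size of the output space; this is precisely the source of the inefficiency discussed next.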
In the PAC learning setting, to ensure that the algorithm has a finite output space, one sufficient condition is that the domain $\mathcal{X}$ is finite. The correlated sampling step can be explicitly implemented via rejection sampling from the output space of $\mathcal{A}$. The acceptance probability is controlled by the probability mass function of $\mathcal{A}(S)$. As a result, in general, it is not computationally efficient. For instance, if the finite input space to the correlated sampling strategy is $\{0,1\}^d$, then the runtime of the algorithm could be $\exp(d)$, since the acceptance probability is exponentially small in the dimension in the worst case. For the specific case of PAC learning, there is another transformation from approximate DP to replicability that holds for countable domains $\mathcal{X}$, proposed by Kalavasis et al. [2023], but this approach goes through the Littlestone dimension of the class and might not even be computable in its general form.

Separations. There is a concept class that can be learned efficiently

1. by a replicable PAC learner (this work, Theorem 2.1), but not an efficient online one under OWF [Blum, 1994];
2. by an approximate DP PAC learner, but not an efficient online one under OWF [Bun, 2020];
3. by an online learner, but not an efficient approximate DP PAC one under cryptographic assumptions[6] [Bun et al., 2024];
4. by a replicable PAC learner under uniform marginals (Impagliazzo et al. [2022]) or more general marginals (this work, Corollary 1.5), but not an efficient SQ one;
5. by an approximate DP PAC learner, but not an efficient replicable one under OWF [Bun et al., 2023].

[3] The transformation is efficient with respect to the correctness and confidence parameters but could be inefficient with respect to other parameters of the class such as the representation dimension.
[4] The reduction is efficient as it simply requires adding Gaussian noise to each statistical query. However, it is impractical for most tasks of interest as the resulting privacy guarantees are relatively weak.
[5] The transformation is efficient with respect to the correctness and confidence parameters but could be inefficient with respect to other parameters of the class.
[6] The specific assumptions are technical and we omit the precise statements. Roughly, the assumptions lead to the possibility of building indistinguishability obfuscation for all circuits.

1.2 Related Work

Replicability. Pioneered by Impagliazzo et al. [2022], there has been a growing interest from the learning theory community in studying replicability as an algorithmic property. Esfandiari et al. [2023a,b] studied replicable algorithms in the context of multi-armed bandits and clustering. Later, Eaton et al. [2023], Karbasi et al. [2023] studied replicability in the context of Reinforcement Learning (RL) and designed algorithms that achieve various notions of replicability. Recently, Bun et al. [2023] established statistical equivalences and separations between replicability and other notions of algorithmic stability, such as differential privacy, when the domain of the learning problem is finite, and provided some computational and statistical hardness results to obtain these equivalences, under cryptographic assumptions. Subsequently, Kalavasis et al. [2023] proposed a relaxation of the replicability definition of Impagliazzo et al. [2022], showed its statistical equivalence to the notion of replicability for countable domains,[7] and extended some of the equivalences from Bun et al. [2023] to countable domains. Chase et al. [2023b], Dixon et al.
[2023] proposed a notion of list-replicability, where the randomness is not shared across the executions of the algorithm, but the outputs are required to belong to a list of small cardinality instead of being identical. Both of these works developed algorithms that work in the realizable setting, and later Chase et al. [2023a] showed that, surprisingly, it is impossible to design list-replicable learning algorithms for infinite classes in the agnostic setting. Recently, Moran et al. [2023] established even more statistical connections between replicability and other notions of stability, by dividing them into two categories consisting of distribution-dependent and distribution-independent definitions. Recent work by Kalavasis et al. [2024] studies replicable algorithms for large-margin halfspaces.

Computational Separations & Transformations in Stability. The seminal result of Blum [1994] illustrated a computational separation between PAC and online learning. Later, Bun [2020] obtained a similar separation between private PAC and online learning. More recently, Bun et al. [2023] showed that there exists a concept class that admits an efficient approximate DP learner but not an efficient replicable one, assuming the existence of OWFs. Contributing to this line of work, our Theorem 2.1 is of a similar flavor. Shifting our attention to SQ learning, there is an intuitive similarity between learning from noisy examples and private learning: algorithms for both problems must be robust to small variations in the data (see also Blum et al. [2005]). Parities are a canonical SQ-hard problem, meaning that no efficient SQ algorithm for this class exists. Kasiviswanathan et al. [2011] designed an efficient private learner for learning parities, which dispels the similarity between learning with noise and private learning. Georgiev and Hopkins [2022] showed that while polynomial-time private algorithms for learning parities are known, the failure probability of any such algorithm must be larger than what can be achieved in exponential time, or else NP = RP. Finally, recent work by Bun et al. [2024] gives a concept class that admits an online learner running in polynomial time with a polynomial mistake bound, but for which there is no computationally efficient approximate differentially private PAC learner, under cryptographic assumptions. Moving on to transformations, our Theorem 1.4 builds upon the result of Blanc et al. [2023], who designed a framework that transforms computationally efficient algorithms under uniform marginal distributions into algorithms that work under some other distribution, where the complexity of the transformation scales with some particular notion of distance between the two distributions. Essentially, our result can be viewed as a replicable framework of the same flavor, with a small additional computational and statistical overhead that scales with the replicability parameter. Finally, our approach to transforming a pure DP learner into a replicable learner (cf. Theorem 1.6) is inspired by Gonen et al. [2019], who provided a transformation from pure DP learners to online learners.

1.3 Notation

In general, we use $\mathcal{A}$ to denote an algorithm. For unsupervised problems, we usually denote by $D$ the distribution over input examples. In the case of supervised problems, we use $\mathcal{C}$ to denote the concept class in question, $D$ to denote the marginal distribution over the feature domain $\mathcal{X}$, and $D_{XY}$ to refer to the joint distribution over labeled examples $(x, y) \in \mathcal{X} \times \mathcal{Y}$.
Throughout our work, we use $\alpha, \beta$ to refer to the error and failure probability parameters of the algorithm, $\varepsilon, \delta$ to denote the approximate DP parameters, and $\rho$ for the replicability parameter. In most cases, the feature domain $\mathcal{X}$ is a subset of a high-dimensional space and we use $d$ to denote the dimension of that space.

[7] We remark that this equivalence for finite domains can also be obtained, implicitly, from the results of Bun et al. [2023]. Connections between replicability and the Littlestone dimension go back to Ghazi et al. [2021b].

2 Efficient Replicability and Online Learning

Our first main result shows that replicability and online learning are not computationally equivalent, assuming the existence of one-way functions. This hardness result is based on a construction from Blum [1994], who defined a concept class, denoted by $\mathrm{OWS}$, which is efficiently PAC learnable but not online learnable in polynomial time, assuming the existence of one-way functions. Blum's construction builds upon the Goldreich-Goldwasser-Micali pseudorandom function generator [Goldreich et al., 1986] to define families of one-way labeled sequences $(\sigma_1, b_1), \ldots, (\sigma_r, b_r) \in \{0,1\}^d \times \{0,1\}$, for some $r = \omega(\mathrm{poly}(d))$. These string-label pairs can be efficiently computed in the forward direction, but are hard to compute in the reverse direction. Specifically, for any $i < j$, it is easy to compute $\sigma_j, b_j$ given $\sigma_i$. On the other hand, it is hard to compute $\sigma_i, b_i$ given $\sigma_j$. The essence of the difficulty in the online setting is that an adversary can present to the learner the sequence in reverse order, i.e., $\sigma_r, \sigma_{r-1}, \ldots, \sigma_1$. Then, the labels $b_i$ are not predictable by a polynomial-time learner. However, in the PAC setting, for any distribution over the sequence $\{(\sigma_i, b_i)\}_{i \in [r]}$, a learner which is given $n$ labeled examples can identify the string $\sigma_i$ with the smallest index $i$ in the sample. Then, it can perfectly and efficiently predict the label of any string that comes after it. This approximately amounts to a $(1 - 1/n)$ fraction of the underlying population. It is not hard to see that this PAC learner for $\mathrm{OWS}$ is not replicable, since the minimum index of a sample can vary wildly between samples. The high-level idea of our approach to making the algorithm replicable is to show that we can replicably identify an approximate minimum index and output the hypothesis that (efficiently) forward-computes from that index. Our main result, proven in Appendix B, is as follows.

Theorem 2.1. Let $d \in \mathbb{N}$. The following hold:

(a) [Bun, 2020] Assuming the existence of one-way functions, the concept class $\mathrm{OWS}$ with input domain $\{0,1\}^d$ cannot be learned by an efficient no-regret algorithm, i.e., an algorithm that for some $\eta > 0$ achieves expected regret $\mathbb{E}[R_T] \leq \mathrm{poly}(d) \cdot T^{1-\eta}$ using time $\mathrm{poly}(d, T)$ in every iteration.[8]

(b) (Theorem B.7) The concept class $\mathrm{OWS}$ with input domain $\{0,1\}^d$ can be learned by a $\rho$-replicable $(\alpha, \beta)$-PAC learner with sample complexity $m = \mathrm{poly}(d, 1/\alpha, 1/\rho, \log(1/\beta))$ and $\mathrm{poly}(m)$ running time.

[8] We remark that this stronger requirement in regret is necessary since the trivial random guessing algorithm achieves $o(T)$ regret in finite domains (cf. Appendix A.3).

3 Lifting Replicable Uniform Learners

In this section, we present our replicable lifting framework in further detail. First, we will need the following technical definition.

Definition 3.1 (Closed under Restrictions). A concept class $\mathcal{C}$ of functions $f : \{0,1\}^d \to \{0,1\}$ is closed under restrictions if, for any $f \in \mathcal{C}$, $i \in [d]$, and $b \in \{0,1\}$, the restriction $f_{i=b}$ remains in $\mathcal{C}$, where $f_{i=b}(x) = f(x_1, \ldots, x_{i-1}, b, x_{i+1}, \ldots, x_d)$ for any $x \in \{0,1\}^d$.
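As a small self-contained illustration of Definition 3.1 (our own sketch), restriction simply hard-wires one input coordinate; note how restricting a parity yields an affine parity, a point we return to in Section 4:

```python
def restrict(f, i, b):
    """Return the restriction f_{i=b}: coordinate i of the input is fixed to b."""
    def f_restricted(x):                 # x is a tuple in {0,1}^d
        return f(x[:i] + (b,) + x[i + 1:])
    return f_restricted

# Affine parities (Section 4) are closed under restrictions: restricting the
# parity over S = {0, 2} at coordinate 2 yields the affine parity x_0 + 1.
parity = lambda x: (x[0] + x[2]) % 2
g = restrict(parity, 2, 1)               # g(x) = (x[0] + 1) % 2
assert g((1, 0, 0)) == 0 and g((0, 1, 1)) == 1
```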
Our replicable lifting result works for concept classes that satisfy the closedness under restrictions property. Also, recall that a distribution $D$ over $\{0,1\}^d$ is monotone (Definition D.4) if whenever $x \succeq y$ ($x$ is greater than $y$ in the partial ordering of the poset), it holds that $D(x) \geq D(y)$. Our algorithm for lifting replicable uniform learners to replicable learners under some unknown distribution $D$ is presented in Theorem 3.2. This lifting approach is inspired by the work of Blanc et al. [2023], who designed the non-replicable variant of this transformation. We note that our algorithm, similar to the algorithm of Blanc et al. [2023], only requires sample access to $D$ for monotone distributions, while, for arbitrary non-monotone probability measures, our algorithm requires access to a conditional sampling oracle (cf. Definition D.5). Our replicable lifting theorem reads as follows.

Theorem 3.2 (Lifting Replicable Uniform Learners). Consider a concept class $\mathcal{C}$ of functions $f : \{0,1\}^d \to \{0,1\}$ closed under restrictions. Suppose we are given black-box access to an algorithm $\mathcal{A}$ such that for any $\alpha', \rho', \beta' \in (0,1)$, given $\mathrm{poly}(d, 1/\alpha', 1/\rho', \log(1/\beta'))$ samples from $U$, (i) $\mathcal{A}$ is $\rho'$-replicable with respect to the uniform distribution $U$, (ii) $\mathcal{A}$ PAC learns $\mathcal{C}$ under the uniform distribution to accuracy $\alpha'$ and confidence $\beta'$, and (iii) $\mathcal{A}$ terminates in time $\mathrm{poly}(d, 1/\alpha', 1/\rho', \log(1/\beta'))$. Let $\alpha, \rho \in (0,1)$ and $\beta \in (0, \rho/3)$. For $m = \mathrm{poly}(d, 1/\alpha, 1/\rho, \log(1/\beta))$ and $M = m^{O(\ell)}$, the following cases hold:

(a) If $D$ is a monotone distribution over $\{0,1\}^d$ representable by a depth-$\ell$ decision tree, there is an algorithm that draws $M$ samples from $D$, is $\rho$-replicable with respect to $D$, PAC learns $\mathcal{C}$ under $D$ with accuracy $\alpha$ and confidence $\beta$, and terminates in $\mathrm{poly}(M)$ time.

(b) If $D$ is an arbitrary distribution over $\{0,1\}^d$ representable by a depth-$\ell$ decision tree, there is an algorithm that draws $M$ labeled examples as well as $M$ conditional samples (cf. Definition D.5) from $D$, is $\rho$-replicable with respect to $D$, PAC learns $\mathcal{C}$ with respect to $D$ with accuracy $\alpha$ and confidence $\beta$, and terminates in $\mathrm{poly}(M)$ time.

The high-level idea of the reduction proceeds as follows. Let us consider the realizable PAC setting, i.e., the labels are consistent with some $f^{\star} \in \mathcal{C}$. Let us assume black-box access to a replicable uniform learner $\mathcal{A}^{U}_{\mathcal{C}}$ for $\mathcal{C}$ and access to i.i.d. samples of the form $(x, f^{\star}(x))$, where $x \sim D$. Our goal is to (efficiently) obtain an algorithm $\mathcal{A}^{D}_{\mathcal{C}}$ that achieves small misclassification error with respect to the unknown distribution $D$ and target $f^{\star}$ and is also replicable under $D$. The promise is that $D$ has a decision tree representation of depth $\ell$.[9] The first step is to draw enough samples $(x, f^{\star}(x))$, $x \sim D$, and compute the decision tree representation of $D$. This is an unsupervised learning task and it only uses the feature vectors of the training set. We design a replicable algorithm for this step (cf. Theorem E.7), which could be of independent interest. The next observation is crucial: any discrete distribution on $\{0,1\}^d$ can be expressed as a mixture of uniform distributions conditioned on non-overlapping subcubes. To see this, notice that any root-to-leaf path in the estimated decision tree representation corresponds to a subcube of $\{0,1\}^d$ and, conditioned on this path, the remaining coordinates follow a uniform law.
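The following sketch (our own illustration, reusing the hypothetical tuple encoding of trees from Section 1) makes this decomposition concrete: each root-to-leaf path fixes the queried coordinates and carries the mass of its subcube, so sampling from $D$ reduces to picking a leaf proportionally to that mass and filling the free coordinates uniformly.

```python
import numpy as np

def leaves(tree, prefix=(), d=3):
    """Yield (subcube_mass, fixed_coords) for each root-to-leaf path.

    A leaf stores the constant pmf value on its subcube, which contains
    2^(d - #fixed coordinates) points, hence the mass computed below.
    """
    if isinstance(tree, float):
        yield tree * 2 ** (d - len(prefix)), dict(prefix)
    else:
        i, left, right = tree
        yield from leaves(left, prefix + ((i, 0),), d)
        yield from leaves(right, prefix + ((i, 1),), d)

def sample(tree, d, rng):
    """Draw x ~ D: pick a leaf by its subcube mass, fill free bits uniformly."""
    masses, cubes = zip(*leaves(tree, d=d))
    cube = cubes[rng.choice(len(cubes), p=list(masses))]
    return tuple(cube[i] if i in cube else int(rng.integers(2)) for i in range(d))

# Depth-1 tree from the earlier sketch: D(x) = 1/12 if x_0 = 0, else 1/6.
rng = np.random.default_rng(0)
print([sample((0, 1/12, 1/6), d=3, rng=rng) for _ in range(3)])
```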
Hence, after an appropriate re-sampling procedure, one can employ the black-box replicable learner $\mathcal{A}^{U}_{\mathcal{C}}$ on any one of the leaves $t$ of the tree decomposition of $D$ and obtain a classifier $f_t$. For this step, the fact that $\mathcal{C}$ is closed under restrictions is crucial. Intuitively, if we wish to implement this idea in a replicable manner and need overall replicability parameter $\rho$, it suffices to use the uniform replicable algorithm in each leaf with parameter $O(\rho/2^{\ell})$, since we make at most $2^{\ell}$ calls to the uniform PAC learner (the decision tree complexity of the target is $\ell$). Finally, given a test example $x \sim D$, one computes the leaf $t$ that corresponds to the subcube that $x$ falls into and uses the (replicable) output $f_t$ of the associated uniform PAC learner to guess the correct label. For the formal analysis, we refer to Appendix D and Appendix E.

4 Efficient Replicability and SQ Learning: Parities

In this section, we provide an application of the general lifting framework we described in Section 3. One of the main results of the seminal work of Impagliazzo et al. [2022] is that any SQ algorithm can be made replicable. This result allows a great collection of tasks to be solved replicably since the SQ framework is known to be highly expressive [Kearns, 1998]. The primary motivation of this section comes from the question of Impagliazzo et al. [2022] on whether the class of parities can be PAC learned by an efficient replicable algorithm when the marginal distribution $D$ is not uniform over $\{0,1\}^d$. Let us define our concept class of interest.[10]

[9] By performing an iterated doubling if necessary, we may assume without loss of generality that $\ell$ is known.
[10] We consider a superset of parities, which we call affine parities, as we would like our concept class to be closed under restrictions (cf. Definition 3.1).

Definition 4.1 (Affine Parities). Let $S \subseteq [d]$ be some subset of $[d]$ and $b \in \{0,1\}$ a bias term. Define $f_{S,b} : \{0,1\}^d \to \{0,1\}$ to be the biased parity of the bits in $S$, namely $f_{S,b}(x) = b + \sum_{i \in S} x_i \pmod 2$. The concept class of affine parities over $\{0,1\}^d$ is the set $\mathcal{C} = \{f_{S,b} : S \subseteq [d], b \in \{0,1\}\}$. For any distribution $D$ over $\{0,1\}^d$, we consider the supervised learning problem AffParity, where the learner observes i.i.d. samples of the form $(x, f^{\star}(x))$, where $x \sim D$ and $f^{\star} \in \mathcal{C}$.

As observed by Impagliazzo et al. [2022], there is a replicable algorithm for PAC learning parities (i.e., the subclass Parity obtained by setting $b = 0$ in AffParity) under the uniform distribution: draw roughly $O(d)$ samples so that with high probability the dataset will contain a basis. Then, in two distinct executions, the sets that standard Gaussian elimination outputs will be the same. A similar approach works for the class of affine parities and is described below.

Lemma 4.2. The concept class of affine parities AffParity over $\{0,1\}^d$ admits a $\rho$-replicable algorithm that perfectly learns any concept with respect to the uniform distribution $U$ with probability of success at least $1 - \beta$. The algorithm has $\mathrm{poly}(d, \log(1/\rho\beta))$ sample and time complexity.

The algorithm that attains the guarantees of Lemma 4.2 is a simple adaptation of Gaussian elimination for learning standard parities. We provide the pseudocode in Algorithm C.1 and defer the proof of correctness to Appendix C.1.

Application of Theorem 3.2. Since the class of affine parities over $\{0,1\}^d$ is closed under restrictions (cf. Lemma C.8) and there is a learner for this class under uniform marginals that is replicable and efficient (cf.
Lemma 4.2), we can obtain an algorithm that replicably PAC learns the class AffParity under more general distributions. In particular, the following result is an immediate application of our lifting framework.

Corollary 4.3. Let $\alpha, \rho \in (0,1)$ and $\beta \in (0, \rho/3)$. Let $\mathcal{X} = \{0,1\}^d$. For $M = \mathrm{poly}(d, 1/\alpha, 1/\rho, 1/\beta)^{O(\ell)}$, the following cases hold:

(a) For any (unknown) monotone distribution $D$ over $\mathcal{X}$ with decision tree complexity $\ell$, there exists an algorithm that is a $\rho$-replicable learner for the concept class AffParity under $D$, and requires $M$ samples and running time $\mathrm{poly}(M)$ to get accuracy $\alpha$ and confidence $\beta$.

(b) For any distribution $D$ over $\mathcal{X}$ with decision tree complexity $\ell$, there exists an algorithm that is a $\rho$-replicable learner for the concept class AffParity under $D$, and requires $M$ labeled examples, $M$ conditional samples from $D$, and running time $\mathrm{poly}(M)$ to get accuracy $\alpha$ and confidence $\beta$.

Back to the Question of Impagliazzo et al. [2022]. Impagliazzo et al. [2022] raised the question of when parities over $\{0,1\}^d$ can be efficiently PAC learned by a replicable algorithm under some marginal distribution $D$ that is not uniform or, more broadly, that causes Gaussian elimination to be non-replicable. Our Corollary 4.3 makes progress on this question. First, we note that in our setting, we do not require knowledge of $D$. Second, one can design examples of (monotone) distributions for which our lifting framework produces a replicable polynomial-time PAC learner but naive Gaussian elimination[11] fails to be replicable with constant probability (cf. Theorem C.5). Note that even PAC learning parities for the distribution described in Theorem C.5 is SQ-hard. We believe that our lifting framework can be seen as a systematic way of bypassing some instabilities arising from the algebraic structure of standard Gaussian elimination. As a final remark, we note that one could potentially design much more complicated (still monotone) distributions than the one of Theorem C.5, where even various adaptations of Gaussian elimination (e.g., pre-processing of the dataset, data deletion) would fail to be replicable but our lifting framework would guarantee replicability (and efficiency, provided small decision tree complexity). On the other hand, it is not evident whether parity learning remains SQ-hard under these harder distributions.

[11] We note that the examples we design are difficult instances for the standard Gaussian elimination algorithm, but some pre-processing of the dataset (e.g., deleting some constant fraction of samples) could potentially make Gaussian elimination replicable.

5 Efficient Replicability and Private Learning

In this section, we study connections between efficient DP learnability and efficient replicable learnability of a concept class. As we mentioned in the introduction, Bun et al. [2023] and subsequently Kalavasis et al. [2023] established a statistical equivalence between approximate DP learnability and replicable learnability of a concept class when the domain is finite [Bun et al., 2023] or countable [Kalavasis et al., 2023]. Moreover, Bun et al. [2023] showed that this equivalence does not hold when one takes into account the computational complexity of these tasks.

Proposition 5.1 (Section 4.1 in Bun et al. [2023]). There exists a class that is efficiently PAC learnable by an approximate DP algorithm but, assuming one-way functions exist, it cannot be learned efficiently by a replicable algorithm.

We remark that there is an efficient converse transformation, i.e., a computationally efficient transformation from a replicable learner to an approximate DP learner [Bun et al., 2023]. Inspired by Gonen et al. [2019], we ask whether one can transform a pure DP learner to a replicable one. Our main result, proven in Appendix F, is the following.
Theorem 5.2 (From Pure DP Learner to Replicable Learner). Let $\mathcal{X}$ be some input domain, $\mathcal{Y} = \{0,1\}$, and $D_{XY}$ be a distribution on $\mathcal{X} \times \mathcal{Y}$ that is realizable with respect to some concept class $\mathcal{C}$. Let $\mathcal{A}$ be a pure DP learner that, for any $\alpha, \varepsilon, \beta \in (0,1)$, needs $m(\alpha, \varepsilon, \beta, \mathcal{C}) = \mathrm{poly}(1/\alpha, 1/\varepsilon, \log(1/\beta), \mathrm{dim}(\mathcal{C}))$ i.i.d. samples from $D_{XY}$ and $\mathrm{poly}(m)$ running time to output a hypothesis that has error at most $\alpha$, with probability $1 - \beta$, in an $\varepsilon$-DP way. Then, for any $\alpha', \rho, \beta' \in (0,1)$ there is a $\rho$-replicable learner $\mathcal{A}'$ that outputs a hypothesis with error at most $\alpha'$ with probability at least $1 - \beta'$ and requires $\mathrm{poly}(1/\alpha', 1/\rho, \log(1/\beta'), \mathrm{dim}(\mathcal{C}))$ i.i.d. samples from $D_{XY}$ and $\mathrm{poly}(1/\alpha', 1/\rho, \log(1/\beta')) \cdot \exp(\mathrm{dim}(\mathcal{C}))$ running time.

In the above, we denote by $\mathrm{dim}(\mathcal{C})$ some dimension that describes the complexity of the concept class $\mathcal{C}$ that arises in the sample complexity of our pure DP learner. A natural candidate is the representation dimension [Kasiviswanathan et al., 2011]. As we alluded to before, this transformation is efficient with respect to the parameters $\alpha, \beta, \rho$. On the other side, the sample complexity is polynomial in the representation dimension but the running time is exponential. We leave as an open question whether it is possible to avoid this dependence. We remark that, in principle, one could use the reduction from Bun et al. [2023]. The catch is that this reduction is based on correlated sampling, so it requires (i) the output space of the algorithm to be finite and (ii) even under finite output spaces, it needs exponential time in the size of that space.

6 Conclusion

In this work, we have studied the computational aspects of replicability and several connections to other important notions in learning theory, including online learning, SQ learning, and DP PAC learning. We believe that there are several interesting questions left open from our work. First, it would be interesting to see if there is a computationally efficient transformation from online learners to replicable learners. Then, it would be important to derive replicable learners from pure DP learners which are efficient with respect to the complexity of the underlying concept class. Regarding parities, it is still open whether we can design efficient replicable algorithms for every distribution $D$.

Acknowledgments

Alkis Kalavasis was supported by the Institute for Foundations of Data Science at Yale. Amin Karbasi acknowledges funding in direct support of this work from NSF (IIS-1845032), ONR (N00014-19-1-2406), and the AI Institute for Learning-Enabled Optimization at Scale (TILOS). Grigoris Velegkas was supported in part by the AI Institute for Learning-Enabled Optimization at Scale (TILOS). Felix Zhou acknowledges the support of the Natural Sciences and Engineering Research Council of Canada (NSERC).

References

Noga Alon, Roi Livni, Maryanthe Malliaris, and Shay Moran. Private PAC learning implies finite Littlestone dimension. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 852–860, 2019.
Noga Alon, Mark Bun, Roi Livni, Maryanthe Malliaris, and Shay Moran. Private and online learnability are equivalent. Journal of the ACM (JACM), 69(4):1–34, 2022.
Monya Baker. 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 2016.
Philip Ball.
Is AI leading to a reproducibility crisis in science? Nature, 624(7990):22–25, 2023.
Amos Beimel, Kobbi Nissim, and Uri Stemmer. Characterizing the sample complexity of private learners. In Robert D. Kleinberg, editor, Innovations in Theoretical Computer Science, ITCS '13, Berkeley, CA, USA, January 9-12, 2013, pages 97–110. ACM, 2013. doi: 10.1145/2422436.2422450. URL https://doi.org/10.1145/2422436.2422450.
Guy Blanc, Jane Lange, Ali Malik, and Li-Yang Tan. Popular decision tree algorithms are provably noise tolerant. In International Conference on Machine Learning, pages 2091–2106. PMLR, 2022a.
Guy Blanc, Jane Lange, Mingda Qiao, and Li-Yang Tan. Properly learning decision trees in almost polynomial time. Journal of the ACM, 69(6):1–19, 2022b.
Guy Blanc, Jane Lange, Ali Malik, and Li-Yang Tan. Lifting uniform learners via distributional decomposition. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, pages 1755–1767, 2023.
Avrim Blum, Adam Kalai, and Hal Wasserman. Noise-tolerant learning, the parity problem, and the statistical query model. Journal of the ACM (JACM), 50(4):506–519, 2003.
Avrim Blum, Cynthia Dwork, Frank McSherry, and Kobbi Nissim. Practical privacy: the SuLQ framework. In Proceedings of the Twenty-Fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 128–138, 2005.
Avrim L. Blum. Separating distribution-free and mistake-bound learning models over the Boolean domain. SIAM Journal on Computing, 23(5):990–1000, 1994.
Mark Bun. A computational separation between private learning and online learning. Advances in Neural Information Processing Systems, 33:20732–20743, 2020.
Mark Bun, Roi Livni, and Shay Moran. An equivalence between private classification and online prediction. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 389–402. IEEE, 2020.
Mark Bun, Marco Gaboardi, Max Hopkins, Russell Impagliazzo, Rex Lei, Toniann Pitassi, Satchit Sivakumar, and Jessica Sorrell. Stability is stable: Connections between replicability, privacy, and adaptive generalization. In Barna Saha and Rocco A. Servedio, editors, Proceedings of the 55th Annual ACM Symposium on Theory of Computing, STOC 2023, Orlando, FL, USA, June 20-23, 2023, pages 520–527. ACM, 2023. doi: 10.1145/3564246.3585246. URL https://doi.org/10.1145/3564246.3585246.
Mark Bun, Aloni Cohen, and Rathin Desai. Private PAC learning may be harder than online learning. In Claire Vernade and Daniel Hsu, editors, International Conference on Algorithmic Learning Theory, 25-28 February 2024, La Jolla, California, USA, volume 237 of Proceedings of Machine Learning Research, pages 362–389. PMLR, 2024. URL https://proceedings.mlr.press/v237/bun24a.html.
Clément L. Canonne, Dana Ron, and Rocco A. Servedio. Testing probability distributions using conditional samples. SIAM Journal on Computing, 44(3):540–616, 2015.
Clément L. Canonne, Xi Chen, Gautam Kamath, Amit Levi, and Erik Waingarten. Random restrictions of high dimensional distributions and uniformity testing with subcube conditioning. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 321–336. SIAM, 2021.
Zachary Chase, Bogdan Chornomaz, Shay Moran, and Amir Yehudayoff. Local Borsuk-Ulam, stability, and replicability. arXiv preprint arXiv:2311.01599, 2023a.
Zachary Chase, Shay Moran, and Amir Yehudayoff. Stability and replicability in learning.
In 64th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2023, Santa Cruz, CA, USA, November 6-9, 2023, pages 2430–2439. IEEE, 2023b. doi: 10.1109/FOCS57990.2023.00148. URL https://doi.org/10.1109/FOCS57990.2023.00148.
Peter Dixon, Aduri Pavan, Jason Vander Woude, and N. V. Vinodchandran. List and certificate complexities in replicable learning. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10-16, 2023, 2023.
Aryeh Dvoretzky, Jack Kiefer, and Jacob Wolfowitz. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. The Annals of Mathematical Statistics, pages 642–669, 1956.
Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7, 2006. Proceedings 3, pages 265–284. Springer, 2006.
Eric Eaton, Marcel Hussing, Michael Kearns, and Jessica Sorrell. Replicable reinforcement learning. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10-16, 2023, 2023.
Hossein Esfandiari, Alkis Kalavasis, Amin Karbasi, Andreas Krause, Vahab Mirrokni, and Grigoris Velegkas. Replicable bandits. In The Eleventh International Conference on Learning Representations, 2023a.
Hossein Esfandiari, Amin Karbasi, Vahab Mirrokni, Grigoris Velegkas, and Felix Zhou. Replicable clustering. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10-16, 2023, 2023b.
Vitaly Feldman, Elena Grigorescu, Lev Reyzin, Santosh S. Vempala, and Ying Xiao. Statistical algorithms and a lower bound for detecting planted cliques. Journal of the ACM (JACM), 64(2):1–37, 2017.
Dimitris Fotakis, Alkis Kalavasis, and Christos Tzamos. Efficient parameter estimation of truncated Boolean product distributions. In Conference on Learning Theory, pages 1586–1600. PMLR, 2020.
Dimitris Fotakis, Alkis Kalavasis, Vasilis Kontonis, and Christos Tzamos. Efficient algorithms for learning from coarse labels. In Conference on Learning Theory, pages 2060–2079. PMLR, 2021.
Dimitris Fotakis, Alkis Kalavasis, and Christos Tzamos. Perfect sampling from pairwise comparisons. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022.
Kristian Georgiev and Samuel Hopkins. Privacy induces robustness: Information-computation gaps and sparse mean estimation. Advances in Neural Information Processing Systems, 35:6829–6842, 2022.
Badih Ghazi, Noah Golowich, Ravi Kumar, and Pasin Manurangsi. Sample-efficient proper PAC learning with approximate differential privacy. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 183–196, 2021a.
Badih Ghazi, Ravi Kumar, and Pasin Manurangsi. User-level differentially private learning via correlated sampling. In Marc'Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 20172–20184, 2021b.
Surbhi Goel, Aravind Gollakota, and Adam Klivans. Statistical-query lower bounds via functional gradients. Advances in Neural Information Processing Systems, 33:2147–2158, 2020.
Oded Goldreich, Shafi Goldwasser, and Silvio Micali. How to construct random functions. Journal of the ACM (JACM), 33(4):792–807, 1986.
Alon Gonen, Elad Hazan, and Shay Moran. Private learning implies online learning: An efficient reduction. Advances in Neural Information Processing Systems, 32, 2019.
Themistoklis Gouleakis, Christos Tzamos, and Manolis Zampetakis. Faster sublinear algorithms using conditional sampling. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1743–1757. SIAM, 2017.
Anupam Gupta, Moritz Hardt, Aaron Roth, and Jonathan Ullman. Privately releasing conjunctions and the statistical query barrier. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, pages 803–812, 2011.
Russell Impagliazzo, Rex Lei, Toniann Pitassi, and Jessica Sorrell. Reproducibility in learning. In Stefano Leonardi and Anupam Gupta, editors, STOC '22: 54th Annual ACM SIGACT Symposium on Theory of Computing, Rome, Italy, June 20-24, 2022, pages 818–831. ACM, 2022. doi: 10.1145/3519935.3519973. URL https://doi.org/10.1145/3519935.3519973.
Svante Janson. Tail bounds for sums of geometric and exponential variables. Statistics & Probability Letters, 135:1–6, 2018.
Alkis Kalavasis, Amin Karbasi, Shay Moran, and Grigoris Velegkas. Statistical indistinguishability of learning algorithms. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 15586–15622. PMLR, 2023. URL https://proceedings.mlr.press/v202/kalavasis23a.html.
Alkis Kalavasis, Amin Karbasi, Kasper Green Larsen, Grigoris Velegkas, and Felix Zhou. Replicable learning of large-margin halfspaces. arXiv preprint arXiv:2402.13857, 2024.
Amin Karbasi, Grigoris Velegkas, Lin Yang, and Felix Zhou. Replicability in reinforcement learning. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10-16, 2023, 2023.
Shiva Prasad Kasiviswanathan, Homin K. Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. What can we learn privately? SIAM Journal on Computing, 40(3):793–826, 2011.
Michael Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM (JACM), 45(6):983–1006, 1998.
Michael J. Kearns and Umesh Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994.
Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285–318, 1988.
Pascal Massart. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. The Annals of Probability, pages 1269–1283, 1990.
Shay Moran, Hilla Schefler, and Jonathan Shafer. The Bayesian stability zoo. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10-16, 2023, 2023.
Joelle Pineau, Koustuv Sinha, Genevieve Fried, Rosemary Nan Ke, and Hugo Larochelle. ICLR reproducibility challenge 2019. ReScience C, 5(2):5, 2019.

A Notation and Preliminaries

For standard PAC learning definitions, we refer to the book of Kearns and Vazirani [1994].

A.1 Replicable Learning

Following the pioneering work of Impagliazzo et al. [2022], we consider the definition of a replicable learning algorithm.

Definition 1.1 (Replicable Algorithm; Impagliazzo et al., 2022). Let $R$ be a distribution over random strings. A learning algorithm $\mathcal{A}$ is $n$-sample $\rho$-replicable under distribution $D$ if for two independent sets $S, S' \sim D^n$ it holds that $\Pr_{S, S' \sim D^n, r \sim R}[\mathcal{A}(S, r) = \mathcal{A}(S', r)] \geq 1 - \rho$. We will say that $\mathcal{A}$ is replicable if the above holds uniformly over all distributions $D$.

In words, $\mathcal{A}$ is replicable if sharing the randomness across two executions on different i.i.d. datasets yields the exact same output with high probability. We can think of $\mathcal{A}$ as a training algorithm which is replicable if, by fixing the random seed, it outputs the exact same model with high probability. One of the most elementary statistical operations we may wish to make replicable is mean estimation. This operation can be phrased more broadly using the language of statistical queries.

Definition A.1 (Statistical Query Oracle; Kearns, 1998). Let $D$ be a distribution over the domain $\mathcal{X}$ and let $\phi$ be a statistical query (a real-valued function of finite samples from $\mathcal{X}$) with true value $v := \lim_{n \to \infty} \phi(X_1, \ldots, X_n) \in \mathbb{R}$, where $X_i \overset{\text{i.i.d.}}{\sim} D$ and the convergence is understood in probability or distribution. Let $(\alpha, \beta) \in (0,1)^2$. A statistical query (SQ) oracle outputs a value $\hat{v}$ such that $|\hat{v} - v| \leq \alpha$ with probability at least $1 - \beta$.

The SQ framework appears in various learning theory contexts (see, e.g., Blum et al. [2003], Gupta et al. [2011], Goel et al. [2020], Fotakis et al. [2021] and the references therein). In the SQ model, the learner interacts with an oracle in the following way: the learner submits a statistical query to the oracle and the oracle returns its true value, after adding some noise to it. The simplest example of a statistical query is the sample mean $\phi(X_1, \ldots, X_n) = \frac{1}{n}\sum_{i=1}^{n} X_i$. Impagliazzo et al. [2022] designed a replicable SQ oracle for sample mean queries with bounded co-domain. Esfandiari et al. [2023b] generalized the result to simultaneously estimate the means of multiple random variables with unbounded co-domain under some regularity conditions on their distributions (cf. Theorem B.4). The idea behind both results is to use a replicable rounding technique introduced in Impagliazzo et al. [2022], which allows one to sacrifice some accuracy of the estimator in exchange for the replicability property.

A.2 Private Learning

Differential Privacy. A foundational notion of algorithmic stability is that of Differential Privacy (DP) [Dwork et al., 2006]. For $a, b, \varepsilon, \delta \in (0,1)$, let $a \approx_{\varepsilon,\delta} b$ denote the statement $a \leq e^{\varepsilon} b + \delta$ and $b \leq e^{\varepsilon} a + \delta$. We say that two probability distributions $P, Q$ are $(\varepsilon, \delta)$-indistinguishable if $P(E) \approx_{\varepsilon,\delta} Q(E)$ for any measurable event $E$.

Definition A.2 (Approximate DP Algorithm; Kasiviswanathan et al., 2011).
A learning algorithm $\mathcal{A}$ is $n$-sample $(\varepsilon, \delta)$-differentially private if for any pair of samples $S, S' \in (\mathcal{X} \times \{0,1\})^n$ that disagree on a single example, the induced posterior distributions $\mathcal{A}(S)$ and $\mathcal{A}(S')$ are $(\varepsilon, \delta)$-indistinguishable.

In the previous definition, when the parameter $\delta = 0$, we say that the algorithm satisfies (pure) $\varepsilon$-DP. We remind the reader that, in the context of PAC learning, any hypothesis class $\mathcal{C}$ can be PAC learned by an approximate differentially private algorithm if and only if it has finite Littlestone dimension, i.e., there is a qualitative equivalence between online learnability and private PAC learnability [Alon et al., 2019, Bun et al., 2020, Ghazi et al., 2021a, Alon et al., 2022].

A.3 Online Learning

We consider the no-regret model of online learning. Recall Littlestone's model of (realizable) online learning [Littlestone, 1988], defined via a two-player game between a learner and an adversary. Let $\mathcal{C}$ be a given concept class. In each round $t = 1, \ldots, T$ of the game, where $T$ is a time horizon known to a (randomized) learner, the interaction is the following: 1) The adversary selects some features $x_t \in \{0,1\}^d$. 2) The learner predicts a label $\hat{b}_t \in \{0,1\}$, potentially using randomization. 3) The adversary, who observes the distribution of the choice of the learner but not its realization, chooses the correct label $b_t = c(x_t)$ under the constraint that there exists some $c_t \in \mathcal{C}$ such that $c_t(x_\tau) = b_\tau$ for all $\tau \leq t$. The goal of the learner is to minimize its regret, defined by
$$R_T := \max_{h \in \mathcal{C}} \left( \sum_{t=1}^{T} \mathbb{1}\{\hat{b}_t \neq c(x_t)\} - \sum_{t=1}^{T} \mathbb{1}\{h(x_t) \neq c(x_t)\} \right) = \sum_{t=1}^{T} \mathbb{1}\{\hat{b}_t \neq c(x_t)\},$$
where the equality holds because, by realizability, taking $h = c$ makes the subtracted sum $0$. Thus the learner aims to compete with the best concept in hindsight from the class $\mathcal{C}$. However, realizability ensures that the best such concept makes no mistakes, so the regret is defined only in terms of the number of mistakes of the learner. We now introduce the definition of an efficient no-regret learning algorithm used by Bun [2020]. In the definition below, we write $|c|$ to denote the length of a minimal description of the concept $c$.

Definition A.3 (Efficient No-Regret Learning; [Bun, 2020]). We say that a learner efficiently no-regret learns $\mathcal{C}$ if there is some $\eta > 0$ such that for every adversary, it achieves expected regret $\mathbb{E}[R_T] \leq \mathrm{poly}(d, |c|) \cdot T^{1-\eta}$ using time $\mathrm{poly}(d, |c|, T)$ in every round.

There are two non-standard features of this definition. First, no-regret algorithms are typically only required to achieve regret sublinear in $T$, i.e., $o(T)$, whereas we require it to be strongly sublinear, $T^{1-\eta}$. A stronger condition like this is needed to make the definition nontrivial since the sample space is finite. Indeed, suppose $T = 2^d$; then the trivial random guessing algorithm attains a regret bound of $\frac{T}{2} = d \cdot \frac{T}{2 \log T} = \mathrm{poly}(d) \cdot o(T)$. Many no-regret algorithms, such as the multiplicative weights update algorithm, achieve strongly sublinear regret. Second, it would be more natural to require the learner to run in $\mathrm{poly}(\log T)$ time, the description length of the time horizon, rather than the value of $T$ itself. The relaxed formulation only makes the separation stronger.

B Efficient Replicability and Online Learning

B.1 One-Way Sequences

Our exposition follows that of Bun [2020]. For every dimension $d \in \mathbb{N}$, Blum [1994] defines a concept class $\mathrm{OWS}_d$ consisting of functions over the domain $\{0,1\}^d$ that can be represented using $\mathrm{poly}(d)$ bits and evaluated in $\mathrm{poly}(d)$ time. The concepts of $\mathrm{OWS}_d$ are indexed by bit strings $s \in \{0,1\}^k$, where $k = k(d)$ is an appropriate parameter of the construction: $\mathrm{OWS}_d = \{c_s : s \in \{0,1\}^k\}$. We will usually omit the index $d$ when it is clear from the context.
Each $c_s : \{0, 1\}^d \to \{0, 1\}$ is defined using two efficiently representable and computable functions
$$G : \{0, 1\}^k \times \{0, 1\}^k \to \{0, 1\}^{d-k}, \qquad f : \{0, 1\}^k \times \{0, 1\}^k \to \{0, 1\}$$
that are based on the Goldreich-Goldwasser-Micali pseudorandom function family [Goldreich et al., 1986]. We omit the definition of these functions since it is not needed for our purposes, and refer the reader to Blum [1994] for details. Intuitively, $G(i, s)$ computes the string $\sigma_i$ described in the introduction and $f(i, s)$ computes its label $b_i$. We can think of s as the random seed, generated from some source of true randomness, which is then used to construct G, f. For convenience, we identify $\{0, 1\}^k$ with $[2^k]$. Then $c_s$ is defined as
$$c_s(i, \sigma) = \begin{cases} 1, & \text{if } G(i, s) = \sigma \text{ and } f(i, s) = 1, \\ 0, & \text{else}. \end{cases}$$
We see that $c_s(i, \sigma)$ encodes both the string $\sigma_i$ as well as its label $b_i$, for every random seed $s \in \{0, 1\}^k$. The two relevant properties of the strings $\sigma_i$ are summarized below.

Proposition B.1 (Forward is Easy; Blum, 1994). There is an efficiently computable function $\mathsf{ComputeForward} : \{0, 1\}^k \times \{0, 1\}^k \times \{0, 1\}^{d-k} \to \{0, 1\}^{d-k} \times \{0, 1\}$ such that for every $j > i$, $\mathsf{ComputeForward}(j, i, G(i, s)) = (G(j, s), f(j, s))$.

Proposition B.2 (Reverse is Hard; Blum, 1994). Assuming the existence of one-way functions, there exist functions $G : \{0, 1\}^k \times \{0, 1\}^k \to \{0, 1\}^{d-k}$ and $f : \{0, 1\}^k \times \{0, 1\}^k \to \{0, 1\}$ satisfying the following. Let O be an oracle that on input $(j, i, G(i, s))$ outputs $(G(j, s), f(j, s))$ for any $j > i$. Let A be any polynomial-time randomized algorithm and let $A^O$ denote the algorithm with access to the oracle O. For every $i \in \{0, 1\}^k$,
$$\Pr\left[A^O(i, G(i, s)) = f(i, s)\right] \leq \frac{1}{2} + \mathrm{negl}(d),$$
where the probability is taken over the internal randomness of A and a uniformly random $s \sim U(\{0, 1\}^k)$.

The above proposition states that no efficient algorithm can outperform random guessing when trying to determine the label of a string, even given access to a compute-forward oracle.

B.2 Hardness of Efficiently Online Learning OWS

Blum [1994] used Proposition B.2 to show that OWS cannot be learned in the mistake bound model, a more stringent model of online learning than the no-regret setting. Later, Bun [2020] adapted the argument to the no-regret setting, making the separation stronger.

Theorem B.3 ([Bun, 2020]). Assuming the existence of one-way functions, OWS cannot be learned by an efficient no-regret algorithm.

Let |c| denote the length of a minimal description of the concept c. Recall that a learner efficiently no-regret learns a class C if there exists $\eta > 0$ such that for every adversary and any $c \in C$, it achieves $\mathbb{E}[R_T] \leq \mathrm{poly}(d, |c|) \cdot T^{1-\eta}$ (strongly sublinear regret) using time $\mathrm{poly}(d, |c|, T)$ in every round. As mentioned in Appendix A.3, this requirement of strongly sublinear regret is necessary to make the definition non-trivial, as the random guessing algorithm attains o(T) regret in finite domains.

B.3 Replicable Query Rounding

Impagliazzo et al. [2022] designed replicable statistical query oracles (cf. Definition A.1) for bounded co-domains, and Esfandiari et al. [2023b] generalized their results to multiple general queries with unbounded co-domain (cf. Theorem B.4), assuming some regularity conditions on the queries. We illustrate the idea behind the rounding procedure applied to the task of mean estimation. An initial observation is that across two executions, the empirical mean concentrates around the true mean with high probability. Next, we discretize the real line into equal-length intervals with a random shift.
It can be shown that both points fall into the same random interval with high probability, and outputting the midpoint of said interval yields the replicability guarantee. For completeness, we include the pseudocode and a proof of the replicable rounding procedure.

Theorem B.4 (Replicable Rounding; Impagliazzo et al., 2022, Esfandiari et al., 2023b). Let D be a distribution over some domain X. Let $\alpha, \rho \in (0, 1)$ and $\beta \in (0, \rho/3)$. Suppose we have a sequence of statistical queries $g_1, \dots, g_T : X^n \to \mathbb{R}$ with true values $\mu_1, \dots, \mu_T$, and that sampling n independent points from D ensures that $\max_{t \in [T]} |g_t(x_1, \dots, x_n) - \mu_t| \leq \alpha$ with probability at least $1 - \beta$. Then there is a polynomial-time postprocessing procedure rRound (cf. Algorithm B.1) such that each composition $\mathrm{rRound}(g_t, \alpha, \rho)$ is ρ-replicable. Moreover, the outputs $\hat\mu_t = \mathrm{rRound}(g_t(x_1, \dots, x_n), \alpha, \rho)$ satisfy $\max_{t \in [T]} |\hat\mu_t - \mu_t| \leq 4\alpha/\rho$ with probability at least $1 - \beta$.

Note that if we wish for the sequence of rounding steps to be ρ-replicable overall, we can simply run each rounding step with parameter ρ/T.

Algorithm B.1 Replicable Rounding
1: rRound(query values $g_1, \dots, g_T$, accuracy α, replicability ρ):
2: $L \leftarrow 6\alpha/\rho$
3: Sample $L_0 \sim U[0, L)$
4: Discretize the real line into the disjoint intervals $\dots, [L_0 - L, L_0), [L_0, L_0 + L), [L_0 + L, L_0 + 2L), \dots$
5: for $t = 1, \dots, T$ do
6:   Round $g_t$ to the midpoint $\hat\mu_t$ of the interval containing it
7: end for
8: return $\hat\mu \in \mathbb{R}^T$

Proof (Theorem B.4). Discretize the real line into the disjoint intervals $\dots, [-L, 0), [0, L), [L, 2L), \dots$; we will choose the value of L later. Consider adding a uniformly random offset $L_0 \sim U[0, L)$ so that the discretization becomes $\dots, [L_0 - L, L_0), [L_0, L_0 + L), [L_0 + L, L_0 + 2L), \dots$. For each $t \leq T$, we round the estimate $g_t(x_{1:n}) := g_t(x_1, \dots, x_n)$ to the midpoint of the interval it falls into. Let $\hat\mu_t$ be the rounded estimate that we output. From here on, we condition on the event $\max_t |g_t(x_{1:n}) - \mu_t| \leq \alpha$.

Replicability of Algorithm B.1. Fix $t \leq T$. Consider the outputs $\hat\mu_t, \hat\mu'_t$ across two runs. As long as the raw estimates $g_t(x_{1:n}), g_t(x'_{1:n})$ fall in the same interval, the outputs will be exactly the same. The two estimates are separated by an interval boundary with probability at most
$$\frac{|g_t(x_{1:n}) - g_t(x'_{1:n})|}{L} \leq \frac{2\alpha}{L}.$$
Choosing $L = 6\alpha/\rho$ ensures this value is at most ρ/3. Accounting for the $2\beta \leq 2\rho/3$ probability of the estimates failing to concentrate, the output $\hat\mu_t$ is ρ-replicable.

Correctness of Algorithm B.1. The rounding incurs an additive error of at most L/2. The choice of $L = 6\alpha/\rho$ means the total error is at most $\alpha + L/2 = \alpha + 3\alpha/\rho \leq 4\alpha/\rho$.

B.4 Replicable Quantile Estimation

In this section we provide an algorithm that replicably estimates quantiles of a distribution. This will be useful in providing the computational separation between replicable PAC learning and online learning. We believe that it can have applications beyond the scope of our work. We first present a well-known concentration inequality regarding CDFs of random variables, due to Dvoretzky, Kiefer, and Wolfowitz.

Theorem B.5 (Dvoretzky-Kiefer-Wolfowitz Inequality; Dvoretzky et al., 1956, Massart, 1990). Let $X_1, \dots, X_n$ be i.i.d. random variables with CDF F. Let $F_n$ denote the empirical distribution function given by $F_n(x) := \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\{X_i \leq x\}$. For any $\alpha > 0$,
$$\Pr\left[\sup_{x \in \mathbb{R}} |F_n(x) - F(x)| > \alpha\right] \leq 2\exp(-2n\alpha^2).$$
In other words, $n = (1/2\alpha^2)\ln(2/\beta)$ samples suffice to ensure that, with probability at least $1 - \beta$, the empirical CDF uniformly estimates the true CDF with error at most α. The replicable quantile estimation algorithm for discrete and bounded distributions can be found in Algorithm B.2; before presenting it, we give a short code sketch of the rounding subroutine.
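To make the rounding step of Theorem B.4 concrete, the following is a minimal Python sketch; the names (r_round, shared_rng) and the interface are ours, not from the cited works, and the shared random offset stands in for the shared randomness r.

```python
# A minimal sketch of the replicable rounding step (Algorithm B.1).
# The key point: the random grid offset L0 is drawn from randomness SHARED
# across the two executions (e.g., both seed the same generator).
import math
import random

def r_round(values, alpha, rho, shared_rng):
    """Round each alpha-accurate estimate to the midpoint of a randomly
    shifted grid of cell width L = 6 * alpha / rho. Two runs whose raw
    estimates are 2*alpha-close land in the same cell, hence return the
    same midpoint, except with probability at most rho/3 per query."""
    L = 6.0 * alpha / rho
    L0 = shared_rng.uniform(0.0, L)  # shared random shift of the grid
    rounded = []
    for v in values:
        k = math.floor((v - L0) / L)          # cell [L0 + k*L, L0 + (k+1)*L)
        rounded.append(L0 + k * L + L / 2.0)  # midpoint of that cell
    return rounded

# Two executions with the same seed share the grid; their outputs agree
# unless a cell boundary separates the two raw estimates:
rng1, rng2 = random.Random(0), random.Random(0)
out1 = r_round([0.49], alpha=0.05, rho=0.1, shared_rng=rng1)
out2 = r_round([0.51], alpha=0.05, rho=0.1, shared_rng=rng2)
# out1 == out2 except with probability |0.51 - 0.49| / L
```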
The high-level idea is to perform a (replicable) binary search over the support of the distribution and to check whether the empirical CDF evaluated at some point x is above or below the target quantile q.

Algorithm B.2 Replicable Quantile Estimation
1: rQuantileEst(samples $x_1, \dots, x_n$, quantile q, accuracy α, replicability ρ, confidence β):
2: $F_n(i) \leftarrow \frac{1}{n}\sum_{j=1}^{n} \mathbb{1}\{x_j \leq i\}$ // implicitly, for all $i \in [R]$
3: $\ell \leftarrow 0$; $h \leftarrow R$
4: while $\ell < h - 1$ do
5:   $m \leftarrow \lfloor(\ell + h)/2\rfloor$
6:   $\tilde F_n(m) \leftarrow \mathrm{rRound}(F_n(m), \alpha\rho/4\log_2(R), \rho/\log_2(R))$
7:   if $\tilde F_n(m) \geq q$ then
8:     $h \leftarrow m$
9:   else
10:     $\ell \leftarrow m$
11:   end if
12: end while
13: return h

Theorem B.6 (Replicable Quantile Estimation). Let $\alpha, \rho \in (0, 1)$ and $\beta \in (0, \rho/3)$. Suppose we have access to
$$m = \frac{16\log_2^2(R)}{2\alpha^2\rho^2}\ln\frac{2}{\beta} = O\left(\frac{\log_2^2(R)}{\alpha^2\rho^2}\ln\frac{1}{\beta}\right)$$
i.i.d. samples from some distribution over [R] with CDF F, and let $q \in [0, 1]$ be the desired quantile level. Algorithm B.2 terminates in $O(\mathrm{poly}(m))$ time, is ρ-replicable, and with probability at least $1 - \beta$ outputs some $x \in [R]$ such that $F(x) \geq q - \alpha$ and $F(x - 1) < q + \alpha$.

Our replicable quantile estimation differs from the replicable median algorithm of Impagliazzo et al. [2022] in at least two ways. Firstly, the replicable median algorithm of Impagliazzo et al. [2022] seems to rely heavily on properties of (approximate) medians in order to satisfy the approximation guarantees, whereas our algorithm works regardless of the desired quantile. Secondly, their median algorithm relies on a non-trivial recursive procedure, while our replicable quantile algorithm is considerably simpler and is based on concentration of the CDF through the DKW inequality.

Proof (Theorem B.6). By the DKW inequality (Theorem B.5), sampling $m = \frac{1}{2(\alpha\rho/4\log_2(R))^2}\ln\frac{2}{\beta}$ points from our distribution ensures that
$$\sup_{x \in [R]} |F_n(x) - F(x)| \leq \frac{\alpha\rho}{4\log_2(R)}$$
with probability at least $1 - \beta$. The proof is divided into two steps: we first show the correctness of our algorithm and then argue about its replicability.

Correctness of Algorithm B.2. The loop invariant we wish to maintain is that $F(h) \geq q - \alpha$ and $F(\ell) < q + \alpha$. This is certainly true initially, since our distribution is over $[R] = \{1, \dots, R\}$ and hence there is no mass at or below $\ell = 0$, while all the mass is at or below $h = R$. Let $m_1, \dots, m_T$ denote the midpoints chosen by binary search, with $T = \log_2(R)$. By Theorem B.4, if we condition on the success of all executions of $\mathrm{rRound}(F_n(m_t), \alpha\rho/4\log_2(R), \rho/\log_2(R))$, then their outputs $\tilde F_n(m_t)$ estimate $F(m_t)$ with additive error at most
$$4 \cdot \frac{\alpha\rho/4\log_2(R)}{\rho/\log_2(R)} = \alpha.$$
We update $h \leftarrow m_t$ only if $q \leq \tilde F_n(m_t) \leq F(m_t) + \alpha$, i.e., only if $q - \alpha \leq F(m_t)$. Similarly, we update $\ell \leftarrow m_t$ only if $q > \tilde F_n(m_t) \geq F(m_t) - \alpha$, i.e., only if $q + \alpha > F(m_t)$. Hence the loop invariant is maintained. At termination, $\ell = h - 1$ with $F(h) \geq q - \alpha$ and $F(h - 1) < q + \alpha$ by the loop invariant, as desired.

Replicability of Algorithm B.2. By Theorem B.4, assuming all prior branching decisions are identical, each new branching decision in Algorithm B.2 is $(\rho/\log_2(R))$-replicable. Since we make at most $\log_2(R)$ decisions, the entire algorithm is ρ-replicable.

B.5 An Efficient Replicable Learner for OWSd

Equipped with the replicable quantile estimator from Theorem B.6, we are ready to present our efficient replicable $\mathsf{OWS}_d$ PAC learner. We outline below a high-level description of our algorithm, which can be found in Algorithm B.3. 1) We first replicably estimate the mass of the elements that have positive labels; if it is much smaller than some threshold $O(\alpha)$, we output the all-zero hypothesis.
2) Then we take sufficiently many samples to get enough data points with positive labels and run the replicable quantile estimation on the marginal distribution of the positively labeled features to obtain some $i^* \in [2^k]$. 3) Next, we take sufficiently many samples to get a positive point at or below $i^*$ and then forward-compute the label of $i^*$. 4) The hypothesis we return is the following: if its input $(i, \sigma)$ has index $i < i^*$, it outputs 0; if its input is exactly $(i^*, \sigma^*)$, it outputs the previously computed label; otherwise, it forward-computes the label using $(i, \sigma)$.

Algorithm B.3 Replicable Learner for OWSd
1: rLearnerOWSd(samples $(i_1, \sigma_1), \dots, (i_m, \sigma_m)$, accuracy α, replicability ρ, confidence β):
2: Let $S^+ := (i_{j_1}, \sigma_{j_1}), \dots, (i_{j_n}, \sigma_{j_n})$ be the subsequence of positive samples (so n is the number of positive samples), where $i_{j_k} \leq i_{j_{k+1}}$.
3: $\hat p \leftarrow \mathrm{rRound}(n/m, \rho\alpha/48, \rho/3)$
4: if $\hat p < \alpha/2$ then
5:   return the all-zero hypothesis
6: end if
7: $i^* \leftarrow \mathrm{rQuantileEst}(\{i_{j_1}, \dots, i_{j_n}\}, \alpha/2, \alpha/4, \rho/3, \beta/3)$
8: if $i_{j_1} > i^*$ then
9:   return FAILURE
10: end if
11: $(\sigma^*, b^*) \leftarrow \mathsf{ComputeForward}(i^*, i_{j_1}, \sigma_{j_1})$
12: return the hypothesis $h(i, \sigma)$ defined by: if $i < i^*$, output 0; if $i = i^*$, output $b^*$ if $\sigma = \sigma^*$ and 0 otherwise; if $i > i^*$, compute $(\hat\sigma, \hat b) \leftarrow \mathsf{ComputeForward}(i, i^*, \sigma^*)$ and output $\hat b$ if $\sigma = \hat\sigma$ and 0 otherwise.

Theorem B.7. Let $\rho, \alpha \in (0, 1)$ and $\beta \in (0, \rho/3)$. Algorithm B.3 is an efficient ρ-replicable (α, β)-PAC learner for $\mathsf{OWS}_d$ with sample complexity $m = \mathrm{poly}(1/\alpha, 1/\rho, \log(1/\beta))$ and time complexity $O(\mathrm{poly}(m))$.

Proof (Theorem B.7). The indicator random variable of a sample being positively labeled is bounded, and its sample mean is precisely n/m. Let $p \in [0, 1]$ denote the probability mass of the positively labeled elements. By Hoeffding's inequality and our choice of m, $|n/m - p| \leq \rho\alpha/48$ with probability at least $1 - \beta/6$. By Theorem B.4, the rounded estimate $\hat p$ is ρ/3-replicable and satisfies $|\hat p - p| \leq \alpha/4$ with probability at least $1 - \beta/6$. From here on, we condition on the success of this event.

Correctness of Algorithm B.3. First suppose $\hat p < \alpha/2$. Then $p \leq \hat p + \alpha/4 < \alpha$, and Algorithm B.3 returns the all-zero hypothesis. In this case, the returned hypothesis makes mistakes on at most an α fraction of the population and we are content. Otherwise, suppose that $\hat p \geq \alpha/2$, so that $p \geq \hat p - \alpha/4 \geq \alpha/4$. Let F denote the CDF of the conditional distribution over the positively labeled samples. By Theorem B.6, the output of rQuantileEst is an index $i^* \in [2^k]$ such that, with probability at least $1 - \beta/6$, $F(i^*) \geq \alpha/4$ and $F(i^* - 1) < 3\alpha/4$. We proceed conditioning on the success of the call to rQuantileEst. Since $F(i^*) \geq \alpha/4$, the probability that $i_{j_1} > i^*$ is at most $(1 - \alpha/4)^n \leq \beta/6$; the last inequality is due to $n \geq \alpha m/8 \geq (4/\alpha)\ln(6/\beta)$ (and $n \geq \alpha m/8$ holds with high probability by a Chernoff bound, since $p \geq \alpha/4$). We proceed conditioning on the event $i_{j_1} \leq i^*$. So either $i_{j_1} = i^*$, in which case we know its string $\sigma^*$ and label $b^*$, or $i_{j_1} < i^*$, in which case Proposition B.1 assures that we can obtain the string $\sigma^*$ and label $b^*$ through the forward computation $(\sigma^*, b^*) \leftarrow \mathsf{ComputeForward}(i^*, i_{j_1}, \sigma_{j_1})$. Now, consider the hypothesis h we output and its action on an input $(i, \sigma)$: if $i < i^*$, then h always answers 0, so it is incorrect only on positively labeled points, which have mass at most $F(i^* - 1) < \alpha$; if $i \geq i^*$, h is always correct. Hence the total population error is at most α, as desired. We conditioned on three events, each of which has failure probability at most β/6; hence the total probability of failure is at most β/2.

Replicability of Algorithm B.3. By Theorem B.4, the output $\hat p$ is ρ/3-replicable. By Theorem B.6, the output $i^*$ is ρ/3-replicable. From our analysis above, $i_{j_1} > i^*$ with probability at most β/6 in each of the two executions.
Thus the total probability of outputting different classifiers is at most $\rho/3 + \rho/3 + 2\beta/6 \leq \rho$, since $\beta < \rho/3$.

B.6 Proof of Theorem 2.1

We now restate and prove Theorem 2.1.

Theorem 2.1. Let $d \in \mathbb{N}$. The following hold: (a) [Bun, 2020] Assuming the existence of one-way functions, the concept class OWS with input domain $\{0, 1\}^d$ cannot be learned by an efficient no-regret algorithm, i.e., an algorithm that for some $\eta > 0$ achieves expected regret $\mathbb{E}[R_T] \leq \mathrm{poly}(d) \cdot T^{1-\eta}$ using time $\mathrm{poly}(d, T)$ in every iteration.¹² (b) (Theorem B.7) The concept class OWS with input domain $\{0, 1\}^d$ can be learned by a ρ-replicable (α, β)-PAC learner with sample complexity $m = \mathrm{poly}(d, 1/\alpha, 1/\rho, \log(1/\beta))$ and $\mathrm{poly}(m)$ running time.

¹²We remark that this stronger requirement in regret is necessary since the trivial random guessing algorithm achieves o(T) regret in finite domains (cf. Appendix A.3).

Proof (Theorem 2.1). Combining the hardness of efficiently online learning OWS (cf. Theorem B.3) with the efficient replicable learner of Theorem B.7 completes the proof of Theorem 2.1.

C Efficient Replicability and SQ Learning: Parities

C.1 The Proof of Lemma 4.2

Algorithm C.1 Replicable Learner for Affine Parities $f(x) = \langle w, x \rangle + b$ under the Uniform Distribution
1: rAffParity(accuracy α, replicability ρ, confidence β):
2: Draw a single sample $(x^{(0)}, f(x^{(0)}))$ where $x^{(0)} \sim U$.
3: Draw $O(d\log(1/\rho\beta))$ samples $(x, f(x))$ so that, with probability at least $1 - \beta$, we obtain d linearly independent offsets $(z, y) := (x + x^{(0)}, f(x) + f(x^{(0)}))$, say $(z^{(1)}, y^{(1)}), \dots, (z^{(d)}, y^{(d)})$.
4: Run Gaussian elimination to obtain the unique solution w such that $\langle w, z^{(i)} \rangle = y^{(i)}$ for each $i \in [d]$.
5: Compute $b = f(x^{(0)}) + \langle w, x^{(0)} \rangle$.
6: Return (w, b).

We now restate and prove Lemma 4.2, which states the correctness of Algorithm C.1.

Lemma 4.2. The concept class of affine parities AffParity over $\{0, 1\}^d$ admits a ρ-replicable algorithm that perfectly learns any concept with respect to the uniform distribution U with probability of success at least $1 - \beta$. The algorithm has $O(\mathrm{poly}(d, \log 1/\rho\beta))$ sample and time complexity.

Proof (Lemma 4.2). For any $i \in [d]$, observe that, over $\mathbb{F}_2$,
$$y^{(i)} := f(x^{(i)}) + f(x^{(0)}) = \langle w, x^{(i)} \rangle + b + \langle w, x^{(0)} \rangle + b = \langle w, x^{(i)} + x^{(0)} \rangle =: \langle w, z^{(i)} \rangle.$$
Thus the dataset of offsets $\{(z^{(i)}, y^{(i)})\}$ uniquely determines the linear function w. Having learned w, we can recover the value of b by evaluating f at any point in the original dataset, say at the first point. Finally, since the algorithm recovers the target concept (w, b) exactly with probability at least $1 - \beta$, two executions produce the same output with probability at least $1 - 2\beta$, which yields ρ-replicability for $\beta \leq \rho/2$.

C.2 Background on Parities and SQ

Statistical Queries. We start with some background on the SQ model.

Definition C.1 (SQ Learning). A concept class C with input space $\{0, 1\}^d$ is learnable from statistical queries with respect to distribution D if there is a learning algorithm A such that for any $c \in C$ and any $\alpha > 0$, A produces an α-approximation of c from statistical queries; furthermore, the running time, the number of queries asked, and the inverse of the smallest tolerance used must be polynomial in d and 1/α.

Definition C.2 (SQ Hardness). Consider a concept class C with input space $\{0, 1\}^d$. Fix a tolerance $\tau = \mathrm{poly}(1/d)$ and accuracy $\alpha = O(1)$. We say that C is SQ-hard under distribution D if any SQ algorithm requires $\omega(\mathrm{poly}(d))$ queries of tolerance at least τ to α-learn C.

The standard way to show SQ hardness is to prove lower bounds on the so-called SQ dimension [Feldman et al., 2017] of the corresponding problems (we omit the formal definition of the SQ dimension as it is beyond the scope of the present work). Such lower bounds on this dimension establish lower bounds on the running time of any SQ algorithm for the problem, not on its sample complexity.
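To make the linear-algebraic viewpoint concrete, both for Algorithm C.1 above and for the standard parity learner discussed next, here is a minimal Python sketch of Gaussian elimination over $\mathbb{F}_2$ applied to the offset trick of Lemma 4.2. The helper names are ours, and the sketch assumes noiseless, realizable samples whose offsets span $\mathbb{F}_2^d$.

```python
# A minimal sketch of learning an affine parity f(x) = <w, x> + b over F_2,
# in the spirit of Algorithm C.1 (simplified, illustrative interface).
import numpy as np

def solve_gf2(A, y):
    """Reduced row echelon form over F_2; returns w with A @ w = y (mod 2),
    assuming the system is consistent."""
    A, y = A.copy() % 2, y.copy() % 2
    n, d = A.shape
    pivots, row = [], 0
    for col in range(d):
        hit = next((r for r in range(row, n) if A[r, col]), None)
        if hit is None:
            continue  # no pivot in this column
        A[[row, hit]] = A[[hit, row]]   # swap rows into pivot position
        y[[row, hit]] = y[[hit, row]]
        for r in range(n):
            if r != row and A[r, col]:  # eliminate above and below
                A[r] ^= A[row]
                y[r] ^= y[row]
        pivots.append(col)
        row += 1
    w = np.zeros(d, dtype=A.dtype)
    for r, col in enumerate(pivots):
        w[col] = y[r]                   # free variables default to 0
    return w

def learn_affine_parity(samples):
    """samples: list of (x, f(x)) with x a 0/1 numpy array; mirrors Alg. C.1."""
    (x0, y0), rest = samples[0], samples[1:]
    Z = np.array([x ^ x0 for x, _ in rest])  # offsets z = x + x^(0) over F_2
    t = np.array([y ^ y0 for _, y in rest])  # labels y = f(x) + f(x^(0)) = <w, z>
    w = solve_gf2(Z, t)                      # unique once the offsets span F_2^d
    b = (y0 + w @ x0) % 2                    # recover the bias from the base point
    return w, b
```

Note that nothing in this sketch is replicable by itself; as Proposition C.4 below shows, which solution Gaussian elimination commits to can depend on which basis vectors happen to appear in the sample.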
PAC and SQ Learning Parities. The standard class of parities Parity is given by the subset $\{f_{S,0} : S \subseteq [d]\}$ of AffParity. The standard algorithm for learning parity functions works by viewing a set of n labelled examples as a set of n linear equations over the finite field with two elements, $\mathbb{F}_2$. Then, Gaussian elimination is used to solve the system and thus find a consistent parity function. This algorithm is extremely brittle to noise: even for the uniform marginal distribution D = U, the problem of learning parities over $\{0, 1\}^d$ is not solvable in the SQ framework. In particular, a standard result is the following:

Fact C.3 ([Kearns, 1998]). Learning parities $\mathsf{Parity} \subseteq \mathsf{AffParity}$ with input space $\{0, 1\}^d$ is SQ-hard, even under the uniform distribution U.

C.3 Gaussian Elimination is not Replicable

In this section, we give a simple example where Gaussian elimination fails to be replicable.

Proposition C.4. For any $d \geq 3$ and $S \subseteq [d]$ with $2 \notin S$, the following holds: there is some $n_d \geq 1$ such that for any $n \geq n_d$, there exists a distribution D such that the n-sample Gaussian elimination algorithm for PAC learning the parity $f(x) = \sum_{i \in S} x_i$ in the realizable setting fails to be 1/17-replicable.

Proof (Proposition C.4). Let $e_1, \dots, e_d$ be the standard basis of $\mathbb{Z}_2^d$, let $p \in (0, 2/3]$, and consider the distribution D such that
$$D(e_1) = p, \qquad D(e_2) = \frac{2}{3} - p, \qquad D(e_i) = \frac{1}{3(d-2)}, \quad i \geq 3.$$
The probability of not observing any $e_1$'s after n samples is $(1-p)^n$. We set $p := 1 - (1/2)^{1/n}$ so that $(1-p)^n = 1/2$. For sufficiently large $n_d \in \mathbb{N}$, we observe all $e_i$, $i \geq 3$, with probability at least 3/4 after n samples. Thus with probability at least 1/4 we observe $e_1, e_3, \dots, e_d$, and with probability at least 1/4 we observe $e_3, \dots, e_d$ but not $e_1$. In the first case, we fully recover S, since we are promised that $2 \notin S$. In the second case, our algorithm is unable to determine whether $1 \in S$, since we do not observe $e_1$, and the best it can do is guess randomly. Thus the probability that two executions output different classifiers is at least $2 \cdot \frac{1}{4} \cdot \frac{1}{4} \cdot \frac{1}{2} = \frac{1}{16} > \frac{1}{17}$.

Proposition C.4 shows that Gaussian elimination fails in general to be replicable, regardless of the number of samples requested.

C.4 Replicably Learning Affine Parities Beyond the Uniform Distribution

We restate our main result.

Theorem C.5. For any dimension d, accuracy α, confidence β, and replicability ρ, there exist some $n = \mathrm{poly}(d, 1/\alpha, 1/\rho, \log(1/\beta))$ and a monotone distribution D over $\{0, 1\}^d$ such that (a) Gaussian elimination is not n-sample replicable with probability $\Omega(1)$ under D for the class AffParity, (b) there exists an n-sample ρ-replicable algorithm for (α, β)-PAC learning the class AffParity with respect to D with $\mathrm{poly}(n)$ runtime, and (c) even learning the class Parity is SQ-hard under the distribution D.

Before proving Theorem C.5, we derive three useful lemmas. Fix a dataset size n to be determined later. We pick the distribution D to be a Boolean product distribution $(p_1, \dots, p_d) \in [0, 1]^d$ such that the first $d - 1$ coordinates are unbiased ($p_i = 1/2$ for $i \in [d-1]$) and the last coordinate is highly biased towards 1; in particular, pick $p_d = D_d(\{1\}) = (1/2)^{1/n}$. This distribution is monotone by construction.

Lemma C.6. The naive n-sample Gaussian elimination algorithm fails to be replicable with constant probability over the distribution $D = (p_1, \dots, p_d)$.

Proof. Consider two independent draws $S_1$ and $S_2$ from D of size n.
With constant probability $\frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}$, $S_1$ contains only vectors with a 1 in the last coordinate, while $S_2$ contains at least one vector with a 0 in the last coordinate. Then there are exactly two hypotheses that satisfy $S_1$: one that contains the last coordinate and one that does not. This can be seen by reducing to an instance of the $(d-1)$-dimensional parities problem obtained by ignoring the last coordinate and flipping the bits of the labels. On the other hand, for n sufficiently large, the subset of vectors $S_2^{(1)} \subseteq S_2$ that have a 1 in the last coordinate is satisfied by the same two hypotheses as $S_1$. However, only one of the two hypotheses also satisfies the other subset $S_2^{(0)} \subseteq S_2$, consisting of entries with a 0 in the last coordinate. Thus the candidate hypotheses that satisfy $S_1$ and $S_2$ differ, and the algorithm can at best output a random guess on $S_1$, which will be inconsistent with the output on $S_2$ with overall probability at least $\frac{1}{2} \cdot \frac{1}{4} = \frac{1}{8}$.

Lemma C.7. The decision tree complexity of D is Θ(1).

Proof. Since D is a product distribution whose first $d - 1$ coordinates are uniform, its pmf depends only on the last coordinate and takes on only 2 different values; hence it is computable by a decision tree of depth Θ(1).

Lemma C.8. The class of affine parities AffParity with input space $\{0, 1\}^d$ is closed under restriction.

Proof. Let f be an arbitrary affine parity function, i.e., $f(x) = b + \sum_{i \in S} x_i$ for some $b \in \{0, 1\}$ and $S \subseteq [d]$, where addition is over $\mathbb{Z}_2$. Consider an arbitrary direction $i' \in [d]$ and an arbitrary bit $b' \in \{0, 1\}$. We verify that the restriction $f_{i'=b'}$ remains in the class of affine parities. If $i' \notin S$, then $f_{i'=b'} = f \in \mathsf{AffParity}$. Now, if $i' \in S$ and $b' = 0$, it holds that $f_{i'=b'}(x) = b + \sum_{i \in S \setminus \{i'\}} x_i \in \mathsf{AffParity}$, and if $b' = 1$, it holds that $f_{i'=b'}(x) = (b + 1) + \sum_{i \in S \setminus \{i'\}} x_i \in \mathsf{AffParity}$. Hence, affine parities are closed under restriction.

We are now ready to prove Theorem C.5.

Proof (Theorem C.5). (a) By Lemma C.6. (b) We can employ the lifting algorithm of Theorem 3.2 to replicably and efficiently learn affine parities, since (i) the decision tree complexity of D is constant, (ii) the distribution is monotone, (iii) affine parities are closed under restriction, and (iv) we have an efficient replicable algorithm for affine parities with respect to the uniform distribution (cf. Lemma 4.2). (c) If we consider the subclass $C' = \{f_{S,0} : S \subseteq [d-1]\}$, we have that $\mathbb{E}_{x \sim D}[f(x)g(x)] = 0$ for any pair of distinct functions $f, g \in C'$ (viewed as ±1-valued functions), due to the product structure and the unbiasedness of the first $d - 1$ coordinates of D. This implies that the SQ dimension of the class of parities under D is at least $2^{d-1} = \Omega(2^d)$, and so learning parities is SQ-hard under D.

D Lifting Replicable Uniform Learners

In this section, we provide our general lifting framework for replicable learning algorithms, which is needed in order to show Theorem C.5.

D.1 Preliminaries for Replicable Uniform Lifting

We start this section with some definitions which are necessary to formally state our main result, Theorem 3.2.

Definition D.1 (Decision Tree (DT)). A decision tree $T : \{0, 1\}^d \to \mathbb{R}$ is a binary tree whose internal nodes query a particular coordinate and whose leaves are labelled by values. Each instance $x \in \{0, 1\}^d$ follows a unique root-to-leaf path in T: at any internal node, it follows either the left or the right branch depending on the value of the queried coordinate, until a leaf is reached and its value T(x) is returned.

Definition D.2 (Decision Tree Distribution). We say that a distribution D over $\{0, 1\}^d$ is representable by a depth-ℓ decision tree if its pmf is computable by a depth-ℓ decision tree T.
Specifically, each leaf t is labelled by a value $p_t$, so that $D(x) = p_t$ for all $x \in t$. This means that the conditional distribution over the points that reach any given leaf is uniform.

Let us consider a distribution D over $\{0, 1\}^d$. The definitions above suggest a natural measure of the complexity of D: the decision tree complexity.

Definition D.3 (Decision Tree Complexity). The decision tree complexity of a distribution D over $\{0, 1\}^d$ is the smallest integer ℓ such that its probability mass function (pmf) can be represented by a depth-ℓ decision tree.

Next, we define a useful structured family of distributions. Let the binary relation $\preceq$ denote pointwise comparison.

Definition D.4 (Monotone Distribution). A probability distribution D over $\{0, 1\}^d$ is called monotone if for any $x \preceq y$ it holds that $D(x) \leq D(y)$.

Just as in Blanc et al. [2023], for arbitrary non-monotone probability measures, our algorithm requires access to a conditional sampling oracle (cf. Definition D.5), defined below.

Definition D.5 (Conditional Sampling Oracle; Blanc et al., 2023). A conditional sampling oracle for a distribution D over $\{0, 1\}^d$ proceeds as follows: suppose we condition on some subset I of the d variables having a fixed value $b \in \{0, 1\}^I$. Then the oracle generates a sample from the true conditional distribution, i.e., draws a sample $x \sim D(\cdot \mid x_I = b)$.

Definition D.6. For a tree T and leaves $\ell \in T$, we define $\mathbb{E}_{\ell \sim T}[f(\ell)] := \sum_{\ell \in T} 2^{-|\ell|} f(\ell)$, where $|\ell|$ is the depth of the path that reaches leaf ℓ. As an example of this notation, note that
$$\mathbb{E}_{\ell \sim T}\left[\mathbb{E}_{x \sim U_d}[f(x) \mid x \in \ell]\right] = \sum_{\ell \in T} U(\ell) \, \mathbb{E}_{x \sim U_d}[f(x) \mid x \in \ell],$$
where $U(\ell)$ is the uniform mass of the leaf ℓ, i.e., $U(\ell) = 2^{-d} \cdot 2^{d - |\ell|} = 2^{-|\ell|}$. This implies that $\mathbb{E}_{\ell \sim T}[\mathbb{E}_{x \sim U_d}[f(x) \mid x \in \ell]] = \mathbb{E}_{x \sim U_d}[f(x)]$, since the set of leaves partitions $\{0, 1\}^d$.

D.2 Main Result: Lifting Replicable Uniform Learners

We now restate our main result, Theorem 3.2. For the sake of presentation, we defer its proof to Appendix E.

Theorem 3.2 (Lifting Replicable Uniform Learners). Consider a concept class C of functions $f : \{0, 1\}^d \to \{0, 1\}$ closed under restrictions. Suppose we are given black-box access to an algorithm A such that for any $\alpha', \rho', \beta' \in (0, 1)$, given $\mathrm{poly}(d, 1/\alpha', 1/\rho', \log(1/\beta'))$ samples from U, (i) A is ρ'-replicable with respect to the uniform distribution U, (ii) A PAC learns C under the uniform distribution to accuracy α' and confidence β', and (iii) A terminates in time $\mathrm{poly}(d, 1/\alpha', 1/\rho', \log(1/\beta'))$. Let $\alpha, \rho \in (0, 1)$ and $\beta \in (0, \rho/3)$. For $m = \mathrm{poly}(d, 1/\alpha, 1/\rho, \log(1/\beta))$ and $M = \mathrm{poly}(d, 1/\alpha, 1/\rho, \log(1/\beta))^{O(\ell)}$, the following cases hold: (a) If D is a monotone distribution over $\{0, 1\}^d$ representable by a depth-ℓ decision tree, there is an algorithm that draws M samples from D, is ρ-replicable with respect to D, PAC learns C under D with accuracy α and confidence β, and terminates in $\mathrm{poly}(M)$ time. (b) If D is an arbitrary distribution over $\{0, 1\}^d$ representable by a depth-ℓ decision tree, there is an algorithm that draws M labeled examples as well as M conditional samples (cf. Definition D.5) from D, is ρ-replicable with respect to D, PAC learns C with respect to D with accuracy α and confidence β, and terminates in $\mathrm{poly}(M)$ time.

We emphasize that our result for monotone distributions D only requires sample access to D. For arbitrary non-monotone probability measures, our algorithm requires access to a conditional sampling oracle (cf. Definition D.5).

E Proof of Theorem 3.2

E.1 Preliminaries

We start this section with some useful definitions, coming from the work of Blanc et al. [2023].

Definition E.1 (Weighting Function of a Distribution).
Let D be an arbitrary distribution over $\{0, 1\}^d$. We define the weighting function $f_D(x) = 2^d \cdot D(x)$. Remark that the scaled pmf satisfies
$$\mathbb{E}_{x \sim U}[f_D(x)] = \sum_{x \in \{0, 1\}^d} 2^{-d} \cdot 2^d D(x) = 1.$$
For $x \in \{0, 1\}^d$ and $i \in [d]$, we write $x^{\oplus i}$ to denote the binary vector obtained from x by flipping its i-th bit.

Definition E.2 (Influence of Variables on Distributions). Let D be a distribution over $\{0, 1\}^d$ and let $f_D(x) = 2^d \cdot D(x)$ be its pmf scaled up by the domain size. The influence of a coordinate $i \in [d]$ on the distribution D is the quantity
$$\mathrm{Infl}_i(f_D) := \mathbb{E}_{x \sim U}\left[|f_D(x) - f_D(x^{\oplus i})|\right].$$
We also define the total influence $\mathrm{Infl}(f_D) = \sum_{i \in [d]} \mathrm{Infl}_i(f_D)$.

Intuitively, the influence of a variable measures how far the distribution is from uniform along that variable. Suppose $f_D$ is computable by a depth-ℓ decision tree T. Then we can write, for any $x \in \{0, 1\}^d$,
$$f_D(x) = 2^d \sum_{\ell \in T} p_\ell \, \mathbb{1}\{x \in \ell\},$$
where the sum is taken over the leaves of T and $p_\ell := D(y)$ is the (same) mass D assigns to every $y \in \ell$. Then we have
$$\mathrm{Infl}_i(f_D) = \mathbb{E}_{x \sim U}|f_D(x) - f_D(x^{\oplus i})| = \sum_{x \in \{0, 1\}^d} |D(x) - D(x^{\oplus i})| = \sum_{x \in \{0, 1\}^d} \left|\sum_{\ell \in T} p_\ell \left(\mathbb{1}\{x \in \ell\} - \mathbb{1}\{x^{\oplus i} \in \ell\}\right)\right|.$$
If i is not queried by any internal node of T, then x and $x^{\oplus i}$ always belong to the same leaf and this value is 0. Otherwise, we can trivially bound this value by 2 using the triangle inequality. This discussion leads to the following observation.

Fact E.3. If $f_D$ is computable by a depth-ℓ decision tree, then its total influence is at most $\mathrm{Infl}(f_D) \leq 2\ell$.

For monotone distributions over the Boolean hypercube, influences have a convenient form:

Proposition E.4 (Lemma 6.2 in Blanc et al. [2023]). If the distribution D over $\{0, 1\}^d$ is monotone, then for any $i \in [d]$, $\mathrm{Infl}_i(f_D) = \mathbb{E}_D[x_i]$.

Definition E.5 (Restrictions). Given a sequence of (coordinate, value) pairs $\pi = \{(i_1, b_1), \dots, (i_k, b_k)\}$, we use $x^\pi$ to represent x with the coordinates in π overwritten/inserted with their respective values. For a function $f : \{0, 1\}^d \to \mathbb{R}$, we let $f_\pi$ be the function that maps x to $f(x^\pi)$.

Definition E.6 (Everywhere τ-Influential; Definition 4 in Blanc et al. [2022b]). For any function $f : \{0, 1\}^d \to \mathbb{R}$, influence threshold $\tau > 0$, and decision tree $T : \{0, 1\}^d \to \mathbb{R}$, we say that T is everywhere τ-influential with respect to f if for every internal node v of T we have $\mathrm{Infl}_{i(v)}(f_v) \geq \tau$. Here i(v) denotes the variable queried at v and $f_v$ denotes the restriction of f by the root-to-v path in T, i.e., the variables on the path are fixed to the corresponding branch values.

E.2 Replicable Proper Learner for Decision Tree Distributions

Our algorithm that achieves the bounds of Theorem 3.2 proceeds in a two-stage manner: we first replicably learn the decision tree structure of D, and then use the replicable uniform-distribution learner to learn f restricted to each of the leaves of the tree. To carry out the first stage, we give an algorithm that replicably learns the optimal decision tree decomposition of a distribution D:

Theorem E.7 (Replicably Learning DT Distributions). Fix $\alpha, \rho \in (0, 1)$ and $\beta \in (0, \rho/3)$. Let D be a distribution over $\{0, 1\}^d$ that is representable by a depth-ℓ decision tree. There is an algorithm that (a) is ρ-replicable with respect to D, (b) returns a depth-ℓ tree representing a distribution D' such that $\mathrm{TV}(D, D') \leq \alpha$ with probability at least $1 - \beta$ over the draw of samples, and (c) has running time and sample complexity $N = \mathrm{poly}(d, 1/\alpha, 1/\rho, \log(1/\beta)) \cdot (2\ell/\alpha)^{O(\ell)}$. For monotone distributions, the algorithm only uses random samples from D; for general distributions, it uses subcube conditional samples (cf. Definition E.12).

Proof (Theorem E.7).
The proof of this result follows directly by combining Theorem E.8, stated next, with either Lemma E.11 (for monotone distributions) or Lemma E.13 (for general distributions with subcube conditional sample access). Essentially, making the decision tree learning routine replicable boils down to estimating influences on D in a replicable manner. This is the content of the upcoming results.

Theorem E.8 (Replicable Influence Estimation ⇒ Replicable Decision Tree). Let D be a distribution that is representable by a depth-ℓ decision tree. For any $\rho' \in (0, 1)$, assume access to a ρ'-replicable algorithm rInflEst for estimating influences of distributions over $\{0, 1\}^d$ using $\mathrm{poly}(d, 1/\alpha, 1/\rho', \log(1/\beta))$ samples and runtime. Then, for any $\rho \in (0, 1)$, there is an algorithm rBuildDT (cf. Algorithm E.1) that (a) is ρ-replicable with respect to D, (b) returns a depth-ℓ tree representing a distribution D' such that $\mathrm{TV}(D, D') \leq \alpha$ with probability at least $1 - \beta$ over the draw of samples, and (c) has running time and sample complexity
$$M = \mathrm{poly}\left(d, \frac{1}{\alpha}, \frac{d(2\ell/\alpha)^{O(\ell)}}{\rho}, \log\frac{d(2\ell/\alpha)^{O(\ell)}}{\beta}\right) \cdot d(2\ell/\alpha)^{O(\ell)} + O\left(\frac{(2\ell/\alpha)^{O(\ell)}}{\alpha^2\rho^2}\log\frac{(2\ell/\alpha)^{O(\ell)}}{\beta}\right) = \mathrm{poly}(d, 1/\alpha, 1/\rho, \log(1/\beta)) \cdot (2\ell/\alpha)^{O(\ell)}.$$

Algorithm E.1 Replicably Building a Decision Tree for D
1: rBuildDT:
2: Input: access to D and $f_D$ as in Definition E.1, restriction π (cf. Definition E.5), access to rInflEst, depth ℓ, influence threshold τ, accuracy α, confidence β, replicability ρ.
3: Output: a decision tree T that minimizes $\mathbb{E}_{t \sim T}[\mathrm{Infl}((f_D)_t)]$, where $t \sim T$ ranges over the leaves (each corresponding to some restriction) of T, among all depth-ℓ, everywhere τ-influential (cf. Definition E.6) trees.
4: for $i \in [d]$ do
5:   $\mathrm{rInflEst}[(f_D)_\pi, i] \leftarrow \mathrm{rInflEst}\left((f_D)_\pi,\, i,\, \min(\tau/4, \alpha/2d),\, \beta/[2d(2\ell/\alpha)^{O(\ell)}],\, \rho/[2d(2\ell/\alpha)^{O(\ell)}]\right)$
6: end for
7: Let $S \subseteq [d]$ be the set of variables i such that $\mathrm{rInflEst}[(f_D)_\pi, i] \geq 3\tau/4$.
8: if $S = \emptyset$ or $\ell = 0$ then
9:   // Replicably estimate $p_\pi := 2^{|\pi|} \Pr_{x \sim D}[x \text{ is consistent with } \pi]$ (cf. Theorem B.4)
10:   $\hat p_\pi \leftarrow$ an $\alpha/2$-accurate, $\rho/[2(2\ell/\alpha)^{O(\ell)}]$-replicable, $\beta/[2(2\ell/\alpha)^{O(\ell)}]$-confident estimate of $p_\pi$
11:   return a new leaf node with value $\hat p_\pi$
12: else
13:   for $i \in S$ do
14:     construct the tree $T_i$ with
15:       $\mathrm{root}(T_i)$ querying $x_i$
16:       $\mathrm{leftSubtree}(T_i) \leftarrow \mathrm{rBuildDT}(D, \pi \cup \{x_i = 0\}, \ell - 1, \tau, \alpha, \beta, \rho)$
17:       $\mathrm{rightSubtree}(T_i) \leftarrow \mathrm{rBuildDT}(D, \pi \cup \{x_i = 1\}, \ell - 1, \tau, \alpha, \beta, \rho)$
18:   end for
19:   // Among $\{T_i\}_{i \in S}$, return the one that minimizes the estimated total influence
20:   // $\mathbb{E}_{t \sim T_i}\left[\sum_{j \in [d]} \mathrm{rInflEst}[(f_D)_t, j]\right]$, where $(f_D)_t = (f_D)_{\pi(t)}$
21:   // for the restriction π(t) corresponding to the root-to-t path
22:   // Let $g(t) := \sum_{j \in [d]} \mathrm{rInflEst}[(f_D)_t, j]$
23:   $i^* \leftarrow \arg\min_{i \in S} \mathbb{E}_{t \sim T_i}[g(t)] = \arg\min_{i \in S} \sum_{t \in T_i} 2^{-|t|} g(t)$
24:   return $T_{i^*}$
25: end if

Before proving Theorem E.8, we state Theorem 3 of Blanc et al. [2023], from which the correctness of Algorithm E.1 follows. Let BuildDT be the algorithm obtained from rBuildDT (cf. Algorithm E.1) with the following adjustments: (i) replace the replicable influence estimator rInflEst with any (possibly non-replicable) influence estimator, and (ii) replace the replicable mean estimator of $p_\pi$ with a simple (non-replicable) sample mean.

Proposition E.9 (Theorem 3 and Claim 5.5 in Blanc et al. [2023]). Fix $\alpha, \beta \in (0, 1)$. Let D be a distribution that is representable by a depth-ℓ decision tree. Given oracle access to an estimator for influences, the algorithm BuildDT with the choice of $\tau = \alpha/8\ell^2$ returns a depth-ℓ tree representing a distribution D' such that $\mathrm{TV}(D, D') \leq \alpha$ with probability at least $1 - \beta$.
If the oracle terminates in unit time, the running time of the algorithm is $d \cdot (\ell/\alpha)^{O(\ell)}$.

Remark E.10. A few remarks are in order regarding Algorithm E.1.

1. Blanc et al. [2023] identify the task of learning a decision tree for a distribution D with learning its scaled pmf $f_D = 2^d \cdot D$. In order to remain consistent with their convention, the values at the tree leaves produced by our replicable decision tree algorithm (cf. Algorithm E.1) also correspond to the scaled pmf. Note that we do not directly use the values at the leaves, so this convention is immaterial to our application.

2. Estimating the influences to an accuracy of τ/4 ensures that every variable in S has influence at least τ/2 and that every variable with influence at least τ is captured in S. Estimating the influences to an accuracy of α/2d ensures that the estimated total influences are accurate up to error at most α. Note that the computation of $\mathbb{E}_{t \sim T_i}[g(t)]$ (cf. Definition D.6) does not incur additional error, since the weights of the sum are appropriately chosen.

3. $p_\pi$ is the mean of a $[0, 2^{|\pi|}]$-bounded random variable. By a Hoeffding bound and Theorem B.4, consuming $\mathrm{poly}(2^\ell, 1/\alpha, 1/\rho', \log(1/\beta'))$ samples from D suffices to estimate this quantity up to $\alpha/[2 \cdot 2^{|\pi|}]$ accuracy, ρ'-replicability, and β'-confidence.

4. Taking into account the estimation error of the influences, each variable in S has influence at least τ/2. For any depth-ℓ decision tree, the sum of all variable influences is at most 2ℓ (cf. Fact E.3). For any restriction π, $(f_D)_\pi$ is a depth-ℓ decision tree and thus has at most 4ℓ/τ variables of influence at least τ/2. For the choice of $\tau = \alpha/8\ell^2$, there can be at most $(4\ell/\tau)^\ell = (2\ell/\alpha)^{O(\ell)}$ recursive calls within Algorithm E.1. Each recursive call estimates d influences and at most 1 mean of a bounded random variable. All in all, we need to replicably estimate $d \cdot (2\ell/\alpha)^{O(\ell)}$ variable influences and at most $(2\ell/\alpha)^{O(\ell)}$ bounded means.

We are now ready to prove Theorem E.8.

Proof (Theorem E.8). By assumption, for any $\rho' \in (0, 1)$, rInflEst is ρ'-replicable and returns α-accurate estimates for influences with confidence β using $\mathrm{poly}(d, 1/\alpha, \log(1/\beta), 1/\rho')$ samples and runtime. The replicability of Algorithm E.1 follows directly from the replicability of the influence estimators and the replicability of the standard rounding scheme (cf. Theorem B.4). By Remark E.10 and a union bound, it suffices to call the replicable influence estimator and the replicable mean estimator for $p_\pi$ with replicability parameter $\rho' = \rho/[2d(2\ell/\alpha)^{O(\ell)}]$, and similarly for the confidence parameter. The sample complexity follows from the remark above. Let us now condition on the event that each call to rInflEst and each estimation of expected total influence is successful. Conditioned on this event, the correctness of the algorithm rBuildDT follows directly from Proposition E.9: indeed, conditioned on every call to rInflEst and every replicable estimate of $p_\pi$ being successful, our algorithm reduces to the BuildDT algorithm of Blanc et al. [2023].

E.3 Replicable Influence Estimator for Monotone Marginals

In this section, we provide the main subroutine of our replicable DT distribution learning algorithm. We begin with a replicable influence estimator for monotone distributions.

Lemma E.11 (Replicable Influence Estimator for Monotone Marginals). Fix $i \in [d]$.
For any $\alpha, \beta, \rho \in (0, 1)$, there is an efficient ρ-replicable algorithm $\mathrm{rInflEst}(D, i, \alpha, \beta, \rho)$ that, given an unknown monotone distribution D over $\{0, 1\}^d$, computes an estimate of $\mathrm{Infl}_i(f_D)$ up to accuracy α with probability at least $1 - \beta$, using $O(\alpha^{-2}\rho^{-2}\log(1/\beta))$ time and random samples from D.

Proof (Lemma E.11). By Proposition E.4, since D is monotone, we know that $\mathrm{Infl}_i(f_D) = \mathbb{E}_D[x_i]$. Thus we can replicably estimate this expectation with $O(\alpha^{-2}\rho^{-2}\log(1/\beta))$ samples and sample-polynomial runtime through replicable query rounding (cf. Theorem B.4).

E.4 Replicable Influence Estimator for Arbitrary Distributions

We further design a replicable influence estimator for arbitrary distributions D over $\{0, 1\}^d$ using subcube conditional sampling.

Definition E.12. A (subcube) conditional sampling oracle for a distribution D over $\{0, 1\}^d$ receives as input a subset (subcube) $S \subseteq \{0, 1\}^d$ and generates a sample from the conditional distribution
$$D_S(x) := D(x \mid S) = \frac{D(x)\,\mathbb{1}\{x \in S\}}{D(S)}.$$
This conditional distribution is essentially the truncated distribution $D_S$ with truncation set S. The subcube conditioning oracle is well studied in prior work on distribution testing and learning [Canonne et al., 2015, 2021, Fotakis et al., 2022, 2020, Gouleakis et al., 2017].

Algorithm E.2 Replicable Influence Estimation for Arbitrary Distributions via Conditional Sampling
1: rInflEst:
2: Input: distribution D over $\{0, 1\}^d$ and $f_D$ as in Definition E.1, coordinate $i \in [d]$, access to a conditional sampling oracle for D, desired accuracy α, desired confidence β, desired replicability ρ.
3: Output: an estimate v of $\mathrm{Infl}_i(f_D)$ such that $|v - \mathrm{Infl}_i(f_D)| \leq \alpha$ with probability $1 - \beta$.
4: $n \leftarrow O(\alpha^{-2}\rho^{-2}\log(1/\beta))$
5: for $j = 1, \dots, n$ do
6:   Draw a fresh random sample $x^{(j)} \sim D$
7:   $S^{(j)} \leftarrow \{x^{(j)}\} \cup \{(x^{(j)})^{\oplus i}\}$, where $(x^{(j)})^{\oplus i}$ is $x^{(j)}$ with its i-th coordinate flipped
8:   Draw $O(\alpha^{-2})$ independent samples from the conditional distribution $D(\cdot \mid S^{(j)})$
9:   Let $p^{(j)} \in [0, 1]$ be the fraction of them that equal $x^{(j)}$
10:   $q^{(j)} \leftarrow |p^{(j)} - (1 - p^{(j)})|$
11: end for
12: $q \leftarrow \frac{1}{n}\sum_{j \in [n]} q^{(j)}$
13: return $v \leftarrow \mathrm{rRound}(q, O(\alpha\rho), \rho)$ (see Theorem B.4 with T = 1)

An adaptation of the proof of Proposition 6.5 in Blanc et al. [2023], combined with the replicability properties of Theorem B.4, gives the following result, which we provide for completeness but do not use in our applications. The algorithm is presented in Algorithm E.2.

Lemma E.13 (Replicable Influence Estimator for Arbitrary Marginals). Let D be a distribution over $\{0, 1\}^d$, let $f_D$ be as in Definition E.1, and assume sample access to a subcube conditional oracle as in Definition E.12. For any $\alpha, \beta, \rho \in (0, 1)$ and $i \in [d]$, Algorithm E.2 is ρ-replicable and, given $O(\alpha^{-4}\rho^{-2}\log(1/\beta))$ subcube conditional samples from D, computes in sample-polynomial time an estimate v that satisfies $|v - \mathrm{Infl}_i(f_D)| \leq \alpha$ with probability at least $1 - \beta$.

As a proof sketch, the correctness of the algorithm follows from the observation that for any distribution D, coordinate i, accuracy $\alpha \in (0, 1)$, and execution $j \in [n]$, it holds that $|\mathbb{E}[q^{(j)}] - \mathrm{Infl}_i(f_D)| \leq \alpha$. Hence, to obtain a high-probability estimator, it suffices to obtain $O(\alpha^{-2}\log(1/\beta))$ copies of $q^{(j)}$ and take their average q. Finally, to make the algorithm replicable, it suffices to estimate the average q replicably; this can be accomplished at an extra cost of order $1/\rho^2$ using the standard rounding routines (cf. Theorem B.4). Since each iteration requires $1/\alpha^2$ subcube samples, the sample complexity of our algorithm follows.
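The following is a minimal Python sketch of Algorithm E.2. The oracle interfaces sample() and cond_sample(), the helper flip_bit(), and the hidden constants are our illustrative assumptions, not part of the cited works; r_round is the rounding sketch from Appendix B.3.

```python
# Sketch of replicable influence estimation via subcube conditional sampling
# (Algorithm E.2), at the O(.)-level: constants are placeholders.
import math

def r_infl_est(sample, cond_sample, i, alpha, beta, rho, shared_rng):
    """Replicably estimate Infl_i(f_D) given draws sample() ~ D and a
    conditional oracle cond_sample(S) ~ D(. | S) for two-point subcubes S."""
    n = math.ceil(math.log(2.0 / beta) / (alpha ** 2 * rho ** 2))  # outer reps
    inner = math.ceil(1.0 / alpha ** 2)                            # draws per pair
    q_sum = 0.0
    for _ in range(n):
        x = sample()                      # fresh draw x ~ D
        S = (x, flip_bit(x, i))           # two-point subcube {x, x^{(+)i}}
        hits = sum(1 for _ in range(inner) if cond_sample(S) == x)
        p = hits / inner                  # empirical conditional mass of x in S
        q_sum += abs(p - (1.0 - p))       # the quantity q^(j) of line 10
    q = q_sum / n
    # Replicable rounding of the average (Theorem B.4 with T = 1),
    # using the r_round sketch from Appendix B.3.
    return r_round([q], alpha * rho, rho, shared_rng)[0]

def flip_bit(x, i):
    # Hypothetical helper: x is a tuple of 0/1 entries; flip coordinate i.
    return x[:i] + (1 - x[i],) + x[i + 1:]
```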
E.5 Useful Subroutines & Results

In this section, we provide a set of useful results that we will use in our proofs.

Proposition E.14 (Corollary D.21 in Esfandiari et al. [2023b]). Let $\alpha, \rho \in (0, 1)$ and $\beta \in (0, \rho/3)$. There is a ρ-replicable algorithm $\mathrm{rFiniteDistrEst}(D, \alpha, \beta, \rho)$ that outputs parameter estimates $\hat p$ for a finite distribution D of support size N such that (a) $|\hat p(i) - p(i)| \leq \alpha$ for every $i \in [N]$ with probability at least $1 - \beta$, (b) $\hat p(i) \geq 0$ for all $i \in [N]$, and (c) $\sum_i \hat p(i) = 1$. Moreover, the algorithm has sample complexity $m = O\left(\frac{\ln(1/\beta) + N}{\alpha^2\rho^2}\right)$ and $\mathrm{poly}(m)$ time complexity.

Proposition E.15 ([Janson, 2018]). Let $Y_1, \dots, Y_m$ be geometric random variables with success rate q, so that $Y := \sum_{i=1}^{m} Y_i$ is the number of draws until we obtain m successes. Then, with probability at least $1 - \beta$, it holds that $Y = O((m/q)\log(1/\beta))$. In other words, it suffices to perform $O((m/q)\log(1/\beta))$ Poisson trials to succeed m times with probability at least $1 - \beta$.

Proposition E.16 (Replicable Boosting of the Success Probability). Let $\alpha, \rho \in (0, 1)$, $\beta \in (0, \rho/3)$, and $\Delta \geq 0$ (we write Δ for the excess-error parameter whose symbol was suppressed in extraction). Suppose A is a ρ'-replicable $(\alpha + \Delta, 1/2)$-PAC learner for the concept class C under a fixed distribution D over $\{0, 1\}^d$ that uses $m(\alpha, \rho') = \mathrm{poly}(d, 1/\alpha, 1/\rho')$ samples, for any $\rho' \in (0, 1)$. Then there is a ρ-replicable $(2\alpha + \Delta, \beta)$-PAC learner $\mathrm{rBoost}_A(D, \alpha, \beta, \rho)$ for C under D, for any $\alpha, \beta, \rho \in (0, 1)$, with sample complexity
$$m\left(\alpha, \frac{\rho}{2\log(1/\beta)}\right) \cdot O(\log(1/\beta)) + O\left(\frac{\log^2(1/\beta)}{\alpha^2\rho^2}\log\frac{\log(1/\beta)}{\beta}\right) = \mathrm{poly}(d, 1/\alpha, 1/\rho, \log(1/\beta))$$
and sample-polynomial running time.

Proof. Let $\mathrm{rBoost}_A$ be the algorithm that runs A for $n = O(\log 1/\beta)$ times (each with $m = m(\alpha, \rho/2n)$ samples), replicably estimates the population error of each resulting hypothesis, and outputs the hypothesis with the lowest estimated error. Running A for $n = O(\log 1/\beta)$ times with m samples per run guarantees that each execution is ρ/2n-replicable, so the first step is ρ/2-replicable. Moreover, at least one of the hypotheses is $(\alpha + \Delta)$-close to the optimal hypothesis with probability at least $1 - \beta/2$. By a Hoeffding bound on the empirical error of a hypothesis, we can estimate the population error of a hypothesis to an accuracy of $\alpha\rho/(4 \cdot 2n)$ and confidence β/2n using $O(n^2\alpha^{-2}\rho^{-2}\log(n/\beta))$ samples. By Theorem B.4, we can make this estimate ρ/2n-replicable at the cost of reducing the accuracy to α, while maintaining the confidence parameter β/2n. It follows that the output hypothesis is ρ-replicable and, with probability at least $1 - \beta$, its population error is at most $2\alpha + \Delta$.

E.6 Proof of Theorem 3.2

In this section, we show how to lift replicable algorithms that learn over the uniform distribution to replicable algorithms that learn with respect to arbitrary distributions, where the sample complexity and running time depend on the decision tree complexity of the target distribution. Let $D(\cdot \mid T)$ denote the induced distribution over leaves, i.e., $D(t \mid T) := \sum_{x \in t} D(x)$. Now, our learning algorithm is designed for an exact tree distribution; however, we can only estimate the tree representation of the actual distribution, i.e., the conditional distribution at a leaf subcube is uniform in the tree representation but not necessarily uniform in the input distribution. To show that our learning algorithm can accommodate slight errors in the tree distribution estimation, we introduce the following definition.

Definition E.17 (Robust Learning).
For any concept class C and algorithm A, we say that A (α, β, c)-robustly learns C using m samples under the uniform distribution if for any $\eta > 0$ and for the class of distributions
$$\mathcal{D}_\eta = \{\text{distribution } D \text{ over } \{0, 1\}^d : \mathrm{TV}(U, D) \leq \eta\},$$
it holds that A $(\alpha + c\eta, \beta)$-learns C using m samples with respect to every distribution in $\mathcal{D}_\eta$.

Proposition E.18 (Proposition 7.2 from [Blanc et al., 2022b]). For any concept class C and algorithm A, if A (α, β)-PAC learns C using m samples under the uniform distribution, then the exact same algorithm A also $(\alpha, \beta + 1/3, 3m)$-robustly learns C using m samples.

Algorithm E.3 Replicable Lifting
1: rLift:
2: Input: decision tree T (from rBuildDT), algorithm A for ρ'-replicable (α', β')-learning under the uniform distribution using $m(\alpha', \rho', \beta')$ samples, distribution D, accuracy α, confidence β, replicability ρ.
3: Output: a hypothesis $h : \{0, 1\}^d \to \{0, 1\}$
4:
5: $M_1 \leftarrow \mathrm{poly}(2^\ell, 1/\alpha, 1/\rho, \log(1/\beta))$
6: $M_2 \leftarrow m(\alpha/6, \rho/2^\ell, 1/6) \cdot \mathrm{poly}(d, 2^\ell, 1/\alpha, 1/\rho, \log(1/\beta))$
7: // Estimate the leaf distribution (cf. Proposition E.14)
8: $\hat D(\cdot \mid T) \leftarrow \mathrm{rFiniteDistrEst}(D(\cdot \mid T), \alpha/(12 \cdot 2^\ell), \beta/2, \rho/3)$ using $M_1$ samples
9: Draw a dataset $S \sim D^{M_2}$ of size $M_2$.
10: for each leaf $t \in T$ do
11:   if $\hat D(t \mid T) \geq \alpha/(4 \cdot 2^\ell)$ then
12:     Let $S_t$ be the subset of samples in S that reach t.
13:     Create a set $S'_t$ consisting of the points in $S_t$ but with all coordinates queried on the root-to-leaf path for t re-randomized independently, making the marginal distribution on those coordinates uniform.
14:     For $\rho' \in (0, 1)$, choose the parameters of A so that $A = A(\alpha/6, \rho', 1/6)$ is ρ'-replicable and $(\alpha/6, 1/6 + 1/3, c')$-robustly learns C.
15:     Get $h_t$ by calling $\mathrm{rBoost}_A(S'_t, \alpha/6, \rho/(3 \cdot 2^\ell), \beta/(2 \cdot 2^\ell))$ (cf. Proposition E.16).
16:   else
17:     Let $h_t$ output a random guess according to the shared randomness.
18:   end if
19: end for
20: Return the hypothesis h that, given input x, finds the leaf $t \in T$ that x follows and outputs $h_t(x)$.

We are now ready to state the main result of this section, from which the proof of Theorem 3.2 closely follows.

Theorem E.19. Consider a concept class C of functions $f : \{0, 1\}^d \to \{0, 1\}$ closed under restrictions. Fix $\alpha, \beta, c, \rho > 0$ and $m, \ell \in \mathbb{N}$. Suppose we are provided black-box access to an algorithm A that (i) is ρ'-replicable with respect to the uniform distribution for any $\rho' \in (0, 1)$, (ii) $(\alpha', \beta' + 1/3, c)$-robustly learns C for any $\alpha', \beta' \in (0, 1)$, and (iii) consumes $m(\alpha', \rho', \beta') = \mathrm{poly}(d, 1/\alpha', 1/\rho', \log(1/\beta'))$ samples and computation time under the uniform distribution. Let $M_1 = \mathrm{poly}(2^\ell, 1/\alpha, 1/\rho, \log(1/\beta))$, $M_2 = m(\alpha/6, \rho/2^\ell, 1/6) \cdot \mathrm{poly}(d, 2^\ell, 1/\alpha, 1/\rho, \log(1/\beta))$, and $M := M_1 + M_2 \leq \mathrm{poly}(d, 2^\ell, 1/\alpha, 1/\rho, \log(1/\beta))$. For any function $f^* \in C$, distribution D over $\{0, 1\}^d$, and depth-ℓ decision tree T computing the pmf of a distribution $D_T$ with $\mathrm{TV}(D, D_T) \leq \alpha/6c$, the following holds: the algorithm $\mathrm{rLift}(T, A, D, \alpha, \beta, \rho)$ (cf. Algorithm E.3) is ρ-replicable, has O(M) sample complexity, terminates in $\mathrm{poly}(M)$ time, and its output is α-close to $f^*$ with respect to D with probability at least $1 - \beta$.

Before we prove Theorem E.19, we state two useful lemmas.

Lemma E.20. Fix $\alpha, \rho, \rho', \beta \in (0, 1)$, $\Delta \geq 0$, and $\ell \in \mathbb{N}$. Let $m_{\mathrm{rBoost}} = \mathrm{poly}(d, 2^\ell, 1/\alpha, 1/\rho, \log(1/\beta))$ be the number of samples that rBoost (cf. Proposition E.16) requires to boost a ρ'-replicable $(\alpha/6 + \Delta, 1/2)$-correct learner (that uses $\mathrm{poly}(d, 1/\alpha, 1/\rho')$ samples and running time) to a $\rho/(3 \cdot 2^\ell)$-replicable learner with accuracy $\alpha/6 + \Delta + \alpha/6 = \alpha/3 + \Delta$ and confidence $\beta/(2 \cdot 2^\ell)$. Condition on the success of rFiniteDistrEst in Algorithm E.3. Then, with probability at least $1 - \beta/2$, for every $t \in T$ with $\hat D(t \mid T) \geq \alpha/(4 \cdot 2^\ell)$, we observe at least $m_{\mathrm{rBoost}}$ samples reaching t in S.
Proof. Condition on the success of the call to rFiniteDistrEst. Then for every $t \in T$ with $\hat D(t \mid T) \geq \alpha/(4 \cdot 2^\ell)$, we have $D(t \mid T) \geq \alpha/(6 \cdot 2^\ell)$. The probability of not observing such a t in a dataset of size n is at most $(1 - \alpha/(6 \cdot 2^\ell))^n \leq \exp(-n\alpha/(6 \cdot 2^\ell))$. By a union bound over the at most $6 \cdot 2^\ell/\alpha$ such t, the probability of failing to collect at least one sample from each such t is at most $(6 \cdot 2^\ell/\alpha)\exp(-n\alpha/(6 \cdot 2^\ell))$. To drive this below the constant 1/2, it suffices to take $n = O((\ell \cdot 2^\ell/\alpha)\log(1/\alpha))$. By Proposition E.15, it then suffices to take a sample of size $O(n \cdot m_{\mathrm{rBoost}} \cdot \log(1/\beta))$ to collect $m_{\mathrm{rBoost}}$ samples from each such t with probability at least $1 - \beta/2$.

Lemma E.21 (Lemma B.4 from [Blanc et al., 2022a], Fact 7.4 from [Blanc et al., 2023]). For any distribution D and decision tree T computing the pmf of another distribution $D_T$,
$$\sum_{t \in T} \Pr_{x \sim D}[x \text{ reaches } t] \cdot \mathrm{TV}(D_t, (D_T)_t) \leq 2\,\mathrm{TV}(D, D_T),$$
where $(D_T)_t$ (resp. $D_t$) is the conditional of $D_T$ (resp. D) on the subcube induced by the leaf t. Note that in the statement of the lemma above, $\Pr_{x \sim D}[x \text{ reaches } t] = D(t \mid T)$ and $(D_T)_t$ is the uniform distribution on the subcube represented by the leaf t.

We are finally ready to prove Theorem E.19.

Proof (Theorem E.19). We first note that the sample complexity of the call to the routine rFiniteDistrEst is $M_1 = \mathrm{poly}(2^\ell, 1/\alpha, 1/\rho, \log(1/\beta))$ (cf. Proposition E.14). Throughout this proof, we condition on the probability-$(1 - \beta/2)$ event that rFiniteDistrEst succeeds. Let h be the output of $\mathrm{rLift}(T, A, S)$. We now argue separately about the accuracy and the replicability of the algorithm.

Accuracy. The accuracy of the learner $h = \mathrm{rLift}(T, A, S)$ is analyzed as follows. For any leaf t of the tree T, let $h_t$ be the associated predictor. We have that
$$\Pr_{x \sim D}[h(x) \neq f^*(x)] = \sum_{t \in T} \Pr_{x \sim D}[x \text{ reaches } t] \cdot \Pr_{x \sim D_t}[h_t(x) \neq f^*(x)].$$
Each hypothesis $h_t$ is obtained either by running $\mathrm{rBoost}_A$ (cf. Proposition E.16) on the sample $S'_t$, when $\hat D(t \mid T) \geq \alpha/(4 \cdot 2^\ell)$, or corresponds to a random guess otherwise. From the choice of estimation error in the call to rFiniteDistrEst, each leaf t on which we output the random-guess hypothesis must satisfy $\Pr_{x \sim D}[x \text{ reaches } t] < \alpha/(3 \cdot 2^\ell)$. The overall error contribution of such leaves is thus at most α/3. It follows that the total error is at most
$$\Pr_{x \sim D}[h(x) \neq f^*(x)] \leq \sum_{t \in T : \, \hat D(t \mid T) \geq \alpha/(4 \cdot 2^\ell)} \Pr_{x \sim D}[x \text{ reaches } t] \cdot \Pr_{x \sim D_t}[h_t(x) \neq f^*(x)] + \frac{\alpha}{3}.$$
Fix a $t \in T$ on which we run $\mathrm{rBoost}_A$. Recall that $D_t$ and $(D_T)_t$ denote the conditional distributions on the leaf subcube over the coordinates not fixed by $\pi = \pi(t)$, where π is the restriction corresponding to t; here the underlying distributions are the input distribution D and the tree distribution $D_T$, respectively. Let $D'_t, (D_T)'_t$ denote the distributions over $\{0, 1\}^d$ obtained from $D_t, (D_T)_t$ by re-randomizing the coordinates in π. Then $(D_T)'_t$ is precisely the uniform distribution U over $\{0, 1\}^d$, and, since $D'_t$ and $(D_T)'_t$ assign to each of the $2^{|\pi(t)|}$ settings of the re-randomized coordinates a $2^{-|\pi(t)|}$ fraction of the corresponding conditional mass, i.e., $D'_t = 2^{-|\pi(t)|} D_t$ and $(D_T)'_t = 2^{-|\pi(t)|} (D_T)_t$ on each such copy, it holds that
$$\mathrm{TV}(D'_t, (D_T)'_t) = \frac{1}{2}\sum_{x \in \{0, 1\}^d} |D'_t(x) - (D_T)'_t(x)| = \frac{1}{2} \cdot 2^{|\pi(t)|} \cdot 2^{-|\pi(t)|}\sum_{x \in t} |D_t(x) - (D_T)_t(x)| = \mathrm{TV}(D_t, (D_T)_t).$$
Based on the re-sampling step of the algorithm, we know that every point in $S'_t$ is an i.i.d. sample from $D'_t$ labeled by the function $f^*_t$, which lies in C thanks to closure under restrictions. By $(\alpha/6, 1/6 + 1/3, c)$-robust learnability, A in fact $(\alpha/6 + c \cdot \mathrm{TV}(D'_t, U), 1/2)$-PAC learns C under the distribution $D'_t$.
We now apply Lemma E.20 with $\Delta = c \cdot \mathrm{TV}(D'_t, U)$: the leaves on which we run $\mathrm{rBoost}_A$ receive at least $m_{\mathrm{rBoost}}$ samples, where $m_{\mathrm{rBoost}} = \mathrm{poly}(d, 2^\ell, 1/\alpha, 1/\rho, \log(1/\beta))$ is the number of samples that rBoost requires to boost a ρ'-replicable $(\alpha/6 + c \cdot \mathrm{TV}(D'_t, U), 1/2)$-correct learner (that uses $\mathrm{poly}(d, 1/\alpha, 1/\rho')$ samples and running time) to a $\rho/(3 \cdot 2^\ell)$-replicable learner with accuracy $\alpha/6 + c \cdot \mathrm{TV}(D'_t, U) + \alpha/6 = \alpha/3 + c \cdot \mathrm{TV}(D'_t, U)$ and confidence $\beta/(2 \cdot 2^\ell)$. Then, by Proposition E.16, with probability at least $1 - \beta/(2 \cdot 2^\ell)$ over the data $S'_t$, the hypothesis $h_t$ output by $\mathrm{rBoost}_A$ satisfies
$$\Pr_{x \sim D_t}[h_t(x) \neq f^*(x)] \leq \alpha/3 + c \cdot \mathrm{TV}(D'_t, U) = \alpha/3 + c \cdot \mathrm{TV}(D_t, (D_T)_t).$$
All in all, combined with the expression for the total error above, this means that with probability at least $1 - \beta$,
$$\Pr_{x \sim D}[h(x) \neq f^*(x)] \leq 2\alpha/3 + c\sum_{t \in T} \Pr_{x \sim D}[x \text{ reaches } t] \cdot \mathrm{TV}(D_t, (D_T)_t).$$
We finish by applying Lemma E.21, which ensures that the sum over the leaves above is upper bounded by $2\,\mathrm{TV}(D, D_T) \leq \alpha/3c$, so the total misclassification error is at most α with probability at least $1 - \beta$.

Replicability. We assume we are given the same decision tree in both executions. Next, we note that rFiniteDistrEst is ρ/3-replicable and succeeds in both executions with probability at least $1 - 2 \cdot \beta/2 \geq 1 - \rho/3$. Conditional on the events above, it suffices to show that, for any leaf of the tree, the algorithm $\mathrm{rBoost}_A$ replicably outputs a hypothesis. Since there are at most $2^\ell$ leaves and each call of $\mathrm{rBoost}_A$ has replicability parameter $\rho/(3 \cdot 2^\ell)$, the entire procedure is ρ-replicable.

Before moving on to the proof of Theorem 3.2, we state one final lemma.

Lemma E.22. For any concept class C and algorithm A, if A ρ-replicably (α, β)-learns C using m samples under the uniform distribution, then the exact same algorithm A also ρ-replicably $(\alpha, \beta + 1/3, 3m)$-robustly learns C using m samples under the uniform distribution.

Proof (Lemma E.22). The proof follows directly from the analysis of Proposition 7.2 in Blanc et al. [2023] (cf. Proposition E.18). The replicability of the algorithm is trivially preserved since the algorithm does not change.

We can now prove Theorem 3.2, whose statement is given in Appendix D.2.

Proof (Theorem 3.2). Using Theorem E.19, we can prove Theorem 3.2 as follows.
First, we use Theorem E.7 to replicably learn the input distribution to total variation distance $\alpha/(6 \cdot 3m)$ with a decision tree. Next, Lemma E.22 asserts that A is ρ-replicable and $(\alpha, \beta + 1/3, 3m)$-robustly learns C under the uniform distribution. The result follows by applying Theorem E.19.

F Efficient Replicability and Private Learning

F.1 Replicably Learning Finite Classes

We first state a useful result from Bun et al. [2023], which we later leverage as a subroutine.

Theorem F.1 (Replicable Learner for Finite Classes; Theorem 5.13 in [Bun et al., 2023]). Let $\alpha, \rho, \beta \in (0, 1)$. Any finite concept class C is replicably learnable in the agnostic setting with
$$m(\alpha, \rho, \beta) = O\left(\frac{\log^2|C| + \log\frac{1}{\rho\beta}}{\alpha^2\rho^2}\log^3\frac{1}{\rho}\right)$$
samples. Moreover, the running time of the algorithm is polynomial in the number of samples and the cardinality of C.

F.2 A Transformation from Pure DP to Replicability

Our approach borrows some high-level ideas from Gonen et al. [2019], who showed a transformation from a pure DP PAC learner to an online learner. In a nutshell, our algorithm works as follows (see also the code sketch at the end of this section): 1) First, we create a dummy input dataset $\tilde S$ of some constant size¹³ and run the algorithm $\tilde\Theta\left(\frac{\log^3(1/\beta)}{\rho^2}\right)$ times on $\tilde S$, resampling its internal randomness every time. 2) With probability $1 - \beta$, one of the resulting hypotheses has error at most 3/8 [Beimel et al., 2013, Gonen et al., 2019]. 3) We run the replicable agnostic learner for finite classes from Bun et al. [2023] on the finite class of resulting hypotheses. 4) We boost the weak replicable learner using Impagliazzo et al. [2022], Kalavasis et al. [2023].

We begin by presenting an adaptation of a result that appeared in Beimel et al. [2013], Gonen et al. [2019]. Essentially, it states that when a class is learnable by a pure DP learner, one can fix an arbitrary input $\tilde S$ of constant size, execute the learner on this dataset a constant number of times by resampling its internal randomness, and then, with some constant probability, e.g., 15/16, the output will contain one hypothesis whose error is bounded away from 1/2 by some constant, e.g., 1/4. This result might seem counter-intuitive at first glance, but learnability by a pure DP learner is a very strong property of a class, and the result crucially depends on this property. Below, we present an adaptation of this result which states that the weak learner can be made ρ-replicable with success probability $1 - \beta'$. An important element of our derivation is a result from Bun et al. [2023] about the sample and computational complexity of agnostically learning a concept class using a replicable algorithm (cf. Theorem F.1). We are now ready to state a key lemma for our transformation.

Lemma F.2 (From Pure DP Learner to Replicable Weak Learner). Let X be some input domain, $Y = \{0, 1\}$, and $D_{XY}$ a distribution on $X \times Y$ that is realizable with respect to some concept class C. Let A be a pure DP learner such that for any $\alpha, \varepsilon, \beta \in (0, 1)$, A needs $m(\alpha, \varepsilon, \beta, C)$ i.i.d. samples from $D_{XY}$ to output, with probability at least $1 - \beta$, a hypothesis with error at most α in an ε-DP way. Let us set $m_0 = m(1/4, 1/10, 1/2, C)$. Then, for any $\rho, \beta' \in (0, 1)$, there is a ρ-replicable learner A' that outputs a hypothesis with error at most 3/8 with probability at least $1 - \beta'$, and requires $\tilde O\left(\mathrm{poly}(m_0) \cdot \frac{\log^3(1/\beta')}{\rho^2}\right)$ i.i.d. samples from $D_{XY}$ and $O(\exp(m_0)\log(1/\beta'))$ oracle calls to A (and hence runtime).

Proof (Lemma F.2). We argue about the correctness and the replicability of the algorithm separately.

Correctness. Since $D_{XY}$ is realizable, there exists some $c^* \in C$ such that $\mathrm{loss}_{D_{XY}}(c^*) = 0$.
Proof (Lemma F.2). We argue about the correctness and the replicability of the algorithm separately.

Correctness. Since D_XY is realizable, there exists some c⋆ ∈ C such that loss_{D_XY}(c⋆) = 0. Let

$$ \mathrm{range}(A) = \left\{ c : X \to \{0, 1\} \;\middle|\; \exists S \in \bigcup_{n \in \mathbb{N}} (X \times \{0, 1\})^n, \; \exists r \in \bigcup_{n \in \mathbb{N}} \{0, 1\}^n \text{ s.t. } A(S; r) = c \right\}. $$

Let also m₀ = m(1/4, 1/10, 1/2, C) be the number of samples A needs to achieve α = 1/4, ε = 1/10, β = 1/2 for the class C. Notice that m₀ is a constant with respect to α, β, ρ, ε. Let also

$$ C(D_{XY}) = \{ c \in \mathrm{range}(A) : \mathrm{loss}_{D_{XY}}(c) \le 1/4 \}. $$

By definition, with probability at least 1/2 over the random draw of an i.i.d. sample of size m₀ from D_XY and the internal randomness of A, the output of the algorithm belongs to C(D_XY). Thus, there exists a sample S₀ ∈ (X × {0, 1})^{m₀} such that

$$ \Pr_{r \sim R}[A(S_0; r) \in C(D_{XY})] \ge 1/2. $$

Although it is unclear how to identify such an S₀, notice that any sample S ∈ (X × {0, 1})^{m₀} is an m₀-neighbor of S₀. Since A is pure DP, a group-privacy argument (spelled out after the proof) gives

$$ \Pr_{r \sim R}[A(S; r) \in C(D_{XY})] \ge \frac{1}{2} e^{-0.1 m_0}. $$

In particular, this holds for the dummy dataset S̄ we fix before observing any data. Then, if we sample N = 2 e^{0.1 m₀} log(3/β′) = Θ(exp(m₀) log(1/β′)) random strings r₁, . . . , r_N i.i.d. from R, the probability that none of the hypotheses {A(S̄; rᵢ)}_{i ∈ [N]} lies in C(D_XY) is at most

$$ \left( 1 - \frac{1}{2} e^{-0.1 m_0} \right)^{2 e^{0.1 m_0} \log(3/\beta')} \le \frac{\beta'}{3}. $$

We denote by E the event that one of the outputs is in C(D_XY) and condition on it for the rest of the correctness proof. The next step is to replicably learn the finite concept class Ĉ = {A(S̄; r₁), . . . , A(S̄; r_N)} we have constructed, using Theorem F.1 with accuracy 1/8, confidence β′/3, and replicability ρ. Thus,

$$ O\left( \frac{\log^2 N + \log \frac{1}{\beta' \rho}}{\rho^2} \log^3 \frac{1}{\rho} \right) = \widetilde{O}\left( \frac{m_0^2 \log(1/\beta')}{\rho^2} \right) $$

samples suffice for this task. Let E′ be the event that this estimation is correct, which happens with probability at least 1 − β′/3. Conditioned on that event, outputting the hypothesis generated by the learner gives a solution with error at most 1/4 + 1/8 = 3/8. Notice that the good events jointly hold with probability at least 1 − 2β′/3 ≥ 1 − β′, so the correctness condition is satisfied.

Replicability. Notice that the first step of the algorithm, where we fix a dummy dataset S̄ of size m₀ and run the algorithm on i.i.d. draws of its internal randomness, is trivially replicable, since the dataset and the randomness are the same across the two executions. Then, by the guarantees of Theorem F.1, the output of the algorithm is the same across two executions with probability at least 1 − ρ. Thus, overall, with probability at least 1 − ρ the output of our algorithm is the same across the two executions.

We emphasize that the number of oracle calls to the DP algorithm is constant with respect to the confidence and the correctness parameters, but could, potentially, be exponential in a parameter that depends on the representation of the underlying concept class C.
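For completeness, the group-privacy step used in the proof above is the standard hybrid argument. Writing S = S⁽⁰⁾, S⁽¹⁾, . . . , S⁽ᵐ⁰⁾ = S₀ for a sequence of datasets in which consecutive members differ in a single entry, and applying the pure ε-DP guarantee m₀ times, we get, for any set of outcomes T,

$$ \Pr_{r \sim R}[A(S; r) \in T] \;\ge\; e^{-\varepsilon} \Pr_{r \sim R}[A(S^{(1)}; r) \in T] \;\ge\; \cdots \;\ge\; e^{-\varepsilon m_0} \Pr_{r \sim R}[A(S_0; r) \in T]. $$

Taking T = C(D_XY), ε = 1/10, and using the lower bound of 1/2 for S₀ yields the factor (1/2) e^{−0.1 m₀} invoked in the proof.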
Equipped with the weak learner from Lemma F.2, we can boost its error parameter using the boosting algorithm that appears in Impagliazzo et al. [2022]. For completeness, we state the result below.

Theorem F.3 (Replicable Boosting Algorithm [Impagliazzo et al., 2022]). Let D_XY be the joint distribution over labeled examples as in Lemma F.2. Fix α, ρ, β > 0. Let A be a ρ-replicable (γ, β)-weak learner with sample complexity m(γ, ρ, β), i.e., its error is at most 1/2 − γ with probability at least 1 − β. Then, there exists an efficient ρ-replicable boosting algorithm A′ such that with probability at least 1 − β it outputs a hypothesis h with loss_{D_XY}(h) ≤ α. The algorithm runs for T = Õ(log(1/β)/(αρ²)) rounds and requires T oracle calls to the weak learner. Moreover, it requires

$$ \widetilde{O}\left( \frac{m(\gamma, \rho/6T, \beta)}{\alpha^2 \gamma^2} + \frac{1}{\rho^2 \alpha^3 \gamma^2} \right) $$

many samples, where m(γ, ρ/6T, β) is the sample complexity needed for the weak learner to be ρ/6T-replicable and (1/2 − γ, β)-accurate.

Combining the discussion above, we see that for any α, ρ, β ∈ (0, 1)³ we can transform a pure DP learner to a ρ-replicable (α, β)-correct (strong) learner, where the transformation is efficient with respect to the accuracy α, confidence β, and replicability ρ. This proves Theorem 5.2, which we restate below for convenience.

Theorem 5.2 (From Pure DP Learner to Replicable Learner). Let X be some input domain, Y = {0, 1}, and D_XY be a distribution on X × Y that is realizable with respect to some concept class C. Let A be a pure DP learner that, for any α, ε, β ∈ (0, 1)³, needs m(α, ε, β, C) = poly(1/α, 1/ε, log(1/β), dim(C)) i.i.d. samples from D_XY and poly(m) running time to output a hypothesis that has error at most α, with probability 1 − β, in an ε-DP way. Then, for any α′, ρ, β′ ∈ (0, 1)³ there is a ρ-replicable learner A′ that outputs a hypothesis with error at most α′ with probability at least 1 − β′, and requires poly(1/α′, 1/ρ, log(1/β′), dim(C)) i.i.d. samples from D_XY and poly(1/α′, 1/ρ, log(1/β′)) · exp(dim(C)) running time.

We reiterate that even though our transformation is efficient with respect to the input parameters α, β, ρ, it might not be efficient with respect to a parameter that depends on the representation of the concept class. This is because the size of the dummy dataset m₀ could depend on the size of that representation, and our transformation requires poly(m₀) samples but exp(m₀) running time. This is also the case for the transformation from a DP learner to an online learner [Gonen et al., 2019].
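Schematically, the transformation behind Theorem 5.2 is the composition of the weak learner of Lemma F.2 with the boosting procedure of Theorem F.3. The sketch below continues the one given after Lemma F.2; `replicable_boost` is a hypothetical stand-in for the boosting algorithm of Impagliazzo et al. [2022], not an implementation of it.

```python
# A schematic composition for Theorem 5.2 (hypothetical names).

def dp_to_replicable_learner(dp_learner, m0, sample, alpha, rho, beta, seed):
    """Transform a pure DP learner into a rho-replicable strong
    learner with error <= alpha, w.p. >= 1 - beta."""
    # The weak learner of Lemma F.2 has error <= 3/8 = 1/2 - 1/8,
    # i.e., advantage gamma = 1/8 over random guessing.
    def weak_learner(weak_sample, weak_rho, weak_beta, weak_seed):
        return replicable_weak_learner(dp_learner, m0, weak_sample,
                                       weak_rho, weak_beta, weak_seed)

    # Boosting (Theorem F.3) drives the error from 1/2 - gamma down
    # to alpha while preserving rho-replicability; the boosting rounds
    # reuse the same shared seed across the two executions.
    return replicable_boost(weak_learner, sample, gamma=1 / 8,
                            alpha=alpha, rho=rho, beta=beta, seed=seed)
```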
NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: The main body and appendix of the submission provide proofs for all the claims.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: We formally define the mathematical model our results hold for.

3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: We present full proofs in the main body and the appendix.

4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [NA]
Justification: The paper does not include experiments.

5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [NA]
Justification: The paper does not include experiments requiring code.

6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [NA]
Justification: The paper does not include experiments.

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [NA]
Justification: The paper does not include experiments.

8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [NA]
Justification: The paper does not include experiments.

9. Code of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
Answer: [Yes]
Justification: [NA]

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [NA]
Justification: The work is primarily theoretical and does not have any immediate societal impact.

11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: The paper poses no such risks.

12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [NA]
Justification: The paper does not use existing assets.

13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [NA]
Justification: The paper does not release new assets.

14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: The paper does not involve crowdsourcing nor research with human subjects.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: The paper does not involve crowdsourcing nor research with human subjects.