Replicable Learning of Large-Margin Halfspaces

Alkis Kalavasis*1, Amin Karbasi*1,2, Kasper Green Larsen*3, Grigoris Velegkas*1, Felix Zhou*1

*Equal contribution. 1Yale University, 2Google Research, 3Aarhus University. Correspondence to: Alkis Kalavasis, Grigoris Velegkas, Felix Zhou. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

We provide efficient replicable algorithms for the problem of learning large-margin halfspaces. Our results improve upon the algorithms provided by Impagliazzo, Lei, Pitassi, and Sorrell (STOC, 2022). We design the first dimension-independent replicable algorithms for this task which run in polynomial time, are proper, and have strictly improved sample complexity compared to the one achieved by Impagliazzo et al. (2022) with respect to all the relevant parameters. Moreover, our first algorithm has sample complexity that is optimal with respect to the accuracy parameter πœ–. We also design an SGD-based replicable algorithm that, in some parameter regimes, achieves better sample and time complexity than our first algorithm. Departing from the requirement of polynomial-time algorithms, using the DP-to-Replicability reduction of Bun, Gaboardi, Hopkins, Impagliazzo, Lei, Pitassi, Sivakumar, and Sorrell (STOC, 2023), we show how to obtain a replicable algorithm for large-margin halfspaces with improved sample complexity with respect to the margin parameter 𝜏, but with running time doubly exponential in 1/𝜏^2 and worse sample complexity dependence on πœ– than one of our previous algorithms. We then design an improved algorithm with better sample complexity than all three of our previous algorithms and running time exponential in 1/𝜏^2.

1. Introduction

The replicability crisis is omnipresent in many scientific disciplines including biology, medicine, chemistry, and, importantly, AI (Baker, 2016; Pineau et al., 2019). A recent article that appeared in Nature (Ball, 2023) explains how the reproducibility crisis in AI has a cascading effect across many other scientific areas due to its widespread applications in other fields such as medicine. Thus, an urgent goal is to design a formal framework through which we can argue about the replicability of experiments in ML. Such a theoretical framework was proposed in a recent work by Impagliazzo et al. (2022) and has been studied extensively in several learning settings (Esfandiari et al., 2023b;a; Bun et al., 2023; Kalavasis et al., 2023; Chase et al., 2023b; Dixon et al., 2023; Chase et al., 2023a; Eaton et al., 2023; Karbasi et al., 2023).

Definition 1.1 (Replicability (Impagliazzo et al., 2022)). Let β„› be a distribution over random binary strings. A learning algorithm π’œ is 𝑛-sample 𝜌-replicable if for any distribution π’Ÿ over inputs and two independent datasets 𝑆, 𝑆′ ∼ π’Ÿ^𝑛, it holds that Pr_{𝑆,𝑆′ ∼ π’Ÿ^𝑛, π‘Ÿ ∼ β„›}[π’œ(𝑆, π‘Ÿ) = π’œ(𝑆′, π‘Ÿ)] β‰₯ 1 βˆ’ 𝜌.

In words, this definition requires that when an algorithm π’œ is executed twice on different i.i.d. datasets 𝑆, 𝑆′ but using shared internal randomness, then the output of the algorithm is exactly the same, with high probability. We note that sharing the randomness across the executions is crucial in achieving this replicability guarantee. Importantly, Dixon et al. (2023) showed that without sharing randomness, it is impossible to achieve such a strong notion of replicability even for simple tasks such as mean estimation.
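To make the two-execution experiment in Definition 1.1 concrete, here is a minimal Python sketch. The `learn` routine below is a hypothetical stand-in for π’œ (a randomly shifted rounding of the sample mean), and the shared random string π‘Ÿ is modeled as a seed; neither is part of the paper.

```python
import numpy as np

def learn(samples: np.ndarray, r: np.random.Generator) -> float:
    # Hypothetical learner standing in for the algorithm A: it rounds the
    # sample mean onto a unit grid whose offset is drawn from the random string r.
    offset = r.uniform(0.0, 1.0)
    return offset + round(float(samples.mean()) - offset)

def replicability_experiment(draw_dataset, n: int, seed: int) -> bool:
    S = draw_dataset(n)        # S ~ D^n
    S_prime = draw_dataset(n)  # S' ~ D^n, independent of S
    # The two executions share the internal randomness r (same seed).
    out = learn(S, np.random.default_rng(seed))
    out_prime = learn(S_prime, np.random.default_rng(seed))
    # Definition 1.1 asks that the two outputs coincide with probability
    # at least 1 - rho over the draw of S, S', and r.
    return out == out_prime

if __name__ == "__main__":
    data_rng = np.random.default_rng(0)
    draw = lambda n: data_rng.normal(loc=0.3, scale=1.0, size=n)
    agreements = [replicability_experiment(draw, n=10_000, seed=s) for s in range(200)]
    print("fraction of runs with identical outputs:", np.mean(agreements))
```

For this toy learner the agreement rate should be close to one: the two sample means differ by roughly 1/βˆšπ‘›, which rarely places them on opposite sides of a grid boundary of unit length.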
In practice, this can be achieved by simply publishing the random seed with which the ML algorithms are executed. As we extensively discuss in Section 1.2, Definition 1.1 turns out to be connected with other notions of stability such as differential privacy and perfect generalization (Ghazi et al., 2021; Bun et al., 2023).

In this work, we study the fundamental problem of learning large-margin halfspaces, where no example is allowed to lie too close to the separating hyperplane. This task is related to foundational ML techniques such as the Perceptron algorithm (Rosenblatt, 1958), SVMs (Cortes and Vapnik, 1995), and AdaBoost (Freund and Schapire, 1997). Let us recall the concept class of interest.

Definition 1.2 (Large-Margin Halfspaces). Let π’Ÿ be a distribution over R^𝑑 Γ— {βˆ’1, 1} whose support does not contain π‘₯ = 0. We say that π’Ÿ has linear margin 𝜏 if there exists a unit vector 𝑀* ∈ R^𝑑 such that for any (π‘₯, 𝑦) ∈ supp(π’Ÿ) it holds that 𝑦(𝑀*^⊀π‘₯/βˆ₯π‘₯βˆ₯) β‰₯ 𝜏. (When we do not specify the norm, we assume the β„“2-norm.)

Following the PAC learning definition of Valiant (1984), we say that an algorithm learns the class of 𝜏-margin halfspaces in 𝑑 dimensions with accuracy πœ– and confidence 𝛿 using 𝑛 = 𝑛(πœ–, 𝛿, 𝑑, 𝜏) samples and runtime 𝑇 = 𝑇(πœ–, 𝛿, 𝑑, 𝜏) if, given 𝑛 i.i.d. samples from any distribution π’Ÿ satisfying Definition 1.2, the algorithm outputs, in time 𝑇, a classifier β„Ž: R^𝑑 β†’ {βˆ’1, 1} such that Pr_{(π‘₯,𝑦)βˆΌπ’Ÿ}[β„Ž(π‘₯) β‰  𝑦] ≀ πœ–, with probability at least 1 βˆ’ 𝛿. We are interested in replicably learning large-margin halfspaces, i.e., designing algorithms for large-margin halfspaces that further satisfy Definition 1.1. We remark that when the feature domain is infinite, there is no replicable learning algorithm for halfspaces in general; thus, some assumption such as the large-margin condition is necessary. In particular, Bun et al. (2023); Kalavasis et al. (2023) show that finiteness of the Littlestone dimension is a necessary condition for learnability by replicable algorithms, and it is known that even one-dimensional halfspaces over [0, 1] have infinite Littlestone dimension. See Table 1.1 for a comparison of prior work and our contributions.

Table 1.1. Replicable Algorithms for Large-Margin Halfspaces. We denote by 𝑑 the dimension, πœ– the accuracy, 𝜌 the replicability, and 𝜏 the margin of the halfspace. We omit the logarithmic factors on the sample complexity. See Appendix A for a more detailed comparison.

ALGORITHM | SAMPLES | PROPER?
[ILPS22] (inefficient) | (𝑑 πœ–^{-3} 𝜏^{-8} 𝜌^{-2})^{1.01} | NO
[ILPS22] | (𝑑^3 πœ–^{-4} 𝜏^{-10} 𝜌^{-2})^{1.01} | NO
Algorithm 1 (Theorem 1.3) | πœ–^{-1} 𝜏^{-7} 𝜌^{-2} | YES
Algorithm 2 (Theorem 1.4) | πœ–^{-2} 𝜏^{-6} 𝜌^{-2} | YES
DP reduction [LNUZ20; BGH+23] (Proposition 1.5, inefficient) | πœ–^{-2} 𝜏^{-4} 𝜌^{-2} | YES
Algorithm 3 (Theorem 1.6, inefficient) | πœ–^{-1} 𝜏^{-4} 𝜌^{-2} | YES

The work of Impagliazzo et al. (2022) provided the first replicable algorithms for 𝜏-margin halfspaces over R^𝑑. The first algorithm of Impagliazzo et al. (2022), which uses the foams discretization scheme (Kindler et al., 2012), is 𝜌-replicable and returns a hypothesis β„Ž that, with probability at least 1 βˆ’ 𝜌, satisfies Pr_{(π‘₯,𝑦)βˆΌπ’Ÿ}[β„Ž(π‘₯) β‰  𝑦] ≀ πœ–. The sample complexity of this algorithm is roughly 𝑂((𝑑 πœ–^{-3} 𝜏^{-8} 𝜌^{-2})^{1.01}) and the runtime is exponential in 𝑑 and polynomial in 1/πœ–, 1/𝜌, and 1/𝜏. The second algorithm of Impagliazzo et al.
(2022), which uses the box discretization scheme, is 𝜌-replicable and returns a hypothesis β„Žthat, with probability at least 1 𝜌, satisfies Pr(π‘₯,𝑦) π’Ÿ[β„Ž(π‘₯) = 𝑦] πœ–with sample complexity 𝑂((𝑑3πœ– 4𝜏 10𝜌 2)1.01) and runtime which is polynomial in 𝑑, 1/πœ–, 1/𝜌and 1/𝜏. These two algorithms appear in the first two rows of Table 1.1. Some remarks are in order. In our setting, the sample complexity of learning large-margin halfspaces in the absence of the replicability requirement is 𝑂(1/πœ–πœ2). Notice that the sample complexity of both algorithms of Impagliazzo et al. (2022) depends on the dimension 𝑑of the problem. This is unexpected since the sample complexity of non-replicable algorithms for this task is dimension-independent. In the case of the replicable algorithms of Impagliazzo et al. (2022), the dependence on the dimension appears due to a rounding/discretization step, which is crucial in establishing the replicability guarantees. Second, both algorithms of Impagliazzo et al. (2022) are improper in the sense that the hypothesis β„Žthey output does not correspond to a halfspace. This is due to the use of a replicable boosting routine that outputs the majority vote over multiple halfspaces. As a general note, both of these algorithms are fairly complicated: they require multiple discretization/rounding steps, and then they output a weak learner, which finally needs to be boosted using multiple rounds of a replicable boosting scheme. As a result, the sample complexity of their algorithms incurs a significant blow-up in the parameters πœ–, 𝜏 compared to the non-replicable setting. We work in the setting of finite (but not necessarily bounded) bit precision with the goal of designing algorithms that are agnostic to the marginal distribution and sample complexity that is dimension-independent. Indeed, assuming bounded bit precision 𝑏, possibly some structure about the marginal distribution, and that the margin 𝜏 > 0, there are replicable algorithms for learning halfspaces with time and sample complexity 𝑂(poly(𝑑, 𝑏, 1/πœ–, log(1/𝜏), 1/𝜌)) via black-box transformations of existing (complicated) SQ algorithms (Balcan and Feldman, 2015; Dunagan and Vempala, 2004). In the regime when 𝜏is constant and 𝑑is relatively large, our algorithms can outperform the black-box transformations. 1.1. Our Contribution Impagliazzo et al. (2022) leave as an open question whether Replicable Learning of Large-Margin Halfspaces the sample complexity bounds of their algorithms are tight. In this work, we show that these bounds are sub-optimal. We provide new replicable algorithms for learning largemargin halfspaces that improve upon the results of Impagliazzo et al. (2022) in various aspects. Our algorithms have no sample dependence on 𝑑, strictly improve on the dependence on 1/πœ–, 1/𝜌, and 1/𝜏,2 and are proper, meaning that they output linear models. Moreover, our Algorithm 1 and Algorithm 2 are computationally efficient while Algorithm 3 forsakes computational efficiency to achieve further improvements in sample complexity. We now state our first algorithmic result, the proof of which can be found in Section 3. Theorem 1.3 (Efficient Replicable Algorithm 1). Fix πœ–, 𝜏, 𝜌, 𝛿 (0, 1). Let π’Ÿbe a distribution over R𝑑 { 1, 1} that has linear margin 𝜏as in Definition 1.2. There is an algorithm that is 𝜌-replicable and, given π‘š= 𝑂(πœ– 1𝜏 7𝜌 2 log(1/𝛿)) i.i.d. 
samples (π‘₯, 𝑦) π’Ÿ, computes in time poly(𝑑, 1/πœ–, 1/𝜏, 1/𝜌, log(1/𝛿)) a unit vector 𝑀 R𝑑such that Pr(π‘₯,𝑦) π’Ÿ[sgn(𝑀 π‘₯) = 𝑦] πœ–with probability at least 1 𝛿. Algorithm 1 improves on the sample complexity of the two algorithms appearing in Impagliazzo et al. (2022), runs in polynomial time, and is proper. Our techniques follow a different path from that of Impagliazzo et al. (2022). As we alluded to before, their approach is fairly complicated and is based on the design of a replicable weak halfspace learning algorithm and then a replicable boosting algorithm that combines multiple weak learners. Our approach is single-shot and significantly simpler: Consider 𝐡independent non-overlapping batches of training examples. From each batch 𝑖 [𝐡], we find a hyperplane with normal vector 𝑀𝑖 R𝑑that has Ω(𝜏) margin on the training data. This can be achieved by running the standard Support Vector Machine (SVM) algorithm (Cortes and Vapnik, 1995; Vapnik, 2006). We then aggregate our vectors to a single average normal vector 𝑧= (1/𝐡) 𝑖 [𝐡] 𝑀𝑖. Finally, we project the vector 𝑧onto a lower-dimensional space, whose dimension does not depend on 𝑑, and we replicably round 𝑧using a rounding scheme due to Alon and Klartag (2017) for which we perform a novel analysis in the shared randomness setting. We emphasize that our algorithm gives a halfspace with the desired accuracy guarantee without the need to use any boosting schemes. At a technical level, we avoid the dependence on the dimension 𝑑thanks to data-oblivious dimensionality reduction techniques (cf. Appendix B.3), a standard tool in the literature of large-margin halfspaces. Instead of rounding our estimates in the 𝑑-dimensional space, we first project them to a lower-dimensional space, whose dimension does 2By iteratively halving a guess for 𝜏, we may assume without loss of generality that 𝜏is known. not depend on 𝑑, and we perform the rounding in that space. The crucial idea is that one can use the data-obliviousness of Johnson-Lindenstrauss (JL) matrices so that the projection matrices in two distinct executions are the same since the internal randomness is shared. Another technical aspect of our algorithm that differentiates it from prior works on the design of replicable algorithms is the use of a different rounding scheme known as the Alon-Kartag rounding scheme (cf. Section 2). While this rounding scheme was known, to the best of our knowledge, we are the first to utilize and analyze the scheme in the context of replicability. Consider the simple case of 1-dimensional data. In the same spirit as in Impagliazzo et al. (2022), we consider a random grid. But rather than rounding the point to a fixed element of each cell of the grid (e.g., its center), we randomly round it to one of the two endpoints of the cell using shared internal randomness so that, in expectation, the rounded point is the same as the original one. This is helpful as it preserves inner products in expectation, and therefore gives better concentration properties across multiple roundings. We believe that this rounding scheme can find more applications in the replicable learning literature. The detailed proof of Theorem 1.3 can be found in Section 3. Despite the simplicity of our algorithm, there are technical subtleties that complicate its analysis. For instance, the projection to the low-dimensional space introduces a subtle complication we need to handle. In particular, using ideas from GrΓΈnlund et al. 
(2020) we can show that the aggregated vector in the high-dimensional space has the desired generalization properties. However, when we project it to the low-dimensional space there are vectors that are now misclassified, due to the error introduced by the JL mapping. Using the guarantees of the JL projection, we show that, uniformly over the data-generating distributions, this happens for only a small fraction of the population. Algorithm 2 follows a similar framework as Algorithm 1 by running a non-replicable algorithm on independent batches of data and then aggregating the outputs replicably, illustrating the flexibility of our approach. We now describe this result in more detail, whose proof can be found in Section 4. Theorem 1.4 (Efficient Replicable Algorithm 2). Fix πœ–, 𝜏, 𝜌, 𝛿 (0, 1). Let π’Ÿbe a distribution over R𝑑 { 1, 1} that has linear margin 𝜏as in Definition 1.2. There is an algorithm that is 𝜌-replicable and, given π‘š= 𝑂(πœ– 2𝜏 6𝜌 2 log(1/𝛿)) i.i.d. samples (π‘₯, 𝑦) π’Ÿ, computes in time poly(𝑑, 1/πœ–, 1/𝜏, 1/𝜌, log(1/𝛿)) a unit vector 𝑀 R𝑑such that Pr(π‘₯,𝑦) π’Ÿ[sgn(𝑀 π‘₯) = 𝑦] πœ–with probability at least 1 𝛿. Compared to Algorithm 1, our Algorithm 2 achieves better dependence on 𝜏by incurring an additional 1/πœ–fac- Replicable Learning of Large-Margin Halfspaces tor in the sample complexity. At a technical level, as in (LΓͺ Nguyen et al., 2020), we provide a convex surrogate that upper bounds the loss 1{𝑦(π‘₯ 𝑀) 𝜏/2}. Running SGD on this convex surrogate provides a unit vector 𝑀that, in expectation over the data, achieves a margin of at least 𝜏/2 for an 𝑂(πœ–)-mass of the population. We then apply a standard boosting trick to turn this guarantee into a high probability bound. Next, we work as in Algorithm 1: we run the above procedure 𝐡times to get 𝑀1, ..., 𝑀𝐡and aggregate our vectors into a single vector 𝑧= (1/𝐡) 𝑖 [𝐡] 𝑀𝑖. Lastly, we perform a JL-projection on 𝑧and then round using the Alon-Klartag rounding scheme, as in Algorithm 1. Computationally Inefficient Reductions from DP. It is a corollary of the works of Bun et al. (2023) and Kalavasis et al. (2023) that one can use existing differentially private (DP) algorithms in order to obtain replicable learners. In particular, following the reduction of Bun et al. (2023), one can obtain a replicable algorithm for large-margin halfspaces with better sample complexity in terms of 𝜏, but in a computationally inefficient way. The idea is to take an off-the-shelf DP algorithm (recall Definition F.1) for this problem (e.g. LΓͺ Nguyen et al. (2020)) and transform it into a replicable one. We remark that this transformation holds when the algorithm outputs finitely many different solutions and its running time is exponential in the number of these solutions. Fortunately, the pure DP algorithm from LΓͺ Nguyen et al. (2020) satisfies this finite co-domain property. The formal statement of the result we get by combining these two algorithms is presented below. Proposition 1.5 (Inefficient Replicable Algorithm; follows from LΓͺ Nguyen et al. (2020); Bun et al. (2023)). Fix πœ–, 𝜏, 𝜌, 𝛿 (0, 1). Let π’Ÿbe a distribution over R𝑑 { 1, 1} that has linear margin 𝜏as in Definition 1.2. There is an algorithm that is 𝜌-replicable and, given π‘š= 𝑂(πœ– 2𝜏 4𝜌 2 log(1/𝛿)) i.i.d. samples (π‘₯, 𝑦) π’Ÿ, com- putes, in time exp ( (1/𝜏) log(1/(πœ–πœŒπ›Ώ)) 𝜏2 ) poly(𝑑), a unit vector 𝑀 R𝑑such that Pr(π‘₯,𝑦) π’Ÿ[sgn(𝑀 π‘₯) = 𝑦] πœ–, with probability at least 1 𝛿. 
As mentioned above, the proof of this result follows by combining the DP-to-Replicability transformation of Bun et al. (2023) (cf. Proposition F.3) with the pure DP algorithm for learning large-margin halfspaces due to LΓͺ Nguyen et al. (2020) (cf. Proposition F.2). We note that since the algorithm of LΓͺ Nguyen et al. (2020) is proper and the reduction of Bun et al. (2023) is based on sub-sampling, the output of Proposition 1.5 is also a linear classifier. The main issue with this approach is that, apart from not being a polynomial time algorithm, the reduction requires a quadratic blow-up in the sample complexity of the provided DP algorithm. To be more specific, the DP algorithm of (LΓͺ Nguyen et al., 2020) has sample complexity 𝑂(πœ– 1𝜏 2) for accuracy πœ– and margin 𝜏. This means that the replicable algorithm of Proposition 1.5 incurs a quadratic blow-up in the sample complexity on the parameters πœ–, 𝜏. The cost of this transformation is tight under standard cryptographic hardness assumptions (Bun et al., 2023). Thus, it is unlikely that we can reduce the dependence on the error parameter πœ–using such a generic transformation. We emphasize that our efficient replicable algorithm (cf. Algorithm 1) has linear sample complexity dependence on 1/πœ–. The blow-up on the running time of the algorithm is due to the use of correlated sampling in the transformation of Bun et al. (2023) which requires exponential running time in the size of the output space. We remark that in the case of the algorithm of LΓͺ Nguyen et al. (2020), the size of the output space is already exponential in 1/𝜏2. In our work, we also revisit this inefficient algorithm and improve on its sample complexity and runtime, as follows: Theorem 1.6 (Improved Inefficient Replicable Algorithm 3). Fix πœ–, 𝜏, 𝜌, 𝛿 (0, 1). Let π’Ÿbe a distribution over R𝑑 { 1, 1} that has linear margin 𝜏as in Definition 1.2. Then there is a 𝜌-replicable algorithm such that given π‘š= 𝑂(πœ– 1𝜏 4𝜌 2 log(1/𝛿)) i.i.d. samples (π‘₯, 𝑦) π’Ÿ, computes in time poly(𝑑) poly(1/πœ–, 1/𝜏, 1/𝜌, 1/𝛿)1/𝜏2, a unit vector 𝑀 R𝑑satisfying Pr(π‘₯,𝑦) π’Ÿ[sgn(𝑀 π‘₯) = 𝑦] πœ– with probability at least 1 𝛿. Compared to the DP-to-Replicability reduction, Theorem 1.6 has better dependence on 1/πœ–and better running time. For the proof of this result, we refer to Section 5. 1.2. Related Work In terms of technique, a related work is that of (Nissim et al., 2016) for differentially private clustering. The authors also use the JL projection followed by a discretization step as part of their algorithm. Similar to our techniques, this serves to avoid a factor in the discretization scheme that scales with the dimension. One important difference is that (Nissim et al., 2016) solves a task on the samples while we need to solve a task on a distribution using samples. Moreover, while (Nissim et al., 2016) also use a discretization step (not based on the Alon-Kartag scheme) to find a dense part of the space to use as the center of the cluster, the discretization step in our work serves a different purpose: to ensure that our estimates, across two i.i.d. executions, will be the same. Replicability. Pioneered by Impagliazzo et al. (2022), there has been a growing interest from the learning theory community in studying replicability as an algorithmic property in various learning tasks. 
Among other things, Replicable Learning of Large-Margin Halfspaces their work showed that the fundamental class of statistical queries, which appears in various settings (see e.g., Blum et al. (2003); Gupta et al. (2011); Goel et al. (2020); Fotakis et al. (2021) and the references therein) can be made replicable. Subsequently, Esfandiari et al. (2023a;b) studied replicable algorithms in the context of multi-armed bandits and clustering. Later, Eaton et al. (2023); Karbasi et al. (2023) studied replicability in the context of Reinforcement Learning. Recently, Bun et al. (2023); Kalavasis et al. (2023; 2024) established equivalences between replicability and other notions of algorithmic stability such as differential privacy (DP), and Moran et al. (2023) derived more fine-grained characterizations of these equivalences. It is worth mentioning that Malliaris and Moran (2022) had already established equivalences between various notions of algorithmic stability and finiteness of the Littlestone dimension of the underlying concept class. Inspired by Impagliazzo et al. (2022), a related line of work (Chase et al., 2023b; Dixon et al., 2023; Chase et al., 2023a) proposed and studied alternative notions of replicability such as listreplicability, where the requirement is that when the algorithm is executed multiple times on i.i.d. datasets, then the number of different solutions it will output across these executions is small. Large-Margin Halfspaces. The problem of learning large-margin halfspaces has been extensively studied and has inspired various fundamental algorithms (Rosenblatt, 1958; Vapnik, 1999; Freund and Schapire, 1997; 1998). In the DP setting, Blum et al. (2005) gave a dimensiondependent construction based on a private version of the perceptron algorithm. This was later improved by LΓͺ Nguyen et al. (2020) who gave new DP algorithms for this task with dimension-independent guarantees based on the Johnson-Lindenstrauss transformation. Next, Bun et al. (2020) constructed noise-tolerant and private PAC learners for large-margin halfspaces whose sample complexity also does not depend on the dimension. Beimel et al. (2019) and Kaplan et al. (2020) designed private algorithms for learning halfspaces without margin guarantees when the domain is finite. Bassily et al. (2022b) stated an open problem of finding optimal DP algorithms for learning largemargin halfspaces both with respect to their running time and their sample complexity. Bassily et al. (2022a) studied DP algorithms for various learning tasks with margin, including halfspaces, kernels, and neural networks. In the area of robust statistics, Diakonikolas et al. (2023) showed a statistical-computational tradeoff in the problem of PAC learning large-margin halfspaces with random classification noise. For further results on robustly learning largemargin halfspaces, we refer to Diakonikolas et al. (2019) and the references therein. 2. The Main Tool: The Alon-Klartag Rounding Scheme Inspired by Alon and Klartag (2017), we introduce and use the following rounding scheme AKround(𝑧, 𝛽) for a point 𝑧with parameter 𝛽: let π‘œ= (π‘œ1, . . . , π‘œπ‘˜) 𝑖.𝑖.𝑑. π‘ˆ[0, 𝛽] be uniformly random offsets and implicitly discretize Rπ‘˜ using a grid of side length 𝛽centered at π‘œ. Let π‘œ(𝑧) Rπ‘˜ denote the bottom-left corner of the cube in which 𝑧lies, i.e., the point obtained by rounding down all the coordinates of 𝑧. For a vector 𝑧, we let 𝑧[𝑖] be its 𝑖-th coordinate. 
Define 𝑝(𝑧)[𝑖] ∈ [0, 1] to be such that

𝑝(𝑧)[𝑖] Β· π‘œ(𝑧)[𝑖] + (1 βˆ’ 𝑝(𝑧)[𝑖]) Β· (π‘œ(𝑧)[𝑖] + 𝛽) = 𝑧[𝑖].

Given offsets π‘œ and thresholds 𝑒 = (𝑒_1, . . . , 𝑒_π‘˜) with 𝑒_𝑖 ∼ π‘ˆ[0, 1], round a vector 𝑧 to 𝑓_{π‘œ,𝑒}(𝑧), whose 𝑖-th coordinate is equal to π‘œ(𝑧)[𝑖] if 𝑒_𝑖 ≀ 𝑝(𝑧)[𝑖] and to π‘œ(𝑧)[𝑖] + 𝛽 otherwise. Crucially, in expectation, the rounded point 𝑓_{π‘œ,𝑒}(𝑧) coincides with 𝑧. The next lemma, whose proof can be found in Appendix C, is useful in order to derive the replicability guarantees of our rounding scheme.

Lemma 2.1 (Stability of Rounding). Let 𝑧, 𝑧′ ∈ R^π‘˜. Then for independent uniform offsets π‘œ_1, . . . , π‘œ_π‘˜ ∈ [0, 𝛽] and thresholds 𝑒_1, . . . , 𝑒_π‘˜ ∈ [0, 1], we have Pr_{π‘œ,𝑒}[𝑓_{π‘œ,𝑒}(𝑧) β‰  𝑓_{π‘œ,𝑒}(𝑧′)] ≀ 2𝛽^{-1} βˆ₯𝑧 βˆ’ 𝑧′βˆ₯_1.

Our novel analysis of the stability of the rounding scheme under shared randomness (cf. Lemma 2.1) demonstrates its useful properties for designing replicable algorithms. We believe these properties may be of interest beyond the scope of this work and can find applications in designing replicable algorithms for different problems. Next, we show that the Alon-Klartag rounding scheme additively preserves inner products with high probability. This is formalized below in a lemma whose proof can be found in Appendix C.

Lemma 2.2 (Rounding Preserves Inner Products). Let 𝑧, π‘₯ ∈ R^π‘˜ be such that βˆ₯π‘₯βˆ₯ ≀ 1. For uniform offsets π‘œ_1, . . . , π‘œ_π‘˜ ∈ [0, 𝛽] and thresholds 𝑒_1, . . . , 𝑒_π‘˜ ∈ [0, 1], we have Pr_{π‘œ,𝑒}[|𝑓_{π‘œ,𝑒}(𝑧)^⊀π‘₯ βˆ’ 𝑧^⊀π‘₯| > 𝛼] ≀ 2 exp(βˆ’2𝛼^2 𝛽^{-2}).

It is worth mentioning that the Alon-Klartag rounding scheme, along with dimensionality reduction techniques, was also used by GrΓΈnlund et al. (2020) in order to prove generalization bounds for SVMs.

3. Replicably Learning Large-Margin Halfspaces: Algorithm 1

In this section, we describe our first algorithm and prove its guarantees as stated in Theorem 1.3. Let π’Ÿ be a distribution over R^𝑑 Γ— {βˆ’1, 1} with linear margin 𝜏 ∈ (0, 1) (cf. Definition 1.2). Thus, there is a unit vector 𝑀* ∈ R^𝑑 such that for all (π‘₯, 𝑦) ∈ supp(π’Ÿ) with π‘₯ β‰  0, we have 𝑦(π‘₯^βŠ€π‘€*/βˆ₯π‘₯βˆ₯) β‰₯ 𝜏. Given πœ–, 𝜌, 𝛿 ∈ (0, 1), our goal is to design a 𝜌-replicable learning algorithm that draws π‘š = π‘š(πœ–, 𝜏, 𝜌, 𝛿) i.i.d. samples from π’Ÿ and outputs 𝑀̂ ∈ R^𝑑 such that, with probability at least 1 βˆ’ 𝛿 over the randomness of the samples and (potentially) the internal randomness of the algorithm, it holds that Pr_{(π‘₯,𝑦)βˆΌπ’Ÿ}[𝑦(𝑀̂^⊀π‘₯) ≀ 0] ≀ πœ–.

Description of Algorithm 1. We consider 𝐡 batches of 𝑛 samples each. Hence, in total, we draw 𝑛𝐡 i.i.d. samples from π’Ÿ. On each batch 𝑖 ∈ [𝐡], we run the standard SVM algorithm (cf. Lemma B.1) to find a hyperplane with normal vector 𝑀_𝑖 ∈ R^𝑑 that has margin at least 𝜏/2 on all training data in the batch. We then compute the average normal vector 𝑧 = (1/𝐡) βˆ‘_{π‘–βˆˆ[𝐡]} 𝑀_𝑖. Finally, we round 𝑧 as described in Section 2.

Algorithm 1 Replicable Large-Margin Halfspaces
1: π‘˜ ← 𝐢_1 𝜏^{-2} log(1/(πœ–πœπœŒπ›Ώ))
2: 𝐡 ← 𝐢_2 𝜏^{-4} 𝜌^{-2} log(1/(πœ–πœπœŒπ›Ώ))
3: 𝑛 ← 𝐢_3 πœ–^{-1} 𝜏^{-3} log(1/(πœ–πœπœŒπ›Ώ))
4: 𝛽 ← 𝐢_4 𝜏 / log(1/(πœ–πœπœŒπ›Ώ))
5: for 𝑖 = 1, 2, . . . , 𝐡 do
6:   𝑆_𝑖 ← batch of 𝑛 i.i.d. samples from π’Ÿ
7:   𝑀_𝑖 ← SVM(𝑆_𝑖, 𝜏/2)
8: end for
9: 𝑧 ← (1/𝐡) βˆ‘_{π‘–βˆˆ[𝐡]} 𝑀_𝑖
10: Draw 𝐴 ∈ R^{π‘˜Γ—π‘‘} with 𝐴_{𝑖,𝑗} ∼ 𝒩(0, 1/π‘˜) (shared randomness)
11: 𝑏 ← AKround(𝐴𝑧, 𝛽) (shared randomness, cf. Section 2)
12: return 𝑀̂ = 𝐴^βŠ€π‘ / βˆ₯𝐴^βŠ€π‘βˆ₯
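The AKround subroutine invoked in line 11 is simple to state in code. The following Python sketch renders the rounding scheme of Section 2; the function name ak_round and the use of a seeded generator to model the shared random string are our own illustrative choices.

```python
import numpy as np

def ak_round(z: np.ndarray, beta: float, shared_rng: np.random.Generator) -> np.ndarray:
    """Alon-Klartag randomized rounding of z onto a grid of side length beta.

    The offsets o_i ~ U[0, beta] and thresholds u_i ~ U[0, 1] come from the
    *shared* random string, so two executions use identical o and u.
    """
    k = z.shape[0]
    o = shared_rng.uniform(0.0, beta, size=k)        # random grid offsets
    u = shared_rng.uniform(0.0, 1.0, size=k)         # rounding thresholds
    lower = o + np.floor((z - o) / beta) * beta      # o(z): round each coordinate down to the grid
    # p(z)[i] solves p*lower + (1-p)*(lower+beta) = z[i], so the rounding is unbiased.
    p = (lower + beta - z) / beta
    return np.where(u <= p, lower, lower + beta)

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    z = rng.normal(size=50)
    z_prime = z + 1e-5 * rng.normal(size=50)         # a nearby vector, as in Lemma 2.1
    shared_seed = 7
    b = ak_round(z, beta=0.1, shared_rng=np.random.default_rng(shared_seed))
    b_prime = ak_round(z_prime, beta=0.1, shared_rng=np.random.default_rng(shared_seed))
    print("rounded outputs identical:", np.array_equal(b, b_prime))
    # Unbiasedness: averaging over fresh randomness approximately recovers z (Section 2).
    avg = np.mean([ak_round(z, 0.1, np.random.default_rng(s)) for s in range(2000)], axis=0)
    print("max |E[f(z)] - z| ~", np.abs(avg - z).max())
```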
Correctness of Algorithm 1. A straightforward adaptation of the results of GrΓΈnlund et al. (2020) (cf. Lemma B.1) shows that, with probability 1 βˆ’ 𝛿/(10𝐡) over the samples, the classifier 𝑀_𝑖 has margin at least 𝜏/4 on a (1 βˆ’ 𝑂((log 𝑛 + log(𝐡/𝛿))/(𝜏^2 𝑛))) fraction of the population, i.e.,

Pr_{(π‘₯,𝑦)βˆΌπ’Ÿ}[𝑦(𝑀_𝑖^⊀π‘₯)/βˆ₯π‘₯βˆ₯ < 𝜏/4] ≀ 𝑂((log 𝑛 + log(𝐡/𝛿))/(𝜏^2 𝑛)).

We denote the complement of this event by 𝐸_𝑖 and condition on not observing βˆͺ_{π‘–βˆˆ[𝐡]} 𝐸_𝑖. Under this event, for all the vectors 𝑀_𝑖, 𝑖 ∈ [𝐡], and all points (π‘₯, 𝑦) ∈ R^𝑑 Γ— {βˆ’1, 1}, it holds that 𝑦(𝑀_𝑖^⊀π‘₯)/βˆ₯π‘₯βˆ₯ β‰₯ βˆ’1, since 𝑀_𝑖 is a unit vector. Furthermore, for a (1 βˆ’ 𝑂(1/(𝜏^2 𝑛)))-fraction of the population, the margin is at least 𝜏/4, i.e., 𝑦(𝑀_𝑖^⊀π‘₯)/βˆ₯π‘₯βˆ₯ β‰₯ 𝜏/4. Intuitively, this means that the vector 𝑧 = (1/𝐡) βˆ‘_{π‘–βˆˆ[𝐡]} 𝑀_𝑖 should have margin at least 𝜏/8, except for an 𝑂(1/(𝜏^3 𝑛)) fraction of the population. Formally, for (π‘₯, 𝑦) ∼ π’Ÿ, let 𝑍_𝑖 be the indicator variable of the event 𝑦(𝑀_𝑖^⊀π‘₯)/βˆ₯π‘₯βˆ₯ < 𝜏/4 and let 𝑍 := βˆ‘_{π‘–βˆˆ[𝐡]} 𝑍_𝑖. Then

𝑦 𝑧^⊀(π‘₯/βˆ₯π‘₯βˆ₯) = (1/𝐡) βˆ‘_{π‘–βˆˆ[𝐡]} 𝑦 𝑀_𝑖^⊀(π‘₯/βˆ₯π‘₯βˆ₯) β‰₯ (1/𝐡)(βˆ’π‘ + (𝐡 βˆ’ 𝑍)𝜏/4) = 𝜏/4 βˆ’ (𝑍/𝐡)(1 + 𝜏/4).

This means that if 𝑦(𝑧^⊀π‘₯)/βˆ₯π‘₯βˆ₯ < 𝜏/8, then 𝜏/4 βˆ’ (𝑍/𝐡)(1 + 𝜏/4) < 𝜏/8, and hence 𝑍 > 𝜏𝐡/(8(1 + 𝜏/4)) = Ξ©(𝜏𝐡). It therefore suffices to bound the fraction of the population on which 𝑍 > Ξ©(𝜏𝐡) in order to bound the population error of 𝑧. Notice that the summation of the fractions of the population where the 𝑀_𝑖 have margin less than 𝜏/4 is at most 𝑂(𝐡(log 𝑛 + log(𝐡/𝛿))/(𝜏^2 𝑛)). As noted above, at least Ξ©(𝜏𝐡) of the classifiers must simultaneously have margin less than 𝜏/4 for 𝑧 to have margin below 𝜏/8 on π‘₯. Thus the fraction of the population where 𝑧 has margin smaller than 𝜏/8 is at most

𝑂(𝐡(log 𝑛 + log(𝐡/𝛿))/(𝜏^2 𝑛 Β· 𝜏𝐡)) = 𝑂((log 𝑛 + log(𝐡/𝛿))/(𝜏^3 𝑛)).

Thus, choosing 𝑛 β‰₯ Ξ©(log(𝐡/𝛿)/(𝜏^3 πœ–)) ensures that the intermediate normal vector 𝑧 = (1/𝐡) βˆ‘_{𝑖=1}^𝐡 𝑀_𝑖 satisfies Pr_{(π‘₯,𝑦)βˆΌπ’Ÿ}[𝑦(𝑧^⊀π‘₯/βˆ₯π‘₯βˆ₯) < 𝜏/8] ≀ πœ–/10 with probability at least 1 βˆ’ 𝛿/10.

The following lemma, whose proof is deferred to Appendix D, ensures that projecting and rounding in the lower dimension approximately preserves the performance of 𝑧 with respect to the 0-1 loss (as opposed to the 𝜏/8-margin loss).

Lemma 3.1. Fix πœ–, 𝜏, 𝛿 ∈ (0, 1) and let π’Ÿ be a distribution over R^𝑑 Γ— {Β±1} that admits a linear classifier with 𝜏-margin. Suppose 𝑧 is a random unit vector satisfying Pr_{(π‘₯,𝑦)βˆΌπ’Ÿ}[𝑦(𝑧^⊀π‘₯/βˆ₯π‘₯βˆ₯) < 𝜏/2] ≀ πœ– with probability at least 1 βˆ’ 𝛿 over the draw of 𝑧. Define π‘˜ = Ξ©(𝜏^{-2} log(1/(πœ–π›Ώ))) and 𝛽 = 𝑂(𝜏/log(1/(πœ–π›Ώ))). If 𝐴 ∈ R^{π‘˜Γ—π‘‘} is a JL-matrix (cf. Appendix B.3) and 𝑏 = AKround(𝐴𝑧, 𝛽) (cf. Section 2), then 𝑀̂ = 𝐴^βŠ€π‘/βˆ₯𝐴^βŠ€π‘βˆ₯ satisfies Pr_{(π‘₯,𝑦)βˆΌπ’Ÿ}[𝑦(𝑀̂^⊀π‘₯/βˆ₯π‘₯βˆ₯) ≀ 0] ≀ 2πœ– with probability at least 1 βˆ’ 2𝛿.

We note that applying it with 𝜏 ← 𝜏/4, πœ– ← πœ–/10, and 𝛿 ← 𝛿/10 yields that the final output 𝑀̂ of Algorithm 1 has 0-1 population error at most πœ–/5 with probability at least 1 βˆ’ 𝛿/5, as desired.

Replicability of Algorithm 1. We now state the lemma from which the replicability guarantees of our algorithm follow. Its proof is deferred to Appendix D.

Lemma 3.2. Fix πœ–, 𝜏, 𝜌, 𝛿 ∈ (0, 1). Suppose 𝑀_1, . . . , 𝑀_𝐡 and 𝑀′_1, . . . , 𝑀′_𝐡 are i.i.d. random unit vectors for 𝐡 = Ξ©(𝜏^{-4} 𝜌^{-2} log(1/(πœ–πœπœŒπ›Ώ))), and let 𝑧 = (1/𝐡) βˆ‘_{π‘–βˆˆ[𝐡]} 𝑀_𝑖 and 𝑧′ = (1/𝐡) βˆ‘_{π‘–βˆˆ[𝐡]} 𝑀′_𝑖 be their averages. Define π‘˜ = Θ(𝜏^{-2} log(1/(πœ–πœπœŒπ›Ώ))) and 𝛽 = Θ(𝜏/log(1/(πœ–πœπœŒπ›Ώ))). If 𝐴 ∈ R^{π‘˜Γ—π‘‘} is a JL-matrix (cf. Appendix B.3) and 𝑏 = AKround(𝐴𝑧, 𝛽), 𝑏′ = AKround(𝐴𝑧′, 𝛽) (cf. Section 2), then 𝑏 = 𝑏′ with probability at least 1 βˆ’ 𝜌 over the draw of the 𝑀_𝑖's, 𝑀′_𝑖's, 𝐴, and AKround.

Note that the 𝑀_𝑖's are i.i.d. unit vectors across all batches and both executions, since the samples in the 2𝐡 batches of the two executions are drawn from the same distribution π’Ÿ and the output of the SVM algorithm depends only on its input sample. Hence an application of Lemma 3.2 ensures that Algorithm 1 is indeed 𝜌-replicable.
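Before turning to the sample complexity, here is an illustrative Python rendering of the whole pipeline. It is a sketch, not the authors' implementation: scikit-learn's LinearSVC with a large regularization constant stands in for the hard-margin SVM step, the batch size and the parameters B, k, beta are placeholders rather than the constants 𝐢_1, . . . , 𝐢_4 dictated by the analysis, and the shared seed models the shared random string.

```python
import numpy as np
from sklearn.svm import LinearSVC

def ak_round(z, beta, rng):
    # Alon-Klartag rounding (same scheme as in the Section 2 sketch): unbiased
    # rounding of each coordinate onto a randomly offset grid of side length beta.
    o = rng.uniform(0.0, beta, size=z.shape)
    u = rng.uniform(0.0, 1.0, size=z.shape)
    lower = o + np.floor((z - o) / beta) * beta
    return np.where(u <= (lower + beta - z) / beta, lower, lower + beta)

def replicable_halfspace(sample, B, k, beta, batch_size, shared_seed):
    """Illustrative rendering of Algorithm 1: per-batch SVM, averaging,
    JL projection, and replicable rounding. `sample(n)` must return (X, y)
    with nonzero rows of X and labels y in {-1, +1}."""
    normals = []
    for _ in range(B):
        X, y = sample(batch_size)
        X = X / np.linalg.norm(X, axis=1, keepdims=True)
        # Large-C linear SVM as a stand-in for the hard-margin step SVM(S_i, tau/2).
        w = LinearSVC(C=1e4, loss="hinge", fit_intercept=False,
                      max_iter=100_000).fit(X, y).coef_.ravel()
        normals.append(w / np.linalg.norm(w))
    z = np.mean(normals, axis=0)                      # aggregate normal vector

    shared = np.random.default_rng(shared_seed)       # shared randomness across executions
    A = shared.normal(0.0, 1.0 / np.sqrt(k), size=(k, z.shape[0]))  # JL matrix, A_ij ~ N(0, 1/k)
    b = ak_round(A @ z, beta, shared)                 # replicable rounding in k dimensions
    w_hat = A.T @ b
    return w_hat / np.linalg.norm(w_hat)

if __name__ == "__main__":
    data_rng = np.random.default_rng(0)
    d, tau = 50, 0.2
    w_star = np.ones(d) / np.sqrt(d)
    def sample(n):
        X = data_rng.normal(size=(n, d))
        X = X / np.linalg.norm(X, axis=1, keepdims=True)
        X = X[np.abs(X @ w_star) >= tau]              # enforce a tau-margin distribution
        return X, np.sign(X @ w_star)
    w_hat = replicable_halfspace(sample, B=20, k=60, beta=0.05, batch_size=400, shared_seed=7)
    X_test, y_test = sample(20_000)
    print("test error:", np.mean(np.sign(X_test @ w_hat) != y_test))
```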
Sample Complexity & Running Time of Algorithm 1. The sample complexity is 𝑛𝐡 = 𝑂(πœ–^{-1} 𝜏^{-7} 𝜌^{-2} log(1/𝛿)). By inspection, we see that the total running time of Algorithm 1 is poly(𝑑, 𝑛) = poly(𝑑, 1/πœ–, 1/𝜏, 1/𝜌, log(1/𝛿)).

4. Replicably Learning Large-Margin Halfspaces: Algorithm 2

Let 𝐡_1^𝑑 denote the unit β„“2-ball in 𝑑 dimensions. Our approach is inspired by the work of LΓͺ Nguyen et al. (2020), which designed a similar SGD approach for learning large-margin halfspaces under differential privacy constraints. Consider the following surrogate loss β„Ž: 𝐡_1^𝑑 Γ— 𝐡_1^𝑑 Γ— {βˆ’1, +1} β†’ R_{β‰₯0},

β„Ž(𝑀; π‘₯, 𝑦) := max(0, 2 βˆ’ (2/𝜏) Β· 𝑦(π‘₯^βŠ€π‘€)) β‰₯ 1{𝑦(π‘₯^βŠ€π‘€) < 𝜏/2}.

We remark that β„Ž(𝑀; π‘₯, 𝑦) β‰₯ 1 when 𝑦(π‘₯^βŠ€π‘€) ≀ 𝜏/2 and β„Ž(𝑀; π‘₯, 𝑦) = 0 when 𝑦(π‘₯^βŠ€π‘€) β‰₯ 𝜏. Also, since π‘₯, 𝑀 ∈ 𝐡_1^𝑑, an application of the Cauchy-Schwarz inequality reveals that β„Ž(𝑀; π‘₯, 𝑦) ∈ [0, 2 + 2/𝜏]. Finally, β„Ž is piecewise linear with each piece being 2/𝜏-Lipschitz. Hence, β„Ž is 𝑂(1/𝜏)-Lipschitz.

Description of Algorithm 2. Fix πœ– ∈ (0, 1). We seek to minimize the following loss function over the ball 𝐡_1^𝑑:

𝑓_π’Ÿ(𝑀) := E_{(π‘₯,𝑦)βˆΌπ’Ÿ}[β„Ž(𝑀; π‘₯, 𝑦)] + (πœ–/10) βˆ₯𝑀βˆ₯^2.

By construction, the co-domain of 𝑓_π’Ÿ lies within [0, 2 + 2/𝜏 + πœ–/10] βŠ† [0, 𝑂(1/𝜏)]. Note also that 𝑓_π’Ÿ is an upper bound on the 𝜏/2-margin population loss, i.e., Pr_{(π‘₯,𝑦)βˆΌπ’Ÿ}[𝑦(π‘₯^βŠ€π‘€) < 𝜏/2] ≀ 𝑓_π’Ÿ(𝑀). First, regarding the minima of 𝑓_π’Ÿ, note that any vector 𝑀 ∈ 𝐡_1^𝑑 achieving a margin of 𝜏 satisfies 𝑓_π’Ÿ(𝑀) ≀ πœ–/10. This is because β„Ž(𝑀; π‘₯, 𝑦) = 0 for all (π‘₯, 𝑦) ∈ supp(π’Ÿ). As a result, min_{𝑀 ∈ 𝐡_1^𝑑} 𝑓_π’Ÿ(𝑀) ≀ πœ–/10. Second, let us consider an πœ–/10-optimal solution 𝑀′ with respect to 𝑓_π’Ÿ, i.e., 𝑓_π’Ÿ(𝑀′) ≀ min_{𝑀 ∈ 𝐡_1^𝑑} 𝑓_π’Ÿ(𝑀) + πœ–/10. The above discussion implies that 𝑓_π’Ÿ(𝑀′) ≀ πœ–/5, so 𝑀′ is a 𝜏/2-margin classifier for all but an πœ–/5-fraction of the population, i.e., 𝑀′ satisfies Pr_{(π‘₯,𝑦)βˆΌπ’Ÿ}[𝑦(π‘₯^βŠ€π‘€β€²) < 𝜏/2] ≀ πœ–/5. Note that we may assume without loss of generality that the marginal of π’Ÿ over features is supported on 𝐡_1^𝑑, since we normalize the input π‘₯ ← π‘₯/βˆ₯π‘₯βˆ₯ before applying a classifier 𝑀. Since 𝑓_π’Ÿ is 𝑂(πœ– + 1/𝜏) = 𝑂(1/𝜏)-Lipschitz and Ξ©(πœ–)-strongly convex, we can apply the following standard result:

Theorem 4.1 (Theorem 6.2 in Bubeck et al. (2015)). Let 𝑓 be πœ‡-strongly convex with minimizer 𝑀* and assume that the (sub)gradient oracle 𝑔(𝑀) satisfies E[βˆ₯𝑔(𝑀)βˆ₯^2] ≀ 𝐺^2. Then after 𝑇 iterations, projected stochastic gradient descent SGD(π’Ÿ, 𝑓, 𝑇) with step size πœ‚_𝑑 = 2/(πœ‡(𝑑 + 1)) satisfies

E[𝑓(βˆ‘_{𝑑=1}^𝑇 (2𝑑/(𝑇(𝑇 + 1))) 𝑀_𝑑)] βˆ’ 𝑓(𝑀*) ≀ 2𝐺^2/(πœ‡(𝑇 + 1)).

Since 𝐺^2 = 𝑂(1/𝜏^2), πœ‡ = Ξ©(πœ–), and 𝑓(𝑀*) ≀ πœ–/10, choosing 𝑇 β‰₯ Ξ©(πœ–^{-2} 𝜏^{-2}) yields an πœ–/10-optimal solution in expectation. Repeating this process independently a small number of times and outputting the run with the lowest objective yields an πœ–/5-optimal solution with high probability.

Lemma 4.2. Let 𝐡_1^𝑑 be the unit β„“2-ball in 𝑑 dimensions. Fix πœ–, 𝜏, 𝛿 ∈ (0, 1) and let π’Ÿ be a distribution over 𝐡_1^𝑑 Γ— {Β±1} that admits a linear 𝜏-margin classifier. There is an algorithm boostSGD(π’Ÿ, πœ–, 𝜏, 𝛿) that outputs a unit vector 𝑀̃ ∈ R^𝑑 such that 𝑓_π’Ÿ(𝑀̃) ≀ min_{𝑀 ∈ 𝐡_1^𝑑} 𝑓_π’Ÿ(𝑀) + πœ– with probability at least 1 βˆ’ 𝛿. Moreover, boostSGD has sample complexity 𝑂(πœ–^{-2} 𝜏^{-2} log(1/𝛿)) and running time poly(1/πœ–, 1/𝜏, log(1/𝛿), 𝑑).
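As a concrete illustration of the SGD step behind Lemma 4.2, the following numpy sketch runs projected SGD on the regularized surrogate with the step size πœ‚_𝑑 = 2/(πœ‡(𝑑 + 1)) of Theorem 4.1 and keeps the weighted average iterate. The repeat-and-pick-the-best wrapper is our simplified reading of the boosting step, it operates on a fixed sample as a stand-in for fresh draws from π’Ÿ, and the constant in the choice of 𝑇 is illustrative.

```python
import numpy as np

def project_to_ball(w):
    nrm = np.linalg.norm(w)
    return w if nrm <= 1.0 else w / nrm

def surrogate_objective(w, X, y, tau, eps):
    # f(w) = mean of max(0, 2 - (2/tau) * y <x, w>) plus the (eps/10)||w||^2 regularizer.
    margins = y * (X @ w)
    return np.mean(np.maximum(0.0, 2.0 - 2.0 * margins / tau)) + (eps / 10.0) * (w @ w)

def sgd_on_surrogate(X, y, tau, eps, T, rng):
    """One projected-SGD run; returns the weighted average iterate of Theorem 4.1."""
    mu = eps / 5.0                                   # strong convexity from (eps/10)||w||^2
    w = np.zeros(X.shape[1])
    w_avg = np.zeros(X.shape[1])
    for t in range(1, T + 1):
        i = rng.integers(len(y))                     # stochastic (sub)gradient from one sample
        grad = (eps / 5.0) * w
        if 2.0 - 2.0 * (y[i] * (X[i] @ w)) / tau > 0.0:
            grad = grad - (2.0 / tau) * y[i] * X[i]
        w = project_to_ball(w - (2.0 / (mu * (t + 1))) * grad)
        w_avg += (2.0 * t / (T * (T + 1))) * w       # weights 2t / (T(T+1))
    return w_avg

def boost_sgd(X, y, tau, eps, repeats, seed=0):
    """Repeat the SGD run and keep the candidate with the smallest empirical objective."""
    rng = np.random.default_rng(seed)
    T = max(1, int(10.0 / (eps ** 2 * tau ** 2)))    # T = Omega(eps^-2 tau^-2)
    cands = [sgd_on_surrogate(X, y, tau, eps, T, rng) for _ in range(repeats)]
    best = min(cands, key=lambda w: surrogate_objective(w, X, y, tau, eps))
    return best / max(np.linalg.norm(best), 1e-12)   # normalize to a unit vector
```

The Ξ©(πœ–) strong convexity injected by the regularizer is what gives the 1/(πœ‡π‘‡) rate in Theorem 4.1 and hence the πœ–^{-2}𝜏^{-2} iteration count.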
The proof of Lemma 4.2 is deferred to Appendix E. Next, we repeat boost SGD and take an average to ensure concentration before proceeding as in Algorithm 1 with the random projection and rounding in the lower dimensional space. Compared to Algorithm 1, we obtain an improved sample dependence on 𝜏for Algorithm 2 since taking an average of πœ–-optimal solutions to a convex objective function yields an πœ–-optimal solution. However, we pay an extra factor of πœ–in order to run SGD as a subroutine. Replicable Learning of Large-Margin Halfspaces Algorithm 2 Replicable Large-Margin Halfspaces 1: 𝑛= 𝐢1πœ– 2𝜏 2 log(1/πœ–πœπœŒπ›Ώ) 2: 𝐡 𝐢2𝜏 4𝜌 2 log(1/πœ–πœπœŒπ›Ώ) 3: π‘˜ 𝐢3𝜏 2 log(1/πœ–πœπœŒπ›Ώ) 4: 𝛽 𝐢4𝜏/ log(1/πœ–πœπœŒπ›Ώ) 5: for 𝑖 1, . . . , 𝐡do 6: 𝑆𝑖 𝑛samples from π’Ÿ 7: 𝑆𝑖 {(π‘₯/ π‘₯ , 𝑦) : (π‘₯, 𝑦) 𝑆𝑖} 8: 𝑀𝑖 boost SGD(𝑆𝑖, πœ–/10, 𝜏, 𝛿/𝐡) (cf. Lemma 4.2) 9: end for 10: 𝑧 (1/𝐡) 𝑖 [𝐡] 𝑀𝑖 11: Draw 𝐴 Rπ‘˜ 𝑑with 𝐴𝑖,𝑗 𝒩(0, 1/π‘˜) (shared randomness) 12: 𝑏 AKround(𝐴𝑧, 𝛽) (shared randomness, cf. Section 2) 13: return ˆ𝑀= 𝐴 𝑏/ 𝐴 𝑏 The technical details of Algorithm 2 (the proof of Theorem 1.4) appear in Appendix E.1. 5. Replicably Learning Large-Margin Halfspaces: Algorithm 3 In this section, we provide an algorithm whose sample complexity scales as 𝑂(πœ– 1𝜏 4𝜌 2 log(1/𝛿)) and has running time poly(𝑑) (poly(1/πœ–, 1/𝜌, 1/𝜏, 1/𝛿))1/𝜏2. We remark that the sample complexity is better than the one obtained from the DP transformation in Proposition 1.5. Moreover, the running time is exponentially better than that obtained through the DP transformation. Before this, we state a very useful result due to Bun et al. (2023) related to the sample complexity of replicably learning finite hypothesis classes, which is used in the proof of Theorem 1.6. Proposition 5.1 (Theorem 5.13 in Bun et al. (2023)). Consider a finite concept class β„‹. There is a 𝜌replicable agnostic PAC learner r Learner Finite with accuracy πœ–and confidence 𝛿for β„‹with sample complexity 𝑛= 𝑂 ( log2 |𝐻|+log(1/πœŒπ›Ώ) πœ–2𝜌2 log3(1/𝜌) ) . Moreover, if π’Ÿis realizable then the sample complexity drops to 𝑂(πœ– 1𝜌 2 log2 |𝐻|). Finally, the algorithm terminates in time poly(|β„‹|, 𝑛). Description of Algorithm 3 Let us first provide the algorithm s pseudo-code. Algorithm 3 Replicable Large-Margin Halfspaces 1: π‘˜ 𝐢𝜏 2 log(1/πœ–πœπœŒπ›Ώ) 2: 𝑛 𝐢 πœ– 1𝜏 4𝜌 2 log(1/πœ–πœŒπœπ›Ώ) 3: 𝑆 batch of 𝑛i.i.d. samples from π’Ÿ 4: Draw 𝐴 Rπ‘˜ 𝑑with 𝐴𝑖,𝑗 𝒩(0, 1/π‘˜) (shared randomness) 5: 𝑆𝐴 (𝐴π‘₯1, 𝑦1), . . . , (𝐴π‘₯𝑛, 𝑦𝑛) 6: β„‹πœ a (𝜏/20)-net over vectors of length at most 1 in Rπ‘˜ 7: 𝑏 output of r Learner Finite from Proposition 5.1 with input 𝑆𝐴, β„‹πœ (shared randomness) 8: ˆ𝑀 𝐴 𝑏/ 𝐴 𝑏 Similar to the other algorithm, we first start by using a JL matrix to project the training set to a π‘˜-dimensional space, for π‘˜= Θ(𝜏 2). Then, we use a (𝜏/20)-net to cover all the unit vectors of this π‘˜-dimensional space, so the size of the net is (𝐢 /𝜏) 𝑂(𝜏 2), for some absolute constant 𝐢 > 0. We think of these points of the net as our hypothesis class β„‹πœ. By the properties of the JL transform, we can show that with high probability 1 𝛿/10, there exists a vector in this class that classifies the entire training set correctly. Moreover, we can show that this classifier has small generalization error. This is formalized in the following result, which is essentially a high-probability version of Lemma B.6 from LΓͺ Nguyen et al. (2020). Lemma 5.2. Fix πœ– , 𝛿𝐽𝐿 (0, 1). 
Let π’Ÿsatisfy Definition 1.2 with margin 𝜏and suppose 𝑀 satisfies 𝑦(𝑀 ) π‘₯ 𝜏for every (π‘₯, 𝑦) supp(π’Ÿ). For a JLmatrix 𝐴 Rπ‘˜ 𝑑, as stated in Lemma B.3 with π‘˜= Ω(𝜏 2 log(1/𝛿𝐽𝐿)), let 𝐺𝐴 R𝑑 { 1, 1} be the set of points (π‘₯, 𝑦) of supp(π’Ÿ) that satisfy 𝐴π‘₯ 2 π‘₯ 2 𝜏 π‘₯ 2/100, and, 𝑦(𝐴𝑀 / 𝐴𝑀 ) (𝐴π‘₯/ 𝐴π‘₯ ) 96𝜏/100. Let 𝐸1 be the event (over 𝐴) that Pr(π‘₯,𝑦) π’Ÿ[(π‘₯, 𝑦) 𝐺𝐴] 1 πœ– and 𝐸2 be the event (over 𝐴) that 𝐴𝑀 2 𝑀 2 𝜏 𝑀 2/100. Then it holds that Pr𝐴[𝐸1 𝐸2] 1 𝛿𝐽𝐿/πœ– . The proof of Lemma 5.2 appears in Appendix B.3.1. One way to view Lemma 5.2 is that, with probability 1 𝛿𝐽𝐿/πœ– over the random choice of 𝐴, the classifier 𝐴𝑀 / 𝐴𝑀 will have 96𝜏/100 margin on a 1 πœ– fraction of π’Ÿ, where the choice of πœ– will be specified later according to Lemma 5.2. Let us condition on this event for the rest of the proof, which we call πΈπ‘Ÿ. Let 𝑀 be the point of the net that 𝐴𝑀 / 𝐴𝑀 is rounded to. Recall that we round to the closest point on a 𝜏/20-net of the unit ball with respect to the β„“2-norm, so that 𝑀 is 𝜏/20 close to the normalized Replicable Learning of Large-Margin Halfspaces version of 𝐴𝑀 . Notice that under the event πΈπ‘Ÿ, for all points (π‘₯, 𝑦) 𝐺𝐴we have that 𝐴𝑀 𝐴π‘₯ 𝐴π‘₯ ( 𝐴𝑀 𝐴𝑀 𝑀 ) 𝐴π‘₯ 𝐴π‘₯ 𝐴𝑀 𝐴π‘₯ 𝐴π‘₯ 𝐴𝑀 96𝜏/100 𝜏/20 > 9/(10𝜏) , where the first inequality follows from Cauchy-Schwartz and the second inequality from the definition of the net. Since the hypothesis class β„‹πœhas finite size, we can use Proposition 5.1 from Bun et al. (2023) which states that 𝑂(πœ– 1𝜌 2 log(1/𝛿) log2 |β„‹|) samples are sufficient to 𝜌-replicably learn a hypothesis class β„‹in the realizable setting with error at most πœ–. One technical complication we need to handle is that π’Ÿis not necessarily realizable with respect to β„‹πœ. Nevertheless, under πΈπ‘Ÿ, we have shown that for 𝑀 β„‹πœit holds that Pr(π‘₯,𝑦) π’Ÿ[𝑦(( 𝑀 ) 𝐴π‘₯/ 𝐴π‘₯ ) < 9𝜏/10] πœ– . Let us denote by π’Ÿπ‘Ÿthe distribution π’Ÿconditioned on the event that 𝑦(( 𝑀 ) 𝐴π‘₯/ 𝐴π‘₯ ) 9𝜏/10 and π’Ÿπ‘its complement, i.e., π’Ÿconditioned on 𝑦(( 𝑀 ) 𝐴π‘₯/ 𝐴π‘₯ ) < 9𝜏/10. Then, we can express π’Ÿ= (1 𝑝𝑏) π’Ÿπ‘Ÿ+ 𝑝𝑏 π’Ÿπ‘, where 𝑝𝑏 πœ– . Hence, in a sample of size 𝑛from π’Ÿ, with probability at least 1 𝑛 πœ– we only see samples drawn i.i.d. from π’Ÿπ‘Ÿ. Let us call this event 𝐸 π‘Ÿand condition on it. Let us choose πœ– = 𝛿/(10𝑛), 𝛿𝐽𝐿= πœ– 2/10 = 𝛿2/1000𝑛so that Pr[πΈπ‘Ÿ 𝐸 π‘Ÿ] 1 𝛿. Under the events πΈπ‘Ÿ, 𝐸 π‘Ÿ, we can use the replicable learner from Bun et al. (2023) to learn the realizable distribution π’Ÿπ‘Ÿ. Run the algorithm from Proposition 5.1 with parameters 𝛿/10, πœ–/10, 𝜌/10.3 Since |β„‹| = (𝐢/𝜏)𝜏 2 for some absolute constant 𝐢, we see that we need 𝑛= 𝑂 ( πœ– 1𝜏 4𝜌 2 log(1/𝛿) ) samples. Replicability of Algorithm 3. Let us condition on the events πΈπ‘Ÿ, 𝐸 π‘Ÿacross two executions of the algorithm. Since, the matrix 𝐴is the same, due to shared randomness, under these events the samples that the learner of Bun et al. (2023) receives are i.i.d. from the same realizable distribution. Then, the replicability guarantee follows immediately from the guarantees of Proposition 5.1 and the fact that all the events we have conditioned on occur with probability at least 1 𝜌/2. Correctness of Algorithm 3. Here we examine the population error of the output of Proposition 5.1. Note that this 3We assume that 𝛿< 𝜌, πœ–, otherwise we normalize 𝛿to get the bound. 
error is analyzed with respect to the original distribution π’Ÿ, which is not necessarily realizable, rather than π’Ÿπ‘Ÿ, which is the distribution on which we run the learning algorithm. Let us again condition on the events πΈπ‘Ÿ, 𝐸 π‘Ÿ, and the event that the output of Proposition 5.1 satisfies the generalization bound it states. We assume without loss of generality that 𝛿 πœ–/2, otherwise we set 𝛿= πœ–/2 without affecting the overall sample complexity of our algorithm. Notice that these events occur with probability at least 1 𝛿. Let 𝑏be the output of the algorithm. Then we have Pr (π‘₯,𝑦) π’Ÿ[𝑦(𝑏 π‘₯) < 0] = (1 𝑝𝑏) Pr (π‘₯,𝑦) π’Ÿπ‘Ÿ[𝑦(𝑏 π‘₯) < 0] + 𝑝𝑏 Pr (π‘₯,𝑦) π’Ÿπ‘[𝑦(𝑏 π‘₯) < 0] (1 𝑝𝑏)πœ–/10 + 𝑝𝑏< πœ–. This concludes the proof. Sample Complexity & Running Time of Algorithm 3. As noted before, we need 𝑛= 𝑂 ( πœ– 1𝜏 4𝜌 2 log(1/𝛿) ) samples. Since we apply Proposition 5.1, we incur a running time of poly(𝑑) poly(1/πœ–, 1/𝜌, 1/𝜏, 1/𝛿)1/𝜏2. 6. Conclusion In this work, we have developed new algorithms for replicably learning large-margin halfspaces. Our results vastly improve upon prior work on this problem. We believe that many immediate questions for future research arise from our work. First, it is natural to ask whether there are efficient algorithms that can achieve the 𝑂(πœ– 1𝜏 4𝜌 2 log(1/𝛿)) sample complexity bound of Algorithm 3. Also, it would be interesting to see if there are any (not necessarily efficient) replicable algorithms whose sample complexity scales as 𝑂(πœ– 1𝜏 2𝜌 2) or if there is some inherent barrier to pushing the dependence on 𝜏below 𝜏 4. Finally, our analysis of Algorithm 1 is pessimistic in the sense that it uses a pigeonhole principle argument to establish that the fraction of the population where the aggregate vector does not have margin Ω(𝜏) is 𝑂(1/𝜏3𝑛). It would be interesting to see whether this bound can be improved to 𝑂(1/𝜏2𝑛) using a different analysis, which would reduce the overall dependence of the algorithm on 𝜏. Acknowledgements Kasper Green Larsen is supported by Independent Research Fund Denmark (DFF) Sapere Aude Research Leader grant No 9064-00068B. Amin Karbasi acknowledges funding in direct support of this work from NSF (IIS-1845032), ONR (N0001419-1-2406), and the AI Institute for Learning-Enabled Optimization at Scale (TILOS). Grigoris Velegkas is supported by TILOS, the Onas- Replicable Learning of Large-Margin Halfspaces sis Foundation, and the Bodossaki Foundation. Felix Zhou is supported by TILOS. Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here. Dimitris Achlioptas. Database-friendly random projections. In Proceedings of the twentieth ACM SIGMODSIGACT-SIGART symposium on Principles of database systems, pages 274 281, 2001. 14 Noga Alon and Bo az Klartag. Optimal compression of approximate inner products and dimension reduction. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 639 650. IEEE, 2017. 3, 5 Monya Baker. 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 2016. 1 Maria-Florina Balcan and Vitaly Feldman. Statistical active learning algorithms for noise tolerance and differential privacy. Algorithmica, 72(1):282 315, 2015. doi: 10.1007/S00453-014-9954-9. URL https://doi. org/10.1007/s00453-014-9954-9. 2 Philip Ball. Is ai leading to a reproducibility crisis in science? 
Nature, 624(7990):22 25, 2023. 1 Raef Bassily, Mehryar Mohri, and Ananda Theertha Suresh. Differentially private learning with margin guarantees. Advances in Neural Information Processing Systems, 35:32127 32141, 2022a. 5 Raef Bassily, Mehryar Mohri, and Ananda Theertha Suresh. Open problem: Better differentially private learning algorithms with margin guarantees. In Conference on Learning Theory, pages 5638 5643. PMLR, 2022b. 5 Amos Beimel, Shay Moran, Kobbi Nissim, and Uri Stemmer. Private center points and learning of halfspaces. In Conference on Learning Theory, pages 269 282. PMLR, 2019. 5 Avrim Blum, Adam Kalai, and Hal Wasserman. Noisetolerant learning, the parity problem, and the statistical query model. Journal of the ACM (JACM), 50(4):506 519, 2003. 5 Avrim Blum, Cynthia Dwork, Frank Mc Sherry, and Kobbi Nissim. Practical privacy: the sulq framework. In Proceedings of the twenty-fourth ACM SIGMOD-SIGACTSIGART symposium on Principles of database systems, pages 128 138, 2005. 5 SΓ©bastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231 357, 2015. 7 Mark Bun, Marco Leandro Carmosino, and Jessica Sorrell. Efficient, noise-tolerant, and private learning via boosting. In Conference on Learning Theory, pages 1031 1077. PMLR, 2020. 5 Mark Bun, Marco Gaboardi, Max Hopkins, Russell Impagliazzo, Rex Lei, Toniann Pitassi, Satchit Sivakumar, and Jessica Sorrell. Stability is stable: Connections between replicability, privacy, and adaptive generalization. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, pages 520 527, 2023. 1, 2, 4, 5, 8, 9, 18 Zachary Chase, Bogdan Chornomaz, Shay Moran, and Amir Yehudayoff. Local borsuk-ulam, stability, and replicability. ar Xiv preprint ar Xiv:2311.01599, 2023a. 1, 5 Zachary Chase, Shay Moran, and Amir Yehudayoff. Stability and replicability in learning. In 64th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2023, Santa Cruz, CA, USA, November 6-9, 2023, pages 2430 2439. IEEE, 2023b. doi: 10.1109/FOCS57990. 2023.00148. URL https://doi.org/10.1109/ FOCS57990.2023.00148. 1, 5 Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20:273 297, 1995. 1, 3 Ilias Diakonikolas, Daniel Kane, and Pasin Manurangsi. Nearly tight bounds for robust proper learning of halfspaces with a margin. Advances in Neural Information Processing Systems, 32, 2019. 5 Ilias Diakonikolas, Jelena Diakonikolas, Daniel M Kane, Puqian Wang, and Nikos Zarifis. Informationcomputation tradeoffs for learning margin halfspaces with random classification noise. In The Thirty Sixth Annual Conference on Learning Theory, pages 2211 2239. PMLR, 2023. 5 Peter Dixon, Aduri Pavan, Jason Vander Woude, and N. V. Vinodchandran. List and certificate complexities in replicable learning. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Replicable Learning of Large-Margin Halfspaces Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, Neur IPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. 1, 5 John Dunagan and Santosh S. Vempala. A simple polynomial-time rescaling algorithm for solving linear programs. In LΓ‘szlΓ³ Babai, editor, Proceedings of the 36th Annual ACM Symposium on Theory of Computing, Chicago, IL, USA, June 13-16, 2004, pages 315 320. ACM, 2004. doi: 10.1145/ 1007352.1007404. URL https://doi.org/10. 1145/1007352.1007404. 
2 Eric Eaton, Marcel Hussing, Michael Kearns, and Jessica Sorrell. Replicable reinforcement learning. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, Neur IPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. 1, 5 Hossein Esfandiari, Alkis Kalavasis, Amin Karbasi, Andreas Krause, Vahab Mirrokni, and Grigoris Velegkas. Replicable bandits. In The Eleventh International Conference on Learning Representations, 2023a. 1, 5 Hossein Esfandiari, Amin Karbasi, Vahab Mirrokni, Grigoris Velegkas, and Felix Zhou. Replicable clustering. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, Neur IPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023b. 1, 5 Dimitris Fotakis, Alkis Kalavasis, Vasilis Kontonis, and Christos Tzamos. Efficient algorithms for learning from coarse labels. In Conference on Learning Theory, pages 2060 2079. PMLR, 2021. 5 Casper Benjamin Freksen. An introduction to johnson-lindenstrauss transforms. ar Xiv preprint ar Xiv:2103.00564, 2021. 14 Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55 (1):119 139, 1997. 1, 5 Yoav Freund and Robert E Schapire. Large margin classification using the perceptron algorithm. In Proceedings of the eleventh annual conference on Computational learning theory, pages 209 217, 1998. 5 Badih Ghazi, Ravi Kumar, and Pasin Manurangsi. Userlevel differentially private learning via correlated sampling. In Marc Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, Neur IPS 2021, December 6-14, 2021, virtual, pages 20172 20184, 2021. 1 Surbhi Goel, Aravind Gollakota, and Adam Klivans. Statistical-query lower bounds via functional gradients. Advances in Neural Information Processing Systems, 33: 2147 2158, 2020. 5 Allan GrΓΈnlund, Lior Kamma, and Kasper Green Larsen. Near-tight margin-based generalization bounds for support vector machines. In International Conference on Machine Learning, pages 3779 3788. PMLR, 2020. 3, 5, 6, 13 Anupam Gupta, Moritz Hardt, Aaron Roth, and Jonathan Ullman. Privately releasing conjunctions and the statistical query barrier. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pages 803 812, 2011. 5 Russell Impagliazzo, Rex Lei, Toniann Pitassi, and Jessica Sorrell. Reproducibility in learning. In Stefano Leonardi and Anupam Gupta, editors, STOC 22: 54th Annual ACM SIGACT Symposium on Theory of Computing, Rome, Italy, June 20 - 24, 2022, pages 818 831. ACM, 2022. doi: 10.1145/3519935.3519973. URL https: //doi.org/10.1145/3519935.3519973. 1, 2, 3, 4, 5 Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pages 604 613, 1998. 14 William B Johnson. Extensions of lipshitz mapping into hilbert space. In Conference modern analysis and probability, 1984, pages 189 206, 1984. 14 Alkis Kalavasis, Amin Karbasi, Shay Moran, and Grigoris Velegkas. 
Statistical indistinguishability of learning algorithms. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 15586 15622. PMLR, 2023. URL https://proceedings.mlr. press/v202/kalavasis23a.html. 1, 2, 4, 5 Replicable Learning of Large-Margin Halfspaces Alkis Kalavasis, Amin Karbasi, Grigoris Velegkas, and Felix Zhou. On the computational landscape of replicable learning. ar Xiv preprint ar Xiv:2405.15599, 2024. 5 Haim Kaplan, Yishay Mansour, Uri Stemmer, and Eliad Tsfadia. Private learning of halfspaces: Simplifying the construction and reducing the sample complexity. Advances in Neural Information Processing Systems, 33: 13976 13985, 2020. 5 Amin Karbasi, Grigoris Velegkas, Lin Yang, and Felix Zhou. Replicability in reinforcement learning. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, Neur IPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. 1, 5 Guy Kindler, Anup Rao, Ryan O Donnell, and Avi Wigderson. Spherical cubes: optimal foams from computational hardness amplification. Communications of the ACM, 55 (10):90 97, 2012. 2 Jonas Moritz Kohler and Aurelien Lucchi. Sub-sampled cubic regularization for non-convex optimization. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1895 1904. PMLR, 06 11 Aug 2017. URL https://proceedings.mlr. press/v70/kohler17a.html. 13 Huy LΓͺ Nguyen, Jonathan Ullman, and Lydia Zakynthinou. Efficient private algorithms for learning large-margin halfspaces. In Algorithmic Learning Theory, pages 704 724. PMLR, 2020. 4, 5, 7, 8, 14, 18 Maryanthe Malliaris and Shay Moran. The unstable formula theorem revisited. ar Xiv preprint ar Xiv:2212.05050, 2022. 5 Shay Moran, Hilla Schefler, and Jonathan Shafer. The bayesian stability zoo. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, Neur IPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. 5 Kobbi Nissim, Uri Stemmer, and Salil P. Vadhan. Locating a small cluster privately. In Tova Milo and Wang-Chiew Tan, editors, Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pages 413 427. ACM, 2016. doi: 10.1145/2902251.2902296. URL https: //doi.org/10.1145/2902251.2902296. 4 Joelle Pineau, Koustuv Sinha, Genevieve Fried, Rosemary Nan Ke, and Hugo Larochelle. Iclr reproducibility challenge 2019. Re Science C, 5(2):5, 2019. 1 Frank Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review, 65(6):386, 1958. 1, 5 Leslie G Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134 1142, 1984. 2 Vladimir Vapnik. The nature of statistical learning theory. Springer science & business media, 1999. 5 Vladimir Vapnik. Estimation of dependences based on empirical data. Springer Science & Business Media, 2006. 3 Replicable Learning of Large-Margin Halfspaces A. 
Comparison of Results Table A.1. A comparison of prior work and our work. We denote by 𝑑the dimension, πœ–the accuracy, 𝜌the replicability, and 𝜏the margin of the halfspace. We omit the logarithmic factors on the sample complexity and the runtime. Replicable Algorithms for Large-Margin Halfspaces Algorithms Sample Complexity Running Time Proper [ILPS22] with foams rounding (π‘‘πœ– 3𝜏 8𝜌 2)1.01 2𝑑 poly(1/πœ–, 1/𝜌, 1/𝜏) No [ILPS22] with box rounding (𝑑3πœ– 4𝜏 10𝜌 2)1.01 poly(𝑑, 1/πœ–, 1/𝜌, 1/𝜏) No Algorithm 1 (Theorem 1.3) πœ– 1𝜏 7𝜌 2 poly(𝑑, 1/πœ–, 1/𝜌, 1/𝜏) Yes Algorithm 2 (Theorem 1.4) πœ– 2𝜏 6𝜌 2 poly(𝑑, 1/πœ–, 1/𝜌, 1/𝜏) Yes [LNUZ20] via DP-to-Replicability reduction [BGH+23] (Proposition 1.5) πœ– 2𝜏 4𝜌 2 poly(𝑑) exp ( (1/𝜏) log(1/(πœ–πœŒ)) Algorithm 3 (Theorem 1.6) πœ– 1𝜏 4𝜌 2 poly(𝑑) poly(1/πœ–, 1/𝜌, 1/𝜏)1/𝜏2 Yes B. Deferred Tools B.1. SVM guarantees The following result is a restatement of Theorem 2 from GrΓΈnlund et al. (2020) Lemma B.1 (SVM Generalization Guarantee (GrΓΈnlund et al., 2020)). Let π’Ÿbe a distribution over R𝑑 { 1, 1}. Let 𝑛 N, 𝑆= (π‘₯1, 𝑦1), . . . , (π‘₯𝑛, 𝑦𝑛) π’Ÿπ‘›, and 𝑀 R𝑑be a unit vector such that 𝑦𝑖 ( 𝑀 π‘₯𝑖 π‘₯𝑖 ) 𝜏, 𝑖 [𝑛] Then, for every 𝛿> 0 with probability at least 1 𝛿over the random draw of 𝑆, it holds that Pr (π‘₯,𝑦) π’Ÿ[𝑦 (𝑀 π‘₯/ π‘₯ ) < 𝜏/2] 𝑂 ( log 𝑛+ log(1/𝛿) We remark that even though the result of GrΓΈnlund et al. (2020) is stated for the generalization error with respect to the misclassification probability, i.e., Pr(π‘₯,𝑦) π’Ÿ[𝑦 (𝑀 π‘₯/ π‘₯ ) 0], their argument also applies to the 𝜏/2-margin loss, i.e., Pr(π‘₯,𝑦) π’Ÿ[𝑦 (𝑀 π‘₯/ π‘₯ ) < 𝜏/2], via a straightforward modification of some constants. In more detail, the only needed change is in the proof of part 1 of Claim 10 in GrΓΈnlund et al. (2020). Here the first condition says 𝑦 (π‘₯ 𝑀) 0 and could be made e.g. 𝑦 (π‘₯ 𝑀) 𝜏/4. Their result then holds with 𝜏/4 instead of 𝜏/2. B.2. Vector-Valued Bernstein Concentration Inequality We will use the following concentration inequality for the norm of random vectors. Lemma B.2 (Vector Bernstein; (Kohler and Lucchi, 2017)). Let 𝑋1, . . . , 𝑋𝐡be independent random vectors with common dimension 𝑑satisfying the following for all 𝑖 [𝐡]: (i) E[𝑋𝑖] = 0 (iii) E[ 𝑋𝑖 2] 𝜎2 Replicable Learning of Large-Margin Halfspaces 𝐡 𝐡 𝑖=1 𝑋𝑖. Then for any 𝑑 (0, 𝜎2/πœ‡), Pr[ 𝑍 𝑑] exp ( 𝑑2𝐡 B.3. Johnson-Lindenstrauss Lemma The remarkable result of Johnson (1984) states that an appropriately scaled random orthogonal projection matrix preserves the norm of a unit vector with high probability. Indyk and Motwani (1998) showed that it suffices to independently sample each entry of the matrix from the standard normal distribution. Achlioptas (2001) further simplified the construction to independently sample each entry from the Rademacher distribution. See Freksen (2021) for a detailed survey of the development. From hereonforth, we say that a random matrix 𝐴 Rπ‘˜ 𝑑is a JL-matrix if either 𝐴𝑖𝑗 𝑖.𝑖.𝑑. 𝒩(0, 1/π‘˜) or 𝐴𝑖𝑗 𝑖.𝑖.𝑑. π‘ˆ{ 1/ We first state the standard distributional formulation of JL. Lemma B.3 (Distributional JL; (Johnson, 1984; Indyk and Motwani, 1998; Achlioptas, 2001)). Fix πœ–, 𝛿𝐽𝐿 (0, 1). Let 𝐴 Rπ‘˜ 𝑑be a JL-matrix for π‘˜= Ω(πœ– 2 log(1/𝛿𝐽𝐿)). Then for any π‘₯ R𝑑, Pr 𝐴[ 𝐴π‘₯ 2 π‘₯ 2 > πœ– π‘₯ 2] 𝛿𝐽𝐿. Let 𝑇be a set of vectors. By applying Lemma B.3 to the 𝑂(|𝑇|2) vectors 𝑒 𝑣for 𝑒, 𝑣 𝑇and taking a union bound, we immediately deduce the following result. 
Let 𝑇 be a set of vectors. By applying Lemma B.3 to the 𝑂(|𝑇|^2) vectors 𝑒 - 𝑣 for 𝑒, 𝑣 ∈ 𝑇 and taking a union bound, we immediately deduce the following result.

Lemma B.4 (JL Projection; (Johnson, 1984; Indyk and Motwani, 1998; Achlioptas, 2001)). Fix πœ–, 𝛿_JL ∈ (0, 1). Consider a set 𝑇 of 𝑑-dimensional vectors and a JL-matrix 𝐴 ∈ R^{π‘˜Γ—π‘‘} for π‘˜ = Ξ©(πœ–^{-2} log(|𝑇|/𝛿_JL)). Then, Pr_𝐴[βˆƒ 𝑒, 𝑣 ∈ 𝑇 : |‖𝐴(𝑒 - 𝑣)β€–^2 - ‖𝑒 - 𝑣‖^2| > πœ– ‖𝑒 - 𝑣‖^2] ≀ 𝛿_JL.

An application of Lemma B.4 together with the polarization identity for 𝑧, π‘₯ ∈ R^𝑑, namely 4π‘§βŠ€π‘₯ = ‖𝑧 + π‘₯β€–^2 - ‖𝑧 - π‘₯β€–^2, yields the following inner product preservation guarantee.

Corollary B.5 (JL Inner Product Preservation). Fix πœ–, 𝛿_JL ∈ (0, 1). Let 𝐴 ∈ R^{π‘˜Γ—π‘‘} be a JL-matrix for π‘˜ = Ξ©(πœ–^{-2} log(1/𝛿_JL)). Then, for any π‘₯, 𝑧 ∈ R^𝑑, Pr_𝐴[|π‘§βŠ€π‘₯ - (𝐴𝑧)^⊀𝐴π‘₯| > πœ– ‖𝑧‖ β€–π‘₯β€–] ≀ 𝛿_JL.

The next lemma is another simple implication of the distributional JL.

Lemma B.6 (Lemma 5 in LΓͺ Nguyen et al. (2020)). Let π’Ÿ satisfy Definition 1.2 with margin 𝜏. For a JL-matrix 𝐴 as stated in Lemma B.3, let 𝐺_𝐴 βŠ† R^𝑑 Γ— {-1, 1} be the set of points (π‘₯, 𝑦) of the population that satisfy |‖𝐴π‘₯β€–^2 - β€–π‘₯β€–^2| ≀ 𝜏 β€–π‘₯β€–^2/100 and 𝑦 Β· (𝐴𝑀^*/‖𝐴𝑀^*β€–)^⊀(𝐴π‘₯/‖𝐴π‘₯β€–) β‰₯ 96𝜏/100. Then it holds that Pr_{𝐴,(π‘₯,𝑦)βˆΌπ’Ÿ}[(π‘₯, 𝑦) ∈ 𝐺_𝐴] β‰₯ 1 - 𝛿_JL.

B.3.1. The Proof of Lemma 5.2

We finally prove Lemma 5.2, whose statement we repeat below for convenience.

Lemma 5.2. Fix πœ–', 𝛿_JL ∈ (0, 1). Let π’Ÿ satisfy Definition 1.2 with margin 𝜏 and suppose 𝑀^* satisfies 𝑦 Β· ((𝑀^*)^⊀π‘₯/β€–π‘₯β€–) β‰₯ 𝜏 for every (π‘₯, 𝑦) ∈ supp(π’Ÿ). For a JL-matrix 𝐴 ∈ R^{π‘˜Γ—π‘‘}, as stated in Lemma B.3 with π‘˜ = Ξ©(𝜏^{-2} log(1/𝛿_JL)), let 𝐺_𝐴 βŠ† R^𝑑 Γ— {-1, 1} be the set of points (π‘₯, 𝑦) of supp(π’Ÿ) that satisfy |‖𝐴π‘₯β€–^2 - β€–π‘₯β€–^2| ≀ 𝜏 β€–π‘₯β€–^2/100, and 𝑦 Β· (𝐴𝑀^*/‖𝐴𝑀^*β€–)^⊀(𝐴π‘₯/‖𝐴π‘₯β€–) β‰₯ 96𝜏/100. Let 𝐸_1 be the event (over 𝐴) that Pr_{(π‘₯,𝑦)βˆΌπ’Ÿ}[(π‘₯, 𝑦) ∈ 𝐺_𝐴] β‰₯ 1 - πœ–' and 𝐸_2 be the event (over 𝐴) that |‖𝐴𝑀^*β€–^2 - ‖𝑀^*β€–^2| ≀ 𝜏 ‖𝑀^*β€–^2/100. Then it holds that Pr_𝐴[𝐸_1 ∩ 𝐸_2] β‰₯ 1 - 2𝛿_JL/πœ–'.

Proof of Lemma 5.2. We know from Lemma B.6 that E_𝐴[Pr_{(π‘₯,𝑦)βˆΌπ’Ÿ}[(π‘₯, 𝑦) βˆ‰ 𝐺_𝐴]] ≀ 𝛿_JL, so Markov's inequality gives that Pr_𝐴[Pr_{(π‘₯,𝑦)βˆΌπ’Ÿ}[(π‘₯, 𝑦) βˆ‰ 𝐺_𝐴] β‰₯ πœ–'] ≀ E_𝐴[Pr_{(π‘₯,𝑦)βˆΌπ’Ÿ}[(π‘₯, 𝑦) βˆ‰ 𝐺_𝐴]]/πœ–' ≀ 𝛿_JL/πœ–'. Similarly, the guarantee of the JL projection immediately yields Pr_𝐴[|‖𝐴𝑀^*β€–^2 - ‖𝑀^*β€–^2| > 𝜏 ‖𝑀^*β€–^2/100] ≀ 𝛿_JL. A union bound over the two failure events completes the proof, since 𝛿_JL ≀ 𝛿_JL/πœ–'.

C. Details for Section 2

Here we fill in the missing proofs from Section 2. First, we restate and prove Lemma 2.1.

Lemma 2.1 (Stability of Rounding). Let 𝑧, 𝑧' ∈ R^π‘˜. Then for independent uniform offsets π‘œ_1, . . . , π‘œ_π‘˜ ∈ [0, 𝛽] and thresholds 𝑒_1, . . . , 𝑒_π‘˜ ∈ [0, 1], we have Pr_{π‘œ,𝑒}[𝑓_{π‘œ,𝑒}(𝑧) β‰  𝑓_{π‘œ,𝑒}(𝑧')] ≀ 2𝛽^{-1} ‖𝑧 - 𝑧'β€–_1.

Proof of Lemma 2.1. Fix a coordinate 𝑖 ∈ [π‘˜]. The probability that π‘œ(𝑧)[𝑖] β‰  π‘œ(𝑧')[𝑖] is at most |𝑧[𝑖] - 𝑧'[𝑖]| 𝛽^{-1}. Assume π‘œ(𝑧)[𝑖] = π‘œ(𝑧')[𝑖]. Then, by the definition of 𝑝(𝑧), 𝑝(𝑧'), we have that |𝑧[𝑖] - 𝑧'[𝑖]| = |(𝑝(𝑧)[𝑖] - 𝑝(𝑧')[𝑖]) π‘œ(𝑧)[𝑖] + (𝑝(𝑧')[𝑖] - 𝑝(𝑧)[𝑖]) (π‘œ(𝑧)[𝑖] + 𝛽)| = |(𝑝(𝑧')[𝑖] - 𝑝(𝑧)[𝑖]) 𝛽|. Note then that, conditioned on this event, the probability of 𝑓_{π‘œ,𝑒}(𝑧)[𝑖] β‰  𝑓_{π‘œ,𝑒}(𝑧')[𝑖] is |𝑝(𝑧')[𝑖] - 𝑝(𝑧)[𝑖]| = |𝑧[𝑖] - 𝑧'[𝑖]| 𝛽^{-1}, by the uniform choice of 𝑒_𝑖. We thus conclude that Pr_{π‘œ,𝑒}[𝑓_{π‘œ,𝑒}(𝑧)[𝑖] β‰  𝑓_{π‘œ,𝑒}(𝑧')[𝑖]] ≀ 2𝛽^{-1} |𝑧[𝑖] - 𝑧'[𝑖]|. A union bound over all π‘˜ coordinates implies Pr_{π‘œ,𝑒}[𝑓_{π‘œ,𝑒}(𝑧) β‰  𝑓_{π‘œ,𝑒}(𝑧')] ≀ 2𝛽^{-1} ‖𝑧 - 𝑧'β€–_1.

Next, we prove Lemma 2.2, whose statement is repeated below.

Lemma 2.2 (Rounding Preserves Inner Products). Let 𝑧, π‘₯ ∈ R^π‘˜ be such that β€–π‘₯β€– ≀ 1. For uniform offsets π‘œ_1, . . . , π‘œ_π‘˜ ∈ [0, 𝛽] and thresholds 𝑒_1, . . . , 𝑒_π‘˜ ∈ [0, 1], we have Pr_{π‘œ,𝑒}[|𝑓_{π‘œ,𝑒}(𝑧)^⊀π‘₯ - π‘§βŠ€π‘₯| > 𝛼] ≀ 2 exp(-2𝛼^2 𝛽^{-2}).

Proof of Lemma 2.2. Each of the random variables 𝑓_{π‘œ,𝑒}(𝑧)[𝑖] lies in an interval of length 𝛽, is independent of the others, and has expectation 𝑧[𝑖]. By linearity, 𝑓_{π‘œ,𝑒}(𝑧)^⊀π‘₯ - π‘§βŠ€π‘₯ = (𝑓_{π‘œ,𝑒}(𝑧) - 𝑧)^⊀π‘₯. Let 𝑣 = 𝑓_{π‘œ,𝑒}(𝑧) - 𝑧. Then E[π‘₯[𝑖] 𝑣[𝑖]] = 0 and π‘₯[𝑖] 𝑣[𝑖] lies in a range of length 𝛽 |π‘₯[𝑖]|. By Hoeffding's inequality, we have Pr[|π‘£βŠ€π‘₯| > 𝛼] < 2 exp(-2𝛼^2/(Ξ£_𝑖 𝛽^2 π‘₯[𝑖]^2)) ≀ 2 exp(-2𝛼^2 𝛽^{-2}), since β€–π‘₯β€– ≀ 1.
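For intuition, the following sketch gives one plausible instantiation of the shared-randomness rounding 𝑓_{π‘œ,𝑒} analyzed above (the function name ak_round and every constant are ours, not the paper's; the exact definition of the scheme appears in Section 2). Each coordinate is snapped to a grid of spacing 𝛽 shifted by a shared offset, rounding up with probability equal to its fractional position inside the cell, as decided by a shared threshold, which makes the output unbiased coordinate-wise.

```python
import numpy as np

def ak_round(z, beta, offsets, thresholds):
    """One plausible instantiation of the shared-randomness rounding f_{o,u}
    from Section 2 (illustrative sketch; the paper's exact definition may differ).
    Coordinate i is mapped to a grid point of spacing beta shifted by offsets[i];
    it rounds up exactly when thresholds[i] falls below its fractional position."""
    z = np.asarray(z, dtype=float)
    lower = offsets + beta * np.floor((z - offsets) / beta)   # o(z)[i]: lower grid point
    frac = (z - lower) / beta                                  # p(z)[i] in [0, 1)
    return lower + beta * (thresholds < frac)                  # unbiased: E[f(z)[i]] = z[i]

# Shared randomness: both executions derive (offsets, thresholds) from the same seed.
rng = np.random.default_rng(seed=0)
k, beta = 32, 0.05
offsets = rng.uniform(0.0, beta, size=k)
thresholds = rng.uniform(0.0, 1.0, size=k)

z = rng.normal(size=k)
z_prime = z + 1e-4 * rng.normal(size=k)   # a nearby vector, e.g. from a second sample
b = ak_round(z, beta, offsets, thresholds)
b_prime = ak_round(z_prime, beta, offsets, thresholds)
print(np.array_equal(b, b_prime))  # equal with prob >= 1 - 2 * ||z - z'||_1 / beta (Lemma 2.1)
```

Under shared (offsets, thresholds), the two executions disagree on coordinate 𝑖 only when 𝑧[𝑖] and 𝑧'[𝑖] are separated by a random grid point or a random threshold, which are exactly the events bounded in the proof of Lemma 2.1.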
Let 𝑣= π‘“π‘œ,𝑒(𝑧) 𝑧. Then E[π‘₯[𝑖] 𝑣[𝑖]] = 0 and π‘₯[𝑖] 𝑣[𝑖] lies in a range of length 𝛽π‘₯[𝑖]. By Hoeffding s inequality, we have Pr[|𝑣 π‘₯| > 𝛼] < 2 exp ( 2𝛼2/( 𝑖𝛽2π‘₯[𝑖]2) ) 2 exp ( 2𝛼2𝛽 2) . Replicable Learning of Large-Margin Halfspaces D. Details for Algorithm 1 Here we fill in the missing proofs from Section 3. We begin with Lemma 3.1, whose statement we repeat below for convenience. Lemma 3.1. Fix πœ–, 𝜏, 𝛿 (0, 1) and let π’Ÿbe a distribution over R𝑑 { 1} that admits a linear classifier with 𝜏-margin. Suppose 𝑧is a random unit vector satisfying Pr(π‘₯,𝑦) π’Ÿ[𝑦(𝑧 π‘₯/ π‘₯ ) < 𝜏/2] πœ–with probability at least 1 𝛿over the draw of 𝑧. Define π‘˜= Ω(𝜏 2 log(1/πœ–π›Ώ)) and 𝛽= 𝑂(𝜏/ log(1/πœ–π›Ώ)). If 𝐴 Rπ‘˜ 𝑑is a JL-matrix (cf. Appendix B.3) and 𝑏= AKround(𝐴𝑧, 𝛽) (cf. Section 2), then ˆ𝑀= 𝐴 𝑏/ 𝐴 𝑏 satisfies Pr(π‘₯,𝑦) π’Ÿ[𝑦( ˆ𝑀 π‘₯/ π‘₯ ) 0] 2πœ–with probability at least 1 2𝛿. Proof of Lemma 3.1. Let π’Ÿπ‘§:= π’Ÿ| {𝑦(𝑧 π‘₯/ π‘₯ ) 𝜏/2} be the distribution obtained from π’Ÿby conditioning on 𝑧having large margin and let π’Ÿπ‘denote the distribution π’Ÿconditioned on the complement. We can decompose π’Ÿ= (1 𝑝𝑏) π’Ÿπ‘§+ 𝑝𝑏 π’Ÿπ‘for 𝑝𝑏 πœ–. To complete the correctness argument, we need to show that with high probability, the projection step of 𝑧, π‘₯onto the low-dimensional space 𝐴𝑧, 𝐴π‘₯and the rounding 𝑏= AKround(𝐴𝑧, 𝛽) approximately preserves the inner product 𝑦(𝑏 𝐴π‘₯/ π‘₯ ) for a 1 πœ–fraction of π’Ÿπ‘§. Then the final classifier 𝐴 𝑏still has low population error. In particular, it suffices to show that Pr(π‘₯,𝑦) π’Ÿπ‘§[𝑦(𝑏 𝐴π‘₯/ π‘₯ ) < 𝜏/4] πœ–with probability at least 1 𝛿over the random choice of 𝐴, 𝑏. Then, the total population error is at most πœ–+ 𝑝𝑏 2πœ–with probability at least 1 2𝛿. Fix (π‘₯, 𝑦) supp(π’Ÿπ‘§) and remark that = 𝑦 (𝐴𝑧) 𝐴 ( π‘₯ ) 𝑦 (𝐴π‘₯ 𝑏) 𝐴π‘₯ By the choices of π‘˜ Ω(𝜏 2 log(1/πœ–π›Ώ)), 𝛽 𝑂(𝜏/ log(1/πœ–π›Ώ)), the JL lemma (cf. Corollary B.5) and Alon-Kartag rounding scheme (cf. Lemma 2.2) ensures that [ Pr (π‘₯,𝑦) π’Ÿπ‘§[𝑦(𝑏 𝐴π‘₯/ π‘₯ ) < 𝜏/4] ] πœ–π›Ώ. An application of Markov s inequality yields [ Pr (π‘₯,𝑦) π’Ÿπ‘§[𝑦(𝑏 𝐴π‘₯/ π‘₯ ) < 𝜏/4] > πœ– ] 𝛿. All in all, with probability at least 1 𝛿over 𝐴, 𝑏, the population error is at most Pr (π‘₯,𝑦) π’Ÿ[𝑦(𝐴 𝑏) π‘₯ 0] Pr (π‘₯,𝑦) π’Ÿ[𝑦(𝑏 𝐴π‘₯/ π‘₯ ) < Θ(𝜏)] < 2πœ–, concluding the correctness argument of our algorithm. Next, we prove Lemma 3.2, whose statement we again repeat. Lemma 3.2. Fix πœ–, 𝜏, 𝜌, 𝛿 (0, 1). Suppose 𝑀1, . . . , 𝑀𝐡and 𝑀 1, . . . , 𝑀 𝐡are i.i.d. random unit vectors for 𝐡= Ω(𝜏 4𝜌 2 log(1/πœ–πœπœŒπ›Ώ)) and 𝑧= (1/𝐡) 𝑖 [𝐡] 𝑀𝑖, 𝑧 = (1/𝐡) 𝑖 [𝐡] 𝑀 𝑖are their averages. Define π‘˜= Θ(𝜏 2 log(1/πœ–πœπœŒπ›Ώ)) and 𝛽= Θ(𝜏/ log(1/πœ–πœπœŒπ›Ώ). If 𝐴 Rπ‘˜ 𝑑is a JL-matrix (cf. Appendix B.3) and 𝑏= AKround(𝐴𝑧, 𝛽), 𝑏 = AKround(𝐴𝑧 , 𝛽) (cf. Section 2), then 𝑏= 𝑏 with probability at least 1 𝜌over the draw of the 𝑀 𝑖, 𝑀𝑖 s, 𝐴, and AKround. Replicable Learning of Large-Margin Halfspaces Proof of Lemma 3.2. An application of the vector Bernstein concentration inequality (cf. Lemma B.2) yields the following: let 𝑆1 = { 𝑀(1) 𝑖 } 𝑖 [𝐡] and 𝑆2 = { 𝑀(2) 𝑖 } 𝑖 [𝐡] be two sets of independent and identically distributed random vectors in R𝑑 with 𝑀(𝑗) 𝑖 1 for all 𝑖 [𝐡], 𝑗 {1, 2}. Then we can set 𝑋𝑖= 𝑀(1) 𝑖 E 𝑀. Notice that 𝑋𝑖 2 and 𝑋𝑖 2 4. Hence, an application of Lemma B.2 yields for any 𝑑 (0, 2) 𝑀(1) 𝑖 E[𝑀] 𝑑 ] ) = exp ( Ω( 𝑑2𝐡) ) 𝑖 [𝐡] 𝑀(1) 𝑖 , 𝑧 = 1 𝐡 𝑖 [𝐡] 𝑀(2) 𝑖 . Choosing 𝐡 Ω(𝑑 2 log(1/𝜌)) ensures that Pr𝑆1,𝑆2 [ 𝑧 𝑧 𝑑] 𝜌/10. 
Next, we prove Lemma 3.2, whose statement we again repeat.

Lemma 3.2. Fix πœ–, 𝜏, 𝜌, 𝛿 ∈ (0, 1). Suppose 𝑀_1, . . . , 𝑀_𝐡 and 𝑀'_1, . . . , 𝑀'_𝐡 are i.i.d. random unit vectors for 𝐡 = Ξ©(𝜏^{-4} 𝜌^{-2} log(1/(πœ–πœπœŒπ›Ώ))) and 𝑧 = (1/𝐡) Ξ£_{π‘–βˆˆ[𝐡]} 𝑀_𝑖, 𝑧' = (1/𝐡) Ξ£_{π‘–βˆˆ[𝐡]} 𝑀'_𝑖 are their averages. Define π‘˜ = Θ(𝜏^{-2} log(1/(πœ–πœπœŒπ›Ώ))) and 𝛽 = Θ(𝜏/log(1/(πœ–πœπœŒπ›Ώ))). If 𝐴 ∈ R^{π‘˜Γ—π‘‘} is a JL-matrix (cf. Appendix B.3) and 𝑏 = AKround(𝐴𝑧, 𝛽), 𝑏' = AKround(𝐴𝑧', 𝛽) (cf. Section 2), then 𝑏 = 𝑏' with probability at least 1 - 𝜌 over the draw of the 𝑀_𝑖, 𝑀'_𝑖's, 𝐴, and AKround.

Proof of Lemma 3.2. An application of the vector Bernstein concentration inequality (cf. Lemma B.2) yields the following: let 𝑆_1 = {𝑀_𝑖^{(1)}}_{π‘–βˆˆ[𝐡]} and 𝑆_2 = {𝑀_𝑖^{(2)}}_{π‘–βˆˆ[𝐡]} be two sets of independent and identically distributed random vectors in R^𝑑 with ‖𝑀_𝑖^{(𝑗)}β€– ≀ 1 for all 𝑖 ∈ [𝐡], 𝑗 ∈ {1, 2}. Then we can set 𝑋_𝑖 = 𝑀_𝑖^{(1)} - E[𝑀]. Notice that ‖𝑋_𝑖‖ ≀ 2 and E[‖𝑋_𝑖‖^2] ≀ 4. Hence, an application of Lemma B.2 yields, for any 𝑑 ∈ (0, 2), Pr[β€–(1/𝐡) Ξ£_{π‘–βˆˆ[𝐡]} 𝑀_𝑖^{(1)} - E[𝑀]β€– β‰₯ 𝑑] ≀ exp(-Ξ©(𝑑^2 𝐡)). Write 𝑧 = (1/𝐡) Ξ£_{π‘–βˆˆ[𝐡]} 𝑀_𝑖^{(1)} and 𝑧' = (1/𝐡) Ξ£_{π‘–βˆˆ[𝐡]} 𝑀_𝑖^{(2)}. Choosing 𝐡 β‰₯ Ξ©(𝑑^{-2} log(1/𝜌)), applying the above bound to each of 𝑧, 𝑧', and using the triangle inequality ensures that Pr_{𝑆_1,𝑆_2}[‖𝑧 - 𝑧'β€– β‰₯ 𝑑] ≀ 𝜌/10.

We condition on the event that ‖𝑧 - 𝑧'β€– ≀ 𝑑. Since we are sharing the randomness across the two executions, we use the same JL projection matrix 𝐴. Thus our choice of π‘˜ β‰₯ Ξ©(𝜏^{-2} log(1/𝜌)) gives that ‖𝐴𝑧 - 𝐴𝑧'β€– ≀ (1 + 𝜏) 𝑑 ≀ 2𝑑 with probability at least 1 - 𝜌/10 (cf. Lemma B.4). It remains to show that after the rounding step, with probability at least 1 - 𝜌/10, the two rounded vectors will be the same. The size of our rounding grid is 𝛽 = Θ(𝜏) and the target dimension of our JL-matrix is π‘˜ = Θ(𝜏^{-2}), up to logarithmic factors. Thus by Lemma 2.1, the probability that the two points round to different vectors is at most Θ(𝛽^{-1}) ‖𝐴𝑧 - 𝐴𝑧'β€–_1 ≀ Θ(𝛽^{-1} βˆšπ‘˜) ‖𝐴𝑧 - 𝐴𝑧'β€–_2 = Θ(𝑑 βˆšπ‘˜/𝛽). Thus, if we pick

𝐡 β‰₯ Ξ©(π‘˜/(𝛽^2 𝜌^2)) = Ξ©(log(1/(πœ–πœπœŒπ›Ώ))/(𝜏^4 𝜌^2)), (D.1)

we complete the argument.

E. Details for Algorithm 2

Here we fill in the missing proofs from Section 4. We now prove Lemma 4.2, whose statement we repeat below for convenience.

Lemma 4.2. Let 𝐡_1^𝑑 be the unit β„“_2-ball in 𝑑 dimensions. Fix πœ–, 𝜏, 𝛿 ∈ (0, 1) and let π’Ÿ be a distribution over 𝐡_1^𝑑 Γ— {-1, 1} that admits a linear 𝜏-margin classifier. There is an algorithm boostSGD(π’Ÿ, πœ–, 𝜏, 𝛿) that outputs a unit vector 𝑀̂ ∈ R^𝑑 such that 𝑓_π’Ÿ(𝑀̂) ≀ min_{π‘€βˆˆπ΅_1^𝑑} 𝑓_π’Ÿ(𝑀) + πœ– with probability at least 1 - 𝛿. Moreover, boostSGD has sample complexity 𝑂(πœ–^{-2} 𝜏^{-2} log(1/𝛿)) and running time poly(1/πœ–, 1/𝜏, log(1/𝛿), 𝑑).

Proof of Lemma 4.2. Let 𝑇 = Ξ©(𝜏^{-2} πœ–^{-2}). Run SGD(π’Ÿ, 𝑇) for 𝑛 = 𝑂(log(1/𝛿)) times to obtain solutions 𝑀_1, . . . , 𝑀_𝑛. Theorem 4.1 ensures that we attain an πœ–/10-optimal solution in expectation for each 𝑀_𝑖, 𝑖 ∈ [𝑛]. Markov's inequality then guarantees that we attain an πœ–/5-optimal solution with probability at least 1/2 in each repetition. But then with probability at least 1 - 𝛿/10, at least one of the 𝑛 solutions is πœ–/5-optimal. By a Hoeffding bound, we can estimate each 𝑓_π’Ÿ(𝑀_𝑖) ∈ [0, 𝑂(1/𝜏)] up to an additive πœ–/10 error with probability at least 1 - 𝛿/10 using 𝑂(𝑛 πœ–^{-2} 𝜏^{-2} log(𝑛/𝛿)) samples. Outputting the classifier with the lowest estimated objective yields a 3πœ–/10-optimal solution with probability at least 1 - 𝛿/5. Normalizing the selected classifier to a unit vector can only decrease the surrogate-loss term of the objective, since the feasible region is 𝐡_1^𝑑, and incurs an additional loss of at most πœ–/10 due to the regularizer. Thus we have a 4πœ–/10-optimal solution with probability at least 1 - 𝛿/5. The total sample complexity is 𝑂(𝑛𝑇) + 𝑂(𝑛 πœ–^{-2} 𝜏^{-2} log(𝑛/𝛿)) = 𝑂(πœ–^{-2} 𝜏^{-2} log(1/𝛿)).

E.1. Proof of Theorem 1.4

Correctness of Algorithm 2. From the choice of 𝑛 β‰₯ Ξ©(πœ–^{-2} 𝜏^{-2} log(𝐡/𝛿)), Lemma 4.2 ensures that each unit vector 𝑀_𝑖 produced in Algorithm 2 is πœ–/10-optimal with probability at least 1 - 𝛿/(10𝐡). Hence with probability at least 1 - 𝛿/10, every 𝑀_𝑖 is πœ–/10-optimal with respect to 𝑓_π’Ÿ. By Jensen's inequality, 𝑧 = (1/𝐡) Ξ£_{𝑖=1}^{𝐡} 𝑀_𝑖 is πœ–/10-optimal with probability at least 1 - 𝛿/10. But then, by the choice of 𝑓_π’Ÿ as a convex surrogate loss for 1{𝑦 Β· (π‘₯βŠ€π‘€) < 𝜏/2}, 𝑧 satisfies Pr_{(π‘₯,𝑦)βˆΌπ’Ÿ}[𝑦 Β· (π‘§βŠ€π‘₯/β€–π‘₯β€–) < 𝜏/2] ≀ 2πœ–/10 with probability at least 1 - 𝛿/10. In other words, the unit vector 𝑧 has a population 𝜏/2-loss of at most 2πœ–/10 with probability at least 1 - 𝛿/10. But then, similarly to the correctness argument for Algorithm 1, an application of Lemma 3.1 concludes the proof of correctness.

Replicability of Algorithm 2. Similarly to the replicability argument for Algorithm 1, we note that the outputs 𝑀_𝑖 of the executions of boostSGD are i.i.d. unit vectors. Thus an application of Lemma 3.2 yields the replicability guarantee of Algorithm 2.

Sample Complexity & Running Time of Algorithm 2. The sample complexity is 𝑛𝐡 = 𝑂(πœ–^{-2} 𝜏^{-6} 𝜌^{-2} log(1/𝛿)) as required. Once again, we see by inspection that Algorithm 2 terminates in time poly(𝑑, 1/πœ–, 1/𝜏, 1/𝜌, log(1/𝛿)).
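For concreteness, the confidence-boosting wrapper used in the proof of Lemma 4.2 can be sketched as follows. The names sgd_run, loss_fn, and sample_oracle, as well as every constant, are placeholders of ours rather than the paper's pseudocode; the sketch only illustrates the run-many-times-and-validate pattern of the proof.

```python
import numpy as np

def boost_sgd(sample_oracle, sgd_run, loss_fn, eps, delta, tau):
    """Sketch of the confidence-boosting wrapper in Lemma 4.2 (illustrative only).
    Run SGD O(log(1/delta)) times, estimate each candidate's surrogate loss on a
    fresh holdout set, and return the normalized candidate with the smallest estimate."""
    n_runs = max(1, int(np.ceil(10 * np.log(1.0 / delta))))
    T = int(np.ceil(1.0 / (tau ** 2 * eps ** 2)))                       # SGD steps per run
    m = int(np.ceil(np.log(n_runs / delta) / (eps ** 2 * tau ** 2)))    # holdout size

    candidates = [sgd_run(sample_oracle, T) for _ in range(n_runs)]     # n_runs SGD solutions
    holdout = [sample_oracle() for _ in range(m)]                       # fresh validation samples

    def estimate(w):
        # empirical surrogate loss of w on the holdout set
        return np.mean([loss_fn(w, x, y) for (x, y) in holdout])

    best = min(candidates, key=estimate)
    return best / max(np.linalg.norm(best), 1e-12)                      # project back to the unit sphere
```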
F. Details for Proposition 1.5

F.1. Differential Privacy

For π‘Ž, 𝑏, 𝛼, 𝛽 ∈ [0, 1], let π‘Ž β‰ˆ_{𝛼,𝛽} 𝑏 denote the statement π‘Ž ≀ 𝑒^𝛼 𝑏 + 𝛽 and 𝑏 ≀ 𝑒^𝛼 π‘Ž + 𝛽. We say that two probability distributions 𝑃, 𝑄 are (𝛼, 𝛽)-indistinguishable if 𝑃(𝐸) β‰ˆ_{𝛼,𝛽} 𝑄(𝐸) for any measurable event 𝐸.⁴

Definition F.1 (Approximate Differential Privacy). A learning rule 𝐴 is 𝑛-sample (𝛼, 𝛽)-differentially private if for any pair of datasets 𝑆, 𝑆' ∈ (𝒳 Γ— {0, 1})^𝑛 that differ on a single example, the induced posterior distributions 𝐴(𝑆) and 𝐴(𝑆') are (𝛼, 𝛽)-indistinguishable.

F.2. The Results of LΓͺ Nguyen et al. (2020) and Bun et al. (2023)

Proposition 1.5 is based on combining the following two results. The first is a differentially private learner for large-margin halfspaces from LΓͺ Nguyen et al. (2020).

Proposition F.2 (Theorem 6 in LΓͺ Nguyen et al. (2020)). Let 𝛼, 𝜏, πœ–, 𝛿 > 0. Let π’Ÿ be a distribution over R^𝑑 Γ— {-1, 1} that has linear margin 𝜏 as in Definition 1.2. There is an algorithm that is (𝛼, 0)-differentially private and, given π‘š = 𝑂(𝛼^{-1} πœ–^{-1} 𝜏^{-2}) i.i.d. samples (π‘₯, 𝑦) ∼ π’Ÿ, computes in time exp(1/𝜏^2) Β· poly(𝑑, 1/𝛼, 1/πœ–, log(1/𝛿)) a normal vector 𝑀 ∈ R^𝑑 such that Pr_{(π‘₯,𝑦)βˆΌπ’Ÿ}[sgn(π‘€βŠ€π‘₯) β‰  𝑦] ≀ πœ–, with probability at least 1 - 𝛿.

We also use the DP-to-Replicability reduction appearing in Bun et al. (2023).

Proposition F.3 (Corollary 3.18 in Bun et al. (2023)). Fix 𝑛 ∈ N, sufficiently small 𝜌 ∈ (0, 1), πœ–, 𝛿 ∈ (0, 1), and 𝛼, 𝛽 > 0. Let π’œ : 𝒳^𝑛 β†’ 𝒴 be an 𝑛-sample (𝛼, 𝛽)-differentially private algorithm with finite output space solving a statistical task with accuracy πœ– and failure probability 𝛿. Then there is an algorithm π’œ' : 𝒳^π‘š β†’ 𝒴 that is 𝜌-replicable and solves the same statistical task with π‘š = 𝑂(𝜌^{-2} 𝑛^2) samples with accuracy πœ– and failure probability 𝛿.

⁴We use the notation (𝛼, 𝛽)-DP instead of the more common (πœ–, 𝛿)-DP to be consistent with the notation of the rest of the paper regarding the accuracy and probability of failure of the learning algorithms.
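To make the (𝛼, 𝛽)-indistinguishability condition above concrete, the following small helper (our illustration, not part of the reduction) brute-forces the condition for two distributions on a tiny finite output space; it is only meant to unpack Definition F.1, since real posteriors are not enumerated this way.

```python
import numpy as np
from itertools import combinations

def indistinguishable(p, q, alpha, beta):
    """Brute-force check of (alpha, beta)-indistinguishability for two probability
    vectors p, q over the same tiny finite outcome space. Illustrative only: it
    enumerates every event E and tests P(E) <= e^alpha * Q(E) + beta and vice versa."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    n = len(p)
    for r in range(n + 1):
        for event in combinations(range(n), r):
            idx = list(event)
            pe, qe = p[idx].sum(), q[idx].sum()
            if pe > np.exp(alpha) * qe + beta or qe > np.exp(alpha) * pe + beta:
                return False
    return True

# Two nearby posteriors of a learning rule on adjacent datasets (toy numbers):
print(indistinguishable([0.5, 0.3, 0.2], [0.45, 0.35, 0.2], alpha=0.2, beta=0.01))
```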