Replicable Learning of Large-Margin Halfspaces

Alkis Kalavasis*1, Amin Karbasi*1,2, Kasper Green Larsen*3, Grigoris Velegkas*1, Felix Zhou*1

*Equal contribution. 1Yale University, 2Google Research, 3Aarhus University. Correspondence to: Alkis Kalavasis, Grigoris Velegkas, Felix Zhou. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

We provide efficient replicable algorithms for the problem of learning large-margin halfspaces. Our results improve upon the algorithms provided by Impagliazzo, Lei, Pitassi, and Sorrell (STOC, 2022). We design the first dimension-independent replicable algorithms for this task which run in polynomial time, are proper, and have strictly improved sample complexity compared to the one achieved by Impagliazzo et al. (2022) with respect to all the relevant parameters. Moreover, our first algorithm has sample complexity that is optimal with respect to the accuracy parameter πœ–. We also design an SGD-based replicable algorithm that, in some parameter regimes, achieves better sample and time complexity than our first algorithm. Departing from the requirement of polynomial-time algorithms, using the DP-to-Replicability reduction of Bun, Gaboardi, Hopkins, Impagliazzo, Lei, Pitassi, Sivakumar, and Sorrell (STOC, 2023), we show how to obtain a replicable algorithm for large-margin halfspaces with improved sample complexity with respect to the margin parameter 𝜏, but with running time doubly exponential in 1/𝜏^2 and worse sample complexity dependence on πœ– than one of our previous algorithms. We then design an improved algorithm with better sample complexity than all three of our previous algorithms and running time exponential in 1/𝜏^2.

1. Introduction

The replicability crisis is omnipresent in many scientific disciplines including biology, medicine, chemistry, and, importantly, AI (Baker, 2016; Pineau et al., 2019). A recent article that appeared in Nature (Ball, 2023) explains how the reproducibility crisis in AI has a cascading effect across many other scientific areas due to its widespread applications in other fields such as medicine. Thus, an urgent goal is to design a formal framework through which we can argue about the replicability of experiments in ML. Such a theoretical framework was proposed in a recent work by Impagliazzo et al. (2022) and has been studied extensively in several learning settings (Esfandiari et al., 2023b;a; Bun et al., 2023; Kalavasis et al., 2023; Chase et al., 2023b; Dixon et al., 2023; Chase et al., 2023a; Eaton et al., 2023; Karbasi et al., 2023).

Definition 1.1 (Replicability (Impagliazzo et al., 2022)). Let β„› be a distribution over random binary strings. A learning algorithm π’œ is 𝑛-sample 𝜌-replicable if for any distribution π’Ÿ over inputs and two independent datasets 𝑆, 𝑆′ ∼ π’Ÿ^𝑛, it holds that Pr_{𝑆,𝑆′ ∼ π’Ÿ^𝑛, π‘Ÿ ∼ β„›}[π’œ(𝑆, π‘Ÿ) = π’œ(𝑆′, π‘Ÿ)] β‰₯ 1 βˆ’ 𝜌.

In words, this definition requires that when an algorithm π’œ is executed twice on different i.i.d. datasets 𝑆, 𝑆′ but using shared internal randomness, then the output of the algorithm is exactly the same, with high probability. We note that sharing the randomness across the executions is crucial in achieving this replicability guarantee. Importantly, Dixon et al. (2023) showed that without sharing randomness, it is impossible to achieve such a strong notion of replicability even for simple tasks such as mean estimation.
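To make the two-execution experiment in Definition 1.1 concrete, here is a minimal Python sketch. The `learn` routine below is a hypothetical stand-in for π’œ (a randomly shifted rounding of the sample mean), and the shared random string π‘Ÿ is modeled as a seed; neither is part of the paper.

```python
import numpy as np

def learn(samples: np.ndarray, r: np.random.Generator) -> float:
    # Hypothetical learner standing in for the algorithm A: it rounds the
    # sample mean onto a unit grid whose offset is drawn from the random string r.
    offset = r.uniform(0.0, 1.0)
    return offset + round(float(samples.mean()) - offset)

def replicability_experiment(draw_dataset, n: int, seed: int) -> bool:
    S = draw_dataset(n)        # S ~ D^n
    S_prime = draw_dataset(n)  # S' ~ D^n, independent of S
    # The two executions share the internal randomness r (same seed).
    out = learn(S, np.random.default_rng(seed))
    out_prime = learn(S_prime, np.random.default_rng(seed))
    # Definition 1.1 asks that the two outputs coincide with probability
    # at least 1 - rho over the draw of S, S', and r.
    return out == out_prime

if __name__ == "__main__":
    data_rng = np.random.default_rng(0)
    draw = lambda n: data_rng.normal(loc=0.3, scale=1.0, size=n)
    agreements = [replicability_experiment(draw, n=10_000, seed=s) for s in range(200)]
    print("fraction of runs with identical outputs:", np.mean(agreements))
```

For this toy learner the agreement rate should be close to one: the two sample means differ by roughly 1/βˆšπ‘›, which rarely places them on opposite sides of a grid boundary of unit length.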
In practice, this can be achieved by simply publishing the random seed with which the ML algorithms are executed. As we extensively discuss in Section 1.2, Definition 1.1 turns out to be connected with other notions of stability such as differential privacy and perfect generalization (Ghazi et al., 2021; Bun et al., 2023).

In this work, we study the fundamental problem of learning large-margin halfspaces, where no example is allowed to lie too close to the separating hyperplane. This task is related to foundational ML techniques such as the Perceptron algorithm (Rosenblatt, 1958), SVMs (Cortes and Vapnik, 1995), and AdaBoost (Freund and Schapire, 1997). Let us recall the concept class of interest.

Definition 1.2 (Large-Margin Halfspaces). Let π’Ÿ be a distribution over R^𝑑 Γ— {βˆ’1, 1} whose support does not contain π‘₯ = 0. We say that π’Ÿ has linear margin 𝜏 if there exists a unit vector 𝑀* ∈ R^𝑑 such that for any (π‘₯, 𝑦) ∈ supp(π’Ÿ) it holds that 𝑦(𝑀*^⊀π‘₯/βˆ₯π‘₯βˆ₯) β‰₯ 𝜏. (When we do not specify the norm, we assume the β„“2-norm.)

Following the PAC learning definition of Valiant (1984), we say that an algorithm learns the class of 𝜏-margin halfspaces in 𝑑 dimensions with accuracy πœ– and confidence 𝛿 using 𝑛 = 𝑛(πœ–, 𝛿, 𝑑, 𝜏) samples and runtime 𝑇 = 𝑇(πœ–, 𝛿, 𝑑, 𝜏) if, given 𝑛 i.i.d. samples from any distribution π’Ÿ satisfying Definition 1.2, the algorithm outputs, in time 𝑇, a classifier β„Ž: R^𝑑 β†’ {βˆ’1, 1} such that Pr_{(π‘₯,𝑦)βˆΌπ’Ÿ}[β„Ž(π‘₯) β‰  𝑦] ≀ πœ–, with probability at least 1 βˆ’ 𝛿. We are interested in replicably learning large-margin halfspaces, i.e., designing algorithms for large-margin halfspaces that further satisfy Definition 1.1. We remark that when the feature domain is infinite, there is no replicable learning algorithm for halfspaces in general; thus, some assumption such as the large-margin condition is necessary. In particular, Bun et al. (2023); Kalavasis et al. (2023) show that finiteness of the Littlestone dimension is a necessary condition for learnability by replicable algorithms, and it is known that even one-dimensional halfspaces over [0, 1] have infinite Littlestone dimension. See Table 1.1 for a comparison of prior work and our contributions.

Table 1.1. Replicable Algorithms for Large-Margin Halfspaces. We denote by 𝑑 the dimension, πœ– the accuracy, 𝜌 the replicability, and 𝜏 the margin of the halfspace. We omit the logarithmic factors on the sample complexity. See Appendix A for a more detailed comparison.

ALGORITHM | SAMPLES | PROPER?
[ILPS22] (inefficient) | (𝑑 πœ–^{-3} 𝜏^{-8} 𝜌^{-2})^{1.01} | NO
[ILPS22] | (𝑑^3 πœ–^{-4} 𝜏^{-10} 𝜌^{-2})^{1.01} | NO
Algorithm 1 (Theorem 1.3) | πœ–^{-1} 𝜏^{-7} 𝜌^{-2} | YES
Algorithm 2 (Theorem 1.4) | πœ–^{-2} 𝜏^{-6} 𝜌^{-2} | YES
DP reduction [LNUZ20; BGH+23] (Proposition 1.5, inefficient) | πœ–^{-2} 𝜏^{-4} 𝜌^{-2} | YES
Algorithm 3 (Theorem 1.6, inefficient) | πœ–^{-1} 𝜏^{-4} 𝜌^{-2} | YES

The work of Impagliazzo et al. (2022) provided the first replicable algorithms for 𝜏-margin halfspaces over R^𝑑. The first algorithm of Impagliazzo et al. (2022), which uses the foams discretization scheme (Kindler et al., 2012), is 𝜌-replicable and returns a hypothesis β„Ž that, with probability at least 1 βˆ’ 𝜌, satisfies Pr_{(π‘₯,𝑦)βˆΌπ’Ÿ}[β„Ž(π‘₯) β‰  𝑦] ≀ πœ–. The sample complexity of this algorithm is roughly 𝑂((𝑑 πœ–^{-3} 𝜏^{-8} 𝜌^{-2})^{1.01}) and the runtime is exponential in 𝑑 and polynomial in 1/πœ–, 1/𝜌, and 1/𝜏. The second algorithm of Impagliazzo et al.
(2022), which uses the box discretization scheme, is 𝜌-replicable and returns a hypothesis β„Žthat, with probability at least 1 𝜌, satisfies Pr(π‘₯,𝑦) π’Ÿ[β„Ž(π‘₯) = 𝑦] πœ–with sample complexity 𝑂((𝑑3πœ– 4𝜏 10𝜌 2)1.01) and runtime which is polynomial in 𝑑, 1/πœ–, 1/𝜌and 1/𝜏. These two algorithms appear in the first two rows of Table 1.1. Some remarks are in order. In our setting, the sample complexity of learning large-margin halfspaces in the absence of the replicability requirement is 𝑂(1/πœ–πœ2). Notice that the sample complexity of both algorithms of Impagliazzo et al. (2022) depends on the dimension 𝑑of the problem. This is unexpected since the sample complexity of non-replicable algorithms for this task is dimension-independent. In the case of the replicable algorithms of Impagliazzo et al. (2022), the dependence on the dimension appears due to a rounding/discretization step, which is crucial in establishing the replicability guarantees. Second, both algorithms of Impagliazzo et al. (2022) are improper in the sense that the hypothesis β„Žthey output does not correspond to a halfspace. This is due to the use of a replicable boosting routine that outputs the majority vote over multiple halfspaces. As a general note, both of these algorithms are fairly complicated: they require multiple discretization/rounding steps, and then they output a weak learner, which finally needs to be boosted using multiple rounds of a replicable boosting scheme. As a result, the sample complexity of their algorithms incurs a significant blow-up in the parameters πœ–, 𝜏 compared to the non-replicable setting. We work in the setting of finite (but not necessarily bounded) bit precision with the goal of designing algorithms that are agnostic to the marginal distribution and sample complexity that is dimension-independent. Indeed, assuming bounded bit precision 𝑏, possibly some structure about the marginal distribution, and that the margin 𝜏 > 0, there are replicable algorithms for learning halfspaces with time and sample complexity 𝑂(poly(𝑑, 𝑏, 1/πœ–, log(1/𝜏), 1/𝜌)) via black-box transformations of existing (complicated) SQ algorithms (Balcan and Feldman, 2015; Dunagan and Vempala, 2004). In the regime when 𝜏is constant and 𝑑is relatively large, our algorithms can outperform the black-box transformations. 1.1. Our Contribution Impagliazzo et al. (2022) leave as an open question whether Replicable Learning of Large-Margin Halfspaces the sample complexity bounds of their algorithms are tight. In this work, we show that these bounds are sub-optimal. We provide new replicable algorithms for learning largemargin halfspaces that improve upon the results of Impagliazzo et al. (2022) in various aspects. Our algorithms have no sample dependence on 𝑑, strictly improve on the dependence on 1/πœ–, 1/𝜌, and 1/𝜏,2 and are proper, meaning that they output linear models. Moreover, our Algorithm 1 and Algorithm 2 are computationally efficient while Algorithm 3 forsakes computational efficiency to achieve further improvements in sample complexity. We now state our first algorithmic result, the proof of which can be found in Section 3. Theorem 1.3 (Efficient Replicable Algorithm 1). Fix πœ–, 𝜏, 𝜌, 𝛿 (0, 1). Let π’Ÿbe a distribution over R𝑑 { 1, 1} that has linear margin 𝜏as in Definition 1.2. There is an algorithm that is 𝜌-replicable and, given π‘š= 𝑂(πœ– 1𝜏 7𝜌 2 log(1/𝛿)) i.i.d. 
samples (π‘₯, 𝑦) π’Ÿ, computes in time poly(𝑑, 1/πœ–, 1/𝜏, 1/𝜌, log(1/𝛿)) a unit vector 𝑀 R𝑑such that Pr(π‘₯,𝑦) π’Ÿ[sgn(𝑀 π‘₯) = 𝑦] πœ–with probability at least 1 𝛿. Algorithm 1 improves on the sample complexity of the two algorithms appearing in Impagliazzo et al. (2022), runs in polynomial time, and is proper. Our techniques follow a different path from that of Impagliazzo et al. (2022). As we alluded to before, their approach is fairly complicated and is based on the design of a replicable weak halfspace learning algorithm and then a replicable boosting algorithm that combines multiple weak learners. Our approach is single-shot and significantly simpler: Consider 𝐡independent non-overlapping batches of training examples. From each batch 𝑖 [𝐡], we find a hyperplane with normal vector 𝑀𝑖 R𝑑that has Ω(𝜏) margin on the training data. This can be achieved by running the standard Support Vector Machine (SVM) algorithm (Cortes and Vapnik, 1995; Vapnik, 2006). We then aggregate our vectors to a single average normal vector 𝑧= (1/𝐡) 𝑖 [𝐡] 𝑀𝑖. Finally, we project the vector 𝑧onto a lower-dimensional space, whose dimension does not depend on 𝑑, and we replicably round 𝑧using a rounding scheme due to Alon and Klartag (2017) for which we perform a novel analysis in the shared randomness setting. We emphasize that our algorithm gives a halfspace with the desired accuracy guarantee without the need to use any boosting schemes. At a technical level, we avoid the dependence on the dimension 𝑑thanks to data-oblivious dimensionality reduction techniques (cf. Appendix B.3), a standard tool in the literature of large-margin halfspaces. Instead of rounding our estimates in the 𝑑-dimensional space, we first project them to a lower-dimensional space, whose dimension does 2By iteratively halving a guess for 𝜏, we may assume without loss of generality that 𝜏is known. not depend on 𝑑, and we perform the rounding in that space. The crucial idea is that one can use the data-obliviousness of Johnson-Lindenstrauss (JL) matrices so that the projection matrices in two distinct executions are the same since the internal randomness is shared. Another technical aspect of our algorithm that differentiates it from prior works on the design of replicable algorithms is the use of a different rounding scheme known as the Alon-Kartag rounding scheme (cf. Section 2). While this rounding scheme was known, to the best of our knowledge, we are the first to utilize and analyze the scheme in the context of replicability. Consider the simple case of 1-dimensional data. In the same spirit as in Impagliazzo et al. (2022), we consider a random grid. But rather than rounding the point to a fixed element of each cell of the grid (e.g., its center), we randomly round it to one of the two endpoints of the cell using shared internal randomness so that, in expectation, the rounded point is the same as the original one. This is helpful as it preserves inner products in expectation, and therefore gives better concentration properties across multiple roundings. We believe that this rounding scheme can find more applications in the replicable learning literature. The detailed proof of Theorem 1.3 can be found in Section 3. Despite the simplicity of our algorithm, there are technical subtleties that complicate its analysis. For instance, the projection to the low-dimensional space introduces a subtle complication we need to handle. In particular, using ideas from GrΓΈnlund et al. 
(2020) we can show that the aggregated vector in the high-dimensional space has the desired generalization properties. However, when we project it to the low-dimensional space there are vectors that are now misclassified, due to the error introduced by the JL mapping. Using the guarantees of the JL projection, we show that, uniformly over the data-generating distributions, this happens for only a small fraction of the population. Algorithm 2 follows a similar framework as Algorithm 1 by running a non-replicable algorithm on independent batches of data and then aggregating the outputs replicably, illustrating the flexibility of our approach. We now describe this result in more detail, whose proof can be found in Section 4. Theorem 1.4 (Efficient Replicable Algorithm 2). Fix πœ–, 𝜏, 𝜌, 𝛿 (0, 1). Let π’Ÿbe a distribution over R𝑑 { 1, 1} that has linear margin 𝜏as in Definition 1.2. There is an algorithm that is 𝜌-replicable and, given π‘š= 𝑂(πœ– 2𝜏 6𝜌 2 log(1/𝛿)) i.i.d. samples (π‘₯, 𝑦) π’Ÿ, computes in time poly(𝑑, 1/πœ–, 1/𝜏, 1/𝜌, log(1/𝛿)) a unit vector 𝑀 R𝑑such that Pr(π‘₯,𝑦) π’Ÿ[sgn(𝑀 π‘₯) = 𝑦] πœ–with probability at least 1 𝛿. Compared to Algorithm 1, our Algorithm 2 achieves better dependence on 𝜏by incurring an additional 1/πœ–fac- Replicable Learning of Large-Margin Halfspaces tor in the sample complexity. At a technical level, as in (LΓͺ Nguyen et al., 2020), we provide a convex surrogate that upper bounds the loss 1{𝑦(π‘₯ 𝑀) 𝜏/2}. Running SGD on this convex surrogate provides a unit vector 𝑀that, in expectation over the data, achieves a margin of at least 𝜏/2 for an 𝑂(πœ–)-mass of the population. We then apply a standard boosting trick to turn this guarantee into a high probability bound. Next, we work as in Algorithm 1: we run the above procedure 𝐡times to get 𝑀1, ..., 𝑀𝐡and aggregate our vectors into a single vector 𝑧= (1/𝐡) 𝑖 [𝐡] 𝑀𝑖. Lastly, we perform a JL-projection on 𝑧and then round using the Alon-Klartag rounding scheme, as in Algorithm 1. Computationally Inefficient Reductions from DP. It is a corollary of the works of Bun et al. (2023) and Kalavasis et al. (2023) that one can use existing differentially private (DP) algorithms in order to obtain replicable learners. In particular, following the reduction of Bun et al. (2023), one can obtain a replicable algorithm for large-margin halfspaces with better sample complexity in terms of 𝜏, but in a computationally inefficient way. The idea is to take an off-the-shelf DP algorithm (recall Definition F.1) for this problem (e.g. LΓͺ Nguyen et al. (2020)) and transform it into a replicable one. We remark that this transformation holds when the algorithm outputs finitely many different solutions and its running time is exponential in the number of these solutions. Fortunately, the pure DP algorithm from LΓͺ Nguyen et al. (2020) satisfies this finite co-domain property. The formal statement of the result we get by combining these two algorithms is presented below. Proposition 1.5 (Inefficient Replicable Algorithm; follows from LΓͺ Nguyen et al. (2020); Bun et al. (2023)). Fix πœ–, 𝜏, 𝜌, 𝛿 (0, 1). Let π’Ÿbe a distribution over R𝑑 { 1, 1} that has linear margin 𝜏as in Definition 1.2. There is an algorithm that is 𝜌-replicable and, given π‘š= 𝑂(πœ– 2𝜏 4𝜌 2 log(1/𝛿)) i.i.d. samples (π‘₯, 𝑦) π’Ÿ, com- putes, in time exp ( (1/𝜏) log(1/(πœ–πœŒπ›Ώ)) 𝜏2 ) poly(𝑑), a unit vector 𝑀 R𝑑such that Pr(π‘₯,𝑦) π’Ÿ[sgn(𝑀 π‘₯) = 𝑦] πœ–, with probability at least 1 𝛿. 
As mentioned above, the proof of this result follows by combining the DP-to-Replicability transformation of Bun et al. (2023) (cf. Proposition F.3) with the pure DP algorithm for learning large-margin halfspaces due to LΓͺ Nguyen et al. (2020) (cf. Proposition F.2). We note that since the algorithm of LΓͺ Nguyen et al. (2020) is proper and the reduction of Bun et al. (2023) is based on sub-sampling, the output of Proposition 1.5 is also a linear classifier. The main issue with this approach is that, apart from not being a polynomial time algorithm, the reduction requires a quadratic blow-up in the sample complexity of the provided DP algorithm. To be more specific, the DP algorithm of (LΓͺ Nguyen et al., 2020) has sample complexity 𝑂(πœ– 1𝜏 2) for accuracy πœ– and margin 𝜏. This means that the replicable algorithm of Proposition 1.5 incurs a quadratic blow-up in the sample complexity on the parameters πœ–, 𝜏. The cost of this transformation is tight under standard cryptographic hardness assumptions (Bun et al., 2023). Thus, it is unlikely that we can reduce the dependence on the error parameter πœ–using such a generic transformation. We emphasize that our efficient replicable algorithm (cf. Algorithm 1) has linear sample complexity dependence on 1/πœ–. The blow-up on the running time of the algorithm is due to the use of correlated sampling in the transformation of Bun et al. (2023) which requires exponential running time in the size of the output space. We remark that in the case of the algorithm of LΓͺ Nguyen et al. (2020), the size of the output space is already exponential in 1/𝜏2. In our work, we also revisit this inefficient algorithm and improve on its sample complexity and runtime, as follows: Theorem 1.6 (Improved Inefficient Replicable Algorithm 3). Fix πœ–, 𝜏, 𝜌, 𝛿 (0, 1). Let π’Ÿbe a distribution over R𝑑 { 1, 1} that has linear margin 𝜏as in Definition 1.2. Then there is a 𝜌-replicable algorithm such that given π‘š= 𝑂(πœ– 1𝜏 4𝜌 2 log(1/𝛿)) i.i.d. samples (π‘₯, 𝑦) π’Ÿ, computes in time poly(𝑑) poly(1/πœ–, 1/𝜏, 1/𝜌, 1/𝛿)1/𝜏2, a unit vector 𝑀 R𝑑satisfying Pr(π‘₯,𝑦) π’Ÿ[sgn(𝑀 π‘₯) = 𝑦] πœ– with probability at least 1 𝛿. Compared to the DP-to-Replicability reduction, Theorem 1.6 has better dependence on 1/πœ–and better running time. For the proof of this result, we refer to Section 5. 1.2. Related Work In terms of technique, a related work is that of (Nissim et al., 2016) for differentially private clustering. The authors also use the JL projection followed by a discretization step as part of their algorithm. Similar to our techniques, this serves to avoid a factor in the discretization scheme that scales with the dimension. One important difference is that (Nissim et al., 2016) solves a task on the samples while we need to solve a task on a distribution using samples. Moreover, while (Nissim et al., 2016) also use a discretization step (not based on the Alon-Kartag scheme) to find a dense part of the space to use as the center of the cluster, the discretization step in our work serves a different purpose: to ensure that our estimates, across two i.i.d. executions, will be the same. Replicability. Pioneered by Impagliazzo et al. (2022), there has been a growing interest from the learning theory community in studying replicability as an algorithmic property in various learning tasks. 
Among other things, Replicable Learning of Large-Margin Halfspaces their work showed that the fundamental class of statistical queries, which appears in various settings (see e.g., Blum et al. (2003); Gupta et al. (2011); Goel et al. (2020); Fotakis et al. (2021) and the references therein) can be made replicable. Subsequently, Esfandiari et al. (2023a;b) studied replicable algorithms in the context of multi-armed bandits and clustering. Later, Eaton et al. (2023); Karbasi et al. (2023) studied replicability in the context of Reinforcement Learning. Recently, Bun et al. (2023); Kalavasis et al. (2023; 2024) established equivalences between replicability and other notions of algorithmic stability such as differential privacy (DP), and Moran et al. (2023) derived more fine-grained characterizations of these equivalences. It is worth mentioning that Malliaris and Moran (2022) had already established equivalences between various notions of algorithmic stability and finiteness of the Littlestone dimension of the underlying concept class. Inspired by Impagliazzo et al. (2022), a related line of work (Chase et al., 2023b; Dixon et al., 2023; Chase et al., 2023a) proposed and studied alternative notions of replicability such as listreplicability, where the requirement is that when the algorithm is executed multiple times on i.i.d. datasets, then the number of different solutions it will output across these executions is small. Large-Margin Halfspaces. The problem of learning large-margin halfspaces has been extensively studied and has inspired various fundamental algorithms (Rosenblatt, 1958; Vapnik, 1999; Freund and Schapire, 1997; 1998). In the DP setting, Blum et al. (2005) gave a dimensiondependent construction based on a private version of the perceptron algorithm. This was later improved by LΓͺ Nguyen et al. (2020) who gave new DP algorithms for this task with dimension-independent guarantees based on the Johnson-Lindenstrauss transformation. Next, Bun et al. (2020) constructed noise-tolerant and private PAC learners for large-margin halfspaces whose sample complexity also does not depend on the dimension. Beimel et al. (2019) and Kaplan et al. (2020) designed private algorithms for learning halfspaces without margin guarantees when the domain is finite. Bassily et al. (2022b) stated an open problem of finding optimal DP algorithms for learning largemargin halfspaces both with respect to their running time and their sample complexity. Bassily et al. (2022a) studied DP algorithms for various learning tasks with margin, including halfspaces, kernels, and neural networks. In the area of robust statistics, Diakonikolas et al. (2023) showed a statistical-computational tradeoff in the problem of PAC learning large-margin halfspaces with random classification noise. For further results on robustly learning largemargin halfspaces, we refer to Diakonikolas et al. (2019) and the references therein. 2. The Main Tool: The Alon-Klartag Rounding Scheme Inspired by Alon and Klartag (2017), we introduce and use the following rounding scheme AKround(𝑧, 𝛽) for a point 𝑧with parameter 𝛽: let π‘œ= (π‘œ1, . . . , π‘œπ‘˜) 𝑖.𝑖.𝑑. π‘ˆ[0, 𝛽] be uniformly random offsets and implicitly discretize Rπ‘˜ using a grid of side length 𝛽centered at π‘œ. Let π‘œ(𝑧) Rπ‘˜ denote the bottom-left corner of the cube in which 𝑧lies, i.e., the point obtained by rounding down all the coordinates of 𝑧. For a vector 𝑧, we let 𝑧[𝑖] be its 𝑖-th coordinate. 
Define 𝑝(𝑧)[𝑖] ∈ [0, 1] to be such that

𝑝(𝑧)[𝑖] Β· π‘œ(𝑧)[𝑖] + (1 βˆ’ 𝑝(𝑧)[𝑖]) Β· (π‘œ(𝑧)[𝑖] + 𝛽) = 𝑧[𝑖].

Given offsets π‘œ and thresholds 𝑒 = (𝑒_1, . . . , 𝑒_π‘˜) with 𝑒_𝑖 ∼ π‘ˆ[0, 1], round a vector 𝑧 to 𝑓_{π‘œ,𝑒}(𝑧), whose 𝑖-th coordinate is equal to π‘œ(𝑧)[𝑖] if 𝑒_𝑖 ≀ 𝑝(𝑧)[𝑖] and to π‘œ(𝑧)[𝑖] + 𝛽 otherwise. Crucially, in expectation, the rounded point 𝑓_{π‘œ,𝑒}(𝑧) coincides with 𝑧. The next lemma, whose proof can be found in Appendix C, is useful in order to derive the replicability guarantees of our rounding scheme.

Lemma 2.1 (Stability of Rounding). Let 𝑧, 𝑧′ ∈ R^π‘˜. Then for independent uniform offsets π‘œ_1, . . . , π‘œ_π‘˜ ∈ [0, 𝛽] and thresholds 𝑒_1, . . . , 𝑒_π‘˜ ∈ [0, 1], we have Pr_{π‘œ,𝑒}[𝑓_{π‘œ,𝑒}(𝑧) β‰  𝑓_{π‘œ,𝑒}(𝑧′)] ≀ 2𝛽^{-1} βˆ₯𝑧 βˆ’ 𝑧′βˆ₯_1.

Our novel analysis of the stability of the rounding scheme under shared randomness (cf. Lemma 2.1) demonstrates its useful properties for designing replicable algorithms. We believe these properties may be of interest beyond the scope of this work and can find applications in designing replicable algorithms for different problems. Next, we show that the Alon-Klartag rounding scheme additively preserves inner products with high probability. This is formalized below in a lemma whose proof can be found in Appendix C.

Lemma 2.2 (Rounding Preserves Inner Products). Let 𝑧, π‘₯ ∈ R^π‘˜ be such that βˆ₯π‘₯βˆ₯ ≀ 1. For uniform offsets π‘œ_1, . . . , π‘œ_π‘˜ ∈ [0, 𝛽] and thresholds 𝑒_1, . . . , 𝑒_π‘˜ ∈ [0, 1], we have Pr_{π‘œ,𝑒}[|𝑓_{π‘œ,𝑒}(𝑧)^⊀π‘₯ βˆ’ 𝑧^⊀π‘₯| > 𝛼] ≀ 2 exp(βˆ’2𝛼^2 𝛽^{-2}).

It is worth mentioning that the Alon-Klartag rounding scheme, along with dimensionality reduction techniques, was also used by GrΓΈnlund et al. (2020) in order to prove generalization bounds for SVMs.

3. Replicably Learning Large-Margin Halfspaces: Algorithm 1

In this section, we describe our first algorithm and prove its guarantees as stated in Theorem 1.3. Let π’Ÿ be a distribution over R^𝑑 Γ— {βˆ’1, 1} with linear margin 𝜏 ∈ (0, 1) (cf. Definition 1.2). Thus, there is a unit vector 𝑀* ∈ R^𝑑 such that for all (π‘₯, 𝑦) ∈ supp(π’Ÿ) with π‘₯ β‰  0, we have 𝑦(π‘₯^βŠ€π‘€*/βˆ₯π‘₯βˆ₯) β‰₯ 𝜏. Given πœ–, 𝜌, 𝛿 ∈ (0, 1), our goal is to design a 𝜌-replicable learning algorithm that draws π‘š = π‘š(πœ–, 𝜏, 𝜌, 𝛿) i.i.d. samples from π’Ÿ and outputs 𝑀̂ ∈ R^𝑑 such that, with probability at least 1 βˆ’ 𝛿 over the randomness of the samples and (potentially) the internal randomness of the algorithm, it holds that Pr_{(π‘₯,𝑦)βˆΌπ’Ÿ}[𝑦(𝑀̂^⊀π‘₯) ≀ 0] ≀ πœ–.

Description of Algorithm 1. We consider 𝐡 batches of 𝑛 samples each. Hence, in total, we draw 𝑛𝐡 i.i.d. samples from π’Ÿ. On each batch 𝑖 ∈ [𝐡], we run the standard SVM algorithm (cf. Lemma B.1) to find a hyperplane with normal vector 𝑀_𝑖 ∈ R^𝑑 that has margin at least 𝜏/2 on all training data in the batch. We then compute the average normal vector 𝑧 = (1/𝐡) βˆ‘_{π‘–βˆˆ[𝐡]} 𝑀_𝑖. Finally, we round 𝑧 as described in Section 2.

Algorithm 1 Replicable Large-Margin Halfspaces
1: π‘˜ ← 𝐢_1 𝜏^{-2} log(1/(πœ–πœπœŒπ›Ώ))
2: 𝐡 ← 𝐢_2 𝜏^{-4} 𝜌^{-2} log(1/(πœ–πœπœŒπ›Ώ))
3: 𝑛 ← 𝐢_3 πœ–^{-1} 𝜏^{-3} log(1/(πœ–πœπœŒπ›Ώ))
4: 𝛽 ← 𝐢_4 𝜏 / log(1/(πœ–πœπœŒπ›Ώ))
5: for 𝑖 = 1, 2, . . . , 𝐡 do
6:   𝑆_𝑖 ← batch of 𝑛 i.i.d. samples from π’Ÿ
7:   𝑀_𝑖 ← SVM(𝑆_𝑖, 𝜏/2)
8: end for
9: 𝑧 ← (1/𝐡) βˆ‘_{π‘–βˆˆ[𝐡]} 𝑀_𝑖
10: Draw 𝐴 ∈ R^{π‘˜Γ—π‘‘} with 𝐴_{𝑖,𝑗} ∼ 𝒩(0, 1/π‘˜) (shared randomness)
11: 𝑏 ← AKround(𝐴𝑧, 𝛽) (shared randomness, cf. Section 2)
12: return 𝑀̂ = 𝐴^βŠ€π‘ / βˆ₯𝐴^βŠ€π‘βˆ₯
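The AKround subroutine invoked in line 11 is simple to state in code. The following Python sketch renders the rounding scheme of Section 2; the function name ak_round and the use of a seeded generator to model the shared random string are our own illustrative choices.

```python
import numpy as np

def ak_round(z: np.ndarray, beta: float, shared_rng: np.random.Generator) -> np.ndarray:
    """Alon-Klartag randomized rounding of z onto a grid of side length beta.

    The offsets o_i ~ U[0, beta] and thresholds u_i ~ U[0, 1] come from the
    *shared* random string, so two executions use identical o and u.
    """
    k = z.shape[0]
    o = shared_rng.uniform(0.0, beta, size=k)        # random grid offsets
    u = shared_rng.uniform(0.0, 1.0, size=k)         # rounding thresholds
    lower = o + np.floor((z - o) / beta) * beta      # o(z): round each coordinate down to the grid
    # p(z)[i] solves p*lower + (1-p)*(lower+beta) = z[i], so the rounding is unbiased.
    p = (lower + beta - z) / beta
    return np.where(u <= p, lower, lower + beta)

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    z = rng.normal(size=50)
    z_prime = z + 1e-5 * rng.normal(size=50)         # a nearby vector, as in Lemma 2.1
    shared_seed = 7
    b = ak_round(z, beta=0.1, shared_rng=np.random.default_rng(shared_seed))
    b_prime = ak_round(z_prime, beta=0.1, shared_rng=np.random.default_rng(shared_seed))
    print("rounded outputs identical:", np.array_equal(b, b_prime))
    # Unbiasedness: averaging over fresh randomness approximately recovers z (Section 2).
    avg = np.mean([ak_round(z, 0.1, np.random.default_rng(s)) for s in range(2000)], axis=0)
    print("max |E[f(z)] - z| ~", np.abs(avg - z).max())
```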
Correctness of Algorithm 1. A straightforward adaptation of the results of GrΓΈnlund et al. (2020) (cf. Lemma B.1) shows that, with probability 1 βˆ’ 𝛿/(10𝐡) over the samples, the classifier 𝑀_𝑖 has margin at least 𝜏/4 on a (1 βˆ’ 𝑂((log 𝑛 + log(𝐡/𝛿))/(𝜏^2 𝑛))) fraction of the population, i.e.,

Pr_{(π‘₯,𝑦)βˆΌπ’Ÿ}[𝑦(𝑀_𝑖^⊀π‘₯)/βˆ₯π‘₯βˆ₯ < 𝜏/4] ≀ 𝑂((log 𝑛 + log(𝐡/𝛿))/(𝜏^2 𝑛)).

We denote the complement of this event by 𝐸_𝑖 and condition on not observing βˆͺ_{π‘–βˆˆ[𝐡]} 𝐸_𝑖. Under this event, for all the vectors 𝑀_𝑖, 𝑖 ∈ [𝐡], and all points (π‘₯, 𝑦) ∈ R^𝑑 Γ— {βˆ’1, 1}, it holds that 𝑦(𝑀_𝑖^⊀π‘₯)/βˆ₯π‘₯βˆ₯ β‰₯ βˆ’1, since 𝑀_𝑖 is a unit vector. Furthermore, for a (1 βˆ’ 𝑂(1/(𝜏^2 𝑛)))-fraction of the population, the margin is at least 𝜏/4, i.e., 𝑦(𝑀_𝑖^⊀π‘₯)/βˆ₯π‘₯βˆ₯ β‰₯ 𝜏/4. Intuitively, this means that the vector 𝑧 = (1/𝐡) βˆ‘_{π‘–βˆˆ[𝐡]} 𝑀_𝑖 should have margin at least 𝜏/8, except for an 𝑂(1/(𝜏^3 𝑛)) fraction of the population. Formally, for (π‘₯, 𝑦) ∼ π’Ÿ, let 𝑍_𝑖 be the indicator variable of the event 𝑦(𝑀_𝑖^⊀π‘₯)/βˆ₯π‘₯βˆ₯ < 𝜏/4 and let 𝑍 := βˆ‘_{π‘–βˆˆ[𝐡]} 𝑍_𝑖. Then

𝑦 𝑧^⊀(π‘₯/βˆ₯π‘₯βˆ₯) = (1/𝐡) βˆ‘_{π‘–βˆˆ[𝐡]} 𝑦 𝑀_𝑖^⊀(π‘₯/βˆ₯π‘₯βˆ₯) β‰₯ (1/𝐡)(βˆ’π‘ + (𝐡 βˆ’ 𝑍)𝜏/4) = 𝜏/4 βˆ’ (𝑍/𝐡)(1 + 𝜏/4).

This means that if 𝑦(𝑧^⊀π‘₯)/βˆ₯π‘₯βˆ₯ < 𝜏/8, then 𝜏/4 βˆ’ (𝑍/𝐡)(1 + 𝜏/4) < 𝜏/8, and hence 𝑍 > 𝜏𝐡/(8(1 + 𝜏/4)) = Ξ©(𝜏𝐡). It therefore suffices to bound the fraction of the population on which 𝑍 > Ξ©(𝜏𝐡) in order to bound the population error of 𝑧. Notice that the summation of the fractions of the population where the 𝑀_𝑖 have margin less than 𝜏/4 is at most 𝑂(𝐡(log 𝑛 + log(𝐡/𝛿))/(𝜏^2 𝑛)). As noted above, at least Ξ©(𝜏𝐡) of the classifiers must simultaneously have margin less than 𝜏/4 for 𝑧 to have margin below 𝜏/8 on π‘₯. Thus the fraction of the population where 𝑧 has margin smaller than 𝜏/8 is at most

𝑂(𝐡(log 𝑛 + log(𝐡/𝛿))/(𝜏^2 𝑛 Β· 𝜏𝐡)) = 𝑂((log 𝑛 + log(𝐡/𝛿))/(𝜏^3 𝑛)).

Thus, choosing 𝑛 β‰₯ Ξ©(log(𝐡/𝛿)/(𝜏^3 πœ–)) ensures that the intermediate normal vector 𝑧 = (1/𝐡) βˆ‘_{𝑖=1}^𝐡 𝑀_𝑖 satisfies Pr_{(π‘₯,𝑦)βˆΌπ’Ÿ}[𝑦(𝑧^⊀π‘₯/βˆ₯π‘₯βˆ₯) < 𝜏/8] ≀ πœ–/10 with probability at least 1 βˆ’ 𝛿/10.

The following lemma, whose proof is deferred to Appendix D, ensures that projecting and rounding in the lower dimension approximately preserves the performance of 𝑧 with respect to the 0-1 loss (as opposed to the 𝜏/8-margin loss).

Lemma 3.1. Fix πœ–, 𝜏, 𝛿 ∈ (0, 1) and let π’Ÿ be a distribution over R^𝑑 Γ— {Β±1} that admits a linear classifier with 𝜏-margin. Suppose 𝑧 is a random unit vector satisfying Pr_{(π‘₯,𝑦)βˆΌπ’Ÿ}[𝑦(𝑧^⊀π‘₯/βˆ₯π‘₯βˆ₯) < 𝜏/2] ≀ πœ– with probability at least 1 βˆ’ 𝛿 over the draw of 𝑧. Define π‘˜ = Ξ©(𝜏^{-2} log(1/(πœ–π›Ώ))) and 𝛽 = 𝑂(𝜏/log(1/(πœ–π›Ώ))). If 𝐴 ∈ R^{π‘˜Γ—π‘‘} is a JL-matrix (cf. Appendix B.3) and 𝑏 = AKround(𝐴𝑧, 𝛽) (cf. Section 2), then 𝑀̂ = 𝐴^βŠ€π‘/βˆ₯𝐴^βŠ€π‘βˆ₯ satisfies Pr_{(π‘₯,𝑦)βˆΌπ’Ÿ}[𝑦(𝑀̂^⊀π‘₯/βˆ₯π‘₯βˆ₯) ≀ 0] ≀ 2πœ– with probability at least 1 βˆ’ 2𝛿.

We note that applying it with 𝜏 ← 𝜏/4, πœ– ← πœ–/10, and 𝛿 ← 𝛿/10 yields that the final output 𝑀̂ of Algorithm 1 has 0-1 population error at most πœ–/5 with probability at least 1 βˆ’ 𝛿/5, as desired.

Replicability of Algorithm 1. We now state the lemma from which the replicability guarantees of our algorithm follow. Its proof is deferred to Appendix D.

Lemma 3.2. Fix πœ–, 𝜏, 𝜌, 𝛿 ∈ (0, 1). Suppose 𝑀_1, . . . , 𝑀_𝐡 and 𝑀′_1, . . . , 𝑀′_𝐡 are i.i.d. random unit vectors for 𝐡 = Ξ©(𝜏^{-4} 𝜌^{-2} log(1/(πœ–πœπœŒπ›Ώ))), and let 𝑧 = (1/𝐡) βˆ‘_{π‘–βˆˆ[𝐡]} 𝑀_𝑖 and 𝑧′ = (1/𝐡) βˆ‘_{π‘–βˆˆ[𝐡]} 𝑀′_𝑖 be their averages. Define π‘˜ = Θ(𝜏^{-2} log(1/(πœ–πœπœŒπ›Ώ))) and 𝛽 = Θ(𝜏/log(1/(πœ–πœπœŒπ›Ώ))). If 𝐴 ∈ R^{π‘˜Γ—π‘‘} is a JL-matrix (cf. Appendix B.3) and 𝑏 = AKround(𝐴𝑧, 𝛽), 𝑏′ = AKround(𝐴𝑧′, 𝛽) (cf. Section 2), then 𝑏 = 𝑏′ with probability at least 1 βˆ’ 𝜌 over the draw of the 𝑀_𝑖's, 𝑀′_𝑖's, 𝐴, and AKround.

Note that the 𝑀_𝑖's are i.i.d. unit vectors across all batches and both executions, since the samples in the 2𝐡 batches of the two executions are drawn from the same distribution π’Ÿ and the output of the SVM algorithm depends only on its input sample. Hence an application of Lemma 3.2 ensures that Algorithm 1 is indeed 𝜌-replicable.
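Before turning to the sample complexity, here is an illustrative Python rendering of the whole pipeline. It is a sketch, not the authors' implementation: scikit-learn's LinearSVC with a large regularization constant stands in for the hard-margin SVM step, the batch size and the parameters B, k, beta are placeholders rather than the constants 𝐢_1, . . . , 𝐢_4 dictated by the analysis, and the shared seed models the shared random string.

```python
import numpy as np
from sklearn.svm import LinearSVC

def ak_round(z, beta, rng):
    # Alon-Klartag rounding (same scheme as in the Section 2 sketch): unbiased
    # rounding of each coordinate onto a randomly offset grid of side length beta.
    o = rng.uniform(0.0, beta, size=z.shape)
    u = rng.uniform(0.0, 1.0, size=z.shape)
    lower = o + np.floor((z - o) / beta) * beta
    return np.where(u <= (lower + beta - z) / beta, lower, lower + beta)

def replicable_halfspace(sample, B, k, beta, batch_size, shared_seed):
    """Illustrative rendering of Algorithm 1: per-batch SVM, averaging,
    JL projection, and replicable rounding. `sample(n)` must return (X, y)
    with nonzero rows of X and labels y in {-1, +1}."""
    normals = []
    for _ in range(B):
        X, y = sample(batch_size)
        X = X / np.linalg.norm(X, axis=1, keepdims=True)
        # Large-C linear SVM as a stand-in for the hard-margin step SVM(S_i, tau/2).
        w = LinearSVC(C=1e4, loss="hinge", fit_intercept=False,
                      max_iter=100_000).fit(X, y).coef_.ravel()
        normals.append(w / np.linalg.norm(w))
    z = np.mean(normals, axis=0)                      # aggregate normal vector

    shared = np.random.default_rng(shared_seed)       # shared randomness across executions
    A = shared.normal(0.0, 1.0 / np.sqrt(k), size=(k, z.shape[0]))  # JL matrix, A_ij ~ N(0, 1/k)
    b = ak_round(A @ z, beta, shared)                 # replicable rounding in k dimensions
    w_hat = A.T @ b
    return w_hat / np.linalg.norm(w_hat)

if __name__ == "__main__":
    data_rng = np.random.default_rng(0)
    d, tau = 50, 0.2
    w_star = np.ones(d) / np.sqrt(d)
    def sample(n):
        X = data_rng.normal(size=(n, d))
        X = X / np.linalg.norm(X, axis=1, keepdims=True)
        X = X[np.abs(X @ w_star) >= tau]              # enforce a tau-margin distribution
        return X, np.sign(X @ w_star)
    w_hat = replicable_halfspace(sample, B=20, k=60, beta=0.05, batch_size=400, shared_seed=7)
    X_test, y_test = sample(20_000)
    print("test error:", np.mean(np.sign(X_test @ w_hat) != y_test))
```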
Sample Complexity & Running Time of Algorithm 1. The sample complexity is 𝑛𝐡 = 𝑂(πœ–^{-1} 𝜏^{-7} 𝜌^{-2} log(1/𝛿)). By inspection, we see that the total running time of Algorithm 1 is poly(𝑑, 𝑛) = poly(𝑑, 1/πœ–, 1/𝜏, 1/𝜌, log(1/𝛿)).

4. Replicably Learning Large-Margin Halfspaces: Algorithm 2

Let 𝐡_1^𝑑 denote the unit β„“2-ball in 𝑑 dimensions. Our approach is inspired by the work of LΓͺ Nguyen et al. (2020), which designed a similar SGD approach for learning large-margin halfspaces under differential privacy constraints. Consider the following surrogate loss β„Ž: 𝐡_1^𝑑 Γ— 𝐡_1^𝑑 Γ— {βˆ’1, +1} β†’ R_{β‰₯0},

β„Ž(𝑀; π‘₯, 𝑦) := max(0, 2 βˆ’ (2/𝜏) Β· 𝑦(π‘₯^βŠ€π‘€)) β‰₯ 1{𝑦(π‘₯^βŠ€π‘€) < 𝜏/2}.

We remark that β„Ž(𝑀; π‘₯, 𝑦) β‰₯ 1 when 𝑦(π‘₯^βŠ€π‘€) ≀ 𝜏/2 and β„Ž(𝑀; π‘₯, 𝑦) = 0 when 𝑦(π‘₯^βŠ€π‘€) β‰₯ 𝜏. Also, since π‘₯, 𝑀 ∈ 𝐡_1^𝑑, an application of the Cauchy-Schwarz inequality reveals that β„Ž(𝑀; π‘₯, 𝑦) ∈ [0, 2 + 2/𝜏]. Finally, β„Ž is piecewise linear with each piece being 2/𝜏-Lipschitz. Hence, β„Ž is 𝑂(1/𝜏)-Lipschitz.

Description of Algorithm 2. Fix πœ– ∈ (0, 1). We seek to minimize the following loss function over the ball 𝐡_1^𝑑:

𝑓_π’Ÿ(𝑀) := E_{(π‘₯,𝑦)βˆΌπ’Ÿ}[β„Ž(𝑀; π‘₯, 𝑦)] + (πœ–/10) βˆ₯𝑀βˆ₯^2.

By construction, the co-domain of 𝑓_π’Ÿ lies within [0, 2 + 2/𝜏 + πœ–/10] βŠ† [0, 𝑂(1/𝜏)]. Note also that 𝑓_π’Ÿ is an upper bound on the 𝜏/2-margin population loss, i.e., Pr_{(π‘₯,𝑦)βˆΌπ’Ÿ}[𝑦(π‘₯^βŠ€π‘€) < 𝜏/2] ≀ 𝑓_π’Ÿ(𝑀). First, regarding the minima of 𝑓_π’Ÿ, note that any vector 𝑀 ∈ 𝐡_1^𝑑 achieving a margin of 𝜏 satisfies 𝑓_π’Ÿ(𝑀) ≀ πœ–/10. This is because β„Ž(𝑀; π‘₯, 𝑦) = 0 for all (π‘₯, 𝑦) ∈ supp(π’Ÿ). As a result, min_{𝑀 ∈ 𝐡_1^𝑑} 𝑓_π’Ÿ(𝑀) ≀ πœ–/10. Second, let us consider an πœ–/10-optimal solution 𝑀′ with respect to 𝑓_π’Ÿ, i.e., 𝑓_π’Ÿ(𝑀′) ≀ min_{𝑀 ∈ 𝐡_1^𝑑} 𝑓_π’Ÿ(𝑀) + πœ–/10. The above discussion implies that 𝑓_π’Ÿ(𝑀′) ≀ πœ–/5, so 𝑀′ is a 𝜏/2-margin classifier for all but an πœ–/5-fraction of the population, i.e., 𝑀′ satisfies Pr_{(π‘₯,𝑦)βˆΌπ’Ÿ}[𝑦(π‘₯^βŠ€π‘€β€²) < 𝜏/2] ≀ πœ–/5. Note that we may assume without loss of generality that the marginal of π’Ÿ over features is supported on 𝐡_1^𝑑, since we normalize the input π‘₯ ← π‘₯/βˆ₯π‘₯βˆ₯ before applying a classifier 𝑀. Since 𝑓_π’Ÿ is 𝑂(πœ– + 1/𝜏) = 𝑂(1/𝜏)-Lipschitz and Ξ©(πœ–)-strongly convex, we can apply the following standard result:

Theorem 4.1 (Theorem 6.2 in Bubeck et al. (2015)). Let 𝑓 be πœ‡-strongly convex with minimizer 𝑀* and assume that the (sub)gradient oracle 𝑔(𝑀) satisfies E[βˆ₯𝑔(𝑀)βˆ₯^2] ≀ 𝐺^2. Then after 𝑇 iterations, projected stochastic gradient descent SGD(π’Ÿ, 𝑓, 𝑇) with step size πœ‚_𝑑 = 2/(πœ‡(𝑑 + 1)) satisfies

E[𝑓(βˆ‘_{𝑑=1}^𝑇 (2𝑑/(𝑇(𝑇 + 1))) 𝑀_𝑑)] βˆ’ 𝑓(𝑀*) ≀ 2𝐺^2/(πœ‡(𝑇 + 1)).

Since 𝐺^2 = 𝑂(1/𝜏^2), πœ‡ = Ξ©(πœ–), and 𝑓(𝑀*) ≀ πœ–/10, choosing 𝑇 β‰₯ Ξ©(πœ–^{-2} 𝜏^{-2}) yields an πœ–/10-optimal solution in expectation. Repeating this process independently a small number of times and outputting the run with the lowest objective yields an πœ–/5-optimal solution with high probability.

Lemma 4.2. Let 𝐡_1^𝑑 be the unit β„“2-ball in 𝑑 dimensions. Fix πœ–, 𝜏, 𝛿 ∈ (0, 1) and let π’Ÿ be a distribution over 𝐡_1^𝑑 Γ— {Β±1} that admits a linear 𝜏-margin classifier. There is an algorithm boostSGD(π’Ÿ, πœ–, 𝜏, 𝛿) that outputs a unit vector 𝑀̃ ∈ R^𝑑 such that 𝑓_π’Ÿ(𝑀̃) ≀ min_{𝑀 ∈ 𝐡_1^𝑑} 𝑓_π’Ÿ(𝑀) + πœ– with probability at least 1 βˆ’ 𝛿. Moreover, boostSGD has sample complexity 𝑂(πœ–^{-2} 𝜏^{-2} log(1/𝛿)) and running time poly(1/πœ–, 1/𝜏, log(1/𝛿), 𝑑).
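As a concrete illustration of the SGD step behind Lemma 4.2, the following numpy sketch runs projected SGD on the regularized surrogate with the step size πœ‚_𝑑 = 2/(πœ‡(𝑑 + 1)) of Theorem 4.1 and keeps the weighted average iterate. The repeat-and-pick-the-best wrapper is our simplified reading of the boosting step, it operates on a fixed sample as a stand-in for fresh draws from π’Ÿ, and the constant in the choice of 𝑇 is illustrative.

```python
import numpy as np

def project_to_ball(w):
    nrm = np.linalg.norm(w)
    return w if nrm <= 1.0 else w / nrm

def surrogate_objective(w, X, y, tau, eps):
    # f(w) = mean of max(0, 2 - (2/tau) * y <x, w>) plus the (eps/10)||w||^2 regularizer.
    margins = y * (X @ w)
    return np.mean(np.maximum(0.0, 2.0 - 2.0 * margins / tau)) + (eps / 10.0) * (w @ w)

def sgd_on_surrogate(X, y, tau, eps, T, rng):
    """One projected-SGD run; returns the weighted average iterate of Theorem 4.1."""
    mu = eps / 5.0                                   # strong convexity from (eps/10)||w||^2
    w = np.zeros(X.shape[1])
    w_avg = np.zeros(X.shape[1])
    for t in range(1, T + 1):
        i = rng.integers(len(y))                     # stochastic (sub)gradient from one sample
        grad = (eps / 5.0) * w
        if 2.0 - 2.0 * (y[i] * (X[i] @ w)) / tau > 0.0:
            grad = grad - (2.0 / tau) * y[i] * X[i]
        w = project_to_ball(w - (2.0 / (mu * (t + 1))) * grad)
        w_avg += (2.0 * t / (T * (T + 1))) * w       # weights 2t / (T(T+1))
    return w_avg

def boost_sgd(X, y, tau, eps, repeats, seed=0):
    """Repeat the SGD run and keep the candidate with the smallest empirical objective."""
    rng = np.random.default_rng(seed)
    T = max(1, int(10.0 / (eps ** 2 * tau ** 2)))    # T = Omega(eps^-2 tau^-2)
    cands = [sgd_on_surrogate(X, y, tau, eps, T, rng) for _ in range(repeats)]
    best = min(cands, key=lambda w: surrogate_objective(w, X, y, tau, eps))
    return best / max(np.linalg.norm(best), 1e-12)   # normalize to a unit vector
```

The Ξ©(πœ–) strong convexity injected by the regularizer is what gives the 1/(πœ‡π‘‡) rate in Theorem 4.1 and hence the πœ–^{-2}𝜏^{-2} iteration count.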
The proof of Lemma 4.2 is deferred to Appendix E. Next, we repeat boost SGD and take an average to ensure concentration before proceeding as in Algorithm 1 with the random projection and rounding in the lower dimensional space. Compared to Algorithm 1, we obtain an improved sample dependence on 𝜏for Algorithm 2 since taking an average of πœ–-optimal solutions to a convex objective function yields an πœ–-optimal solution. However, we pay an extra factor of πœ–in order to run SGD as a subroutine. Replicable Learning of Large-Margin Halfspaces Algorithm 2 Replicable Large-Margin Halfspaces 1: 𝑛= 𝐢1πœ– 2𝜏 2 log(1/πœ–πœπœŒπ›Ώ) 2: 𝐡 𝐢2𝜏 4𝜌 2 log(1/πœ–πœπœŒπ›Ώ) 3: π‘˜ 𝐢3𝜏 2 log(1/πœ–πœπœŒπ›Ώ) 4: 𝛽 𝐢4𝜏/ log(1/πœ–πœπœŒπ›Ώ) 5: for 𝑖 1, . . . , 𝐡do 6: 𝑆𝑖 𝑛samples from π’Ÿ 7: 𝑆𝑖 {(π‘₯/ π‘₯ , 𝑦) : (π‘₯, 𝑦) 𝑆𝑖} 8: 𝑀𝑖 boost SGD(𝑆𝑖, πœ–/10, 𝜏, 𝛿/𝐡) (cf. Lemma 4.2) 9: end for 10: 𝑧 (1/𝐡) 𝑖 [𝐡] 𝑀𝑖 11: Draw 𝐴 Rπ‘˜ 𝑑with 𝐴𝑖,𝑗 𝒩(0, 1/π‘˜) (shared randomness) 12: 𝑏 AKround(𝐴𝑧, 𝛽) (shared randomness, cf. Section 2) 13: return ˆ𝑀= 𝐴 𝑏/ 𝐴 𝑏 The technical details of Algorithm 2 (the proof of Theorem 1.4) appear in Appendix E.1. 5. Replicably Learning Large-Margin Halfspaces: Algorithm 3 In this section, we provide an algorithm whose sample complexity scales as 𝑂(πœ– 1𝜏 4𝜌 2 log(1/𝛿)) and has running time poly(𝑑) (poly(1/πœ–, 1/𝜌, 1/𝜏, 1/𝛿))1/𝜏2. We remark that the sample complexity is better than the one obtained from the DP transformation in Proposition 1.5. Moreover, the running time is exponentially better than that obtained through the DP transformation. Before this, we state a very useful result due to Bun et al. (2023) related to the sample complexity of replicably learning finite hypothesis classes, which is used in the proof of Theorem 1.6. Proposition 5.1 (Theorem 5.13 in Bun et al. (2023)). Consider a finite concept class β„‹. There is a 𝜌replicable agnostic PAC learner r Learner Finite with accuracy πœ–and confidence 𝛿for β„‹with sample complexity 𝑛= 𝑂 ( log2 |𝐻|+log(1/πœŒπ›Ώ) πœ–2𝜌2 log3(1/𝜌) ) . Moreover, if π’Ÿis realizable then the sample complexity drops to 𝑂(πœ– 1𝜌 2 log2 |𝐻|). Finally, the algorithm terminates in time poly(|β„‹|, 𝑛). Description of Algorithm 3 Let us first provide the algorithm s pseudo-code. Algorithm 3 Replicable Large-Margin Halfspaces 1: π‘˜ 𝐢𝜏 2 log(1/πœ–πœπœŒπ›Ώ) 2: 𝑛 𝐢 πœ– 1𝜏 4𝜌 2 log(1/πœ–πœŒπœπ›Ώ) 3: 𝑆 batch of 𝑛i.i.d. samples from π’Ÿ 4: Draw 𝐴 Rπ‘˜ 𝑑with 𝐴𝑖,𝑗 𝒩(0, 1/π‘˜) (shared randomness) 5: 𝑆𝐴 (𝐴π‘₯1, 𝑦1), . . . , (𝐴π‘₯𝑛, 𝑦𝑛) 6: β„‹πœ a (𝜏/20)-net over vectors of length at most 1 in Rπ‘˜ 7: 𝑏 output of r Learner Finite from Proposition 5.1 with input 𝑆𝐴, β„‹πœ (shared randomness) 8: ˆ𝑀 𝐴 𝑏/ 𝐴 𝑏 Similar to the other algorithm, we first start by using a JL matrix to project the training set to a π‘˜-dimensional space, for π‘˜= Θ(𝜏 2). Then, we use a (𝜏/20)-net to cover all the unit vectors of this π‘˜-dimensional space, so the size of the net is (𝐢 /𝜏) 𝑂(𝜏 2), for some absolute constant 𝐢 > 0. We think of these points of the net as our hypothesis class β„‹πœ. By the properties of the JL transform, we can show that with high probability 1 𝛿/10, there exists a vector in this class that classifies the entire training set correctly. Moreover, we can show that this classifier has small generalization error. This is formalized in the following result, which is essentially a high-probability version of Lemma B.6 from LΓͺ Nguyen et al. (2020). Lemma 5.2. Fix πœ– , 𝛿𝐽𝐿 (0, 1). 
Let π’Ÿsatisfy Definition 1.2 with margin 𝜏and suppose 𝑀 satisfies 𝑦(𝑀 ) π‘₯ 𝜏for every (π‘₯, 𝑦) supp(π’Ÿ). For a JLmatrix 𝐴 Rπ‘˜ 𝑑, as stated in Lemma B.3 with π‘˜= Ω(𝜏 2 log(1/𝛿𝐽𝐿)), let 𝐺𝐴 R𝑑 { 1, 1} be the set of points (π‘₯, 𝑦) of supp(π’Ÿ) that satisfy 𝐴π‘₯ 2 π‘₯ 2 𝜏 π‘₯ 2/100, and, 𝑦(𝐴𝑀 / 𝐴𝑀 ) (𝐴π‘₯/ 𝐴π‘₯ ) 96𝜏/100. Let 𝐸1 be the event (over 𝐴) that Pr(π‘₯,𝑦) π’Ÿ[(π‘₯, 𝑦) 𝐺𝐴] 1 πœ– and 𝐸2 be the event (over 𝐴) that 𝐴𝑀 2 𝑀 2 𝜏 𝑀 2/100. Then it holds that Pr𝐴[𝐸1 𝐸2] 1 𝛿𝐽𝐿/πœ– . The proof of Lemma 5.2 appears in Appendix B.3.1. One way to view Lemma 5.2 is that, with probability 1 𝛿𝐽𝐿/πœ– over the random choice of 𝐴, the classifier 𝐴𝑀 / 𝐴𝑀 will have 96𝜏/100 margin on a 1 πœ– fraction of π’Ÿ, where the choice of πœ– will be specified later according to Lemma 5.2. Let us condition on this event for the rest of the proof, which we call πΈπ‘Ÿ. Let 𝑀 be the point of the net that 𝐴𝑀 / 𝐴𝑀 is rounded to. Recall that we round to the closest point on a 𝜏/20-net of the unit ball with respect to the β„“2-norm, so that 𝑀 is 𝜏/20 close to the normalized Replicable Learning of Large-Margin Halfspaces version of 𝐴𝑀 . Notice that under the event πΈπ‘Ÿ, for all points (π‘₯, 𝑦) 𝐺𝐴we have that 𝐴𝑀 𝐴π‘₯ 𝐴π‘₯ ( 𝐴𝑀 𝐴𝑀 𝑀 ) 𝐴π‘₯ 𝐴π‘₯ 𝐴𝑀 𝐴π‘₯ 𝐴π‘₯ 𝐴𝑀 96𝜏/100 𝜏/20 > 9/(10𝜏) , where the first inequality follows from Cauchy-Schwartz and the second inequality from the definition of the net. Since the hypothesis class β„‹πœhas finite size, we can use Proposition 5.1 from Bun et al. (2023) which states that 𝑂(πœ– 1𝜌 2 log(1/𝛿) log2 |β„‹|) samples are sufficient to 𝜌-replicably learn a hypothesis class β„‹in the realizable setting with error at most πœ–. One technical complication we need to handle is that π’Ÿis not necessarily realizable with respect to β„‹πœ. Nevertheless, under πΈπ‘Ÿ, we have shown that for 𝑀 β„‹πœit holds that Pr(π‘₯,𝑦) π’Ÿ[𝑦(( 𝑀 ) 𝐴π‘₯/ 𝐴π‘₯ ) < 9𝜏/10] πœ– . Let us denote by π’Ÿπ‘Ÿthe distribution π’Ÿconditioned on the event that 𝑦(( 𝑀 ) 𝐴π‘₯/ 𝐴π‘₯ ) 9𝜏/10 and π’Ÿπ‘its complement, i.e., π’Ÿconditioned on 𝑦(( 𝑀 ) 𝐴π‘₯/ 𝐴π‘₯ ) < 9𝜏/10. Then, we can express π’Ÿ= (1 𝑝𝑏) π’Ÿπ‘Ÿ+ 𝑝𝑏 π’Ÿπ‘, where 𝑝𝑏 πœ– . Hence, in a sample of size 𝑛from π’Ÿ, with probability at least 1 𝑛 πœ– we only see samples drawn i.i.d. from π’Ÿπ‘Ÿ. Let us call this event 𝐸 π‘Ÿand condition on it. Let us choose πœ– = 𝛿/(10𝑛), 𝛿𝐽𝐿= πœ– 2/10 = 𝛿2/1000𝑛so that Pr[πΈπ‘Ÿ 𝐸 π‘Ÿ] 1 𝛿. Under the events πΈπ‘Ÿ, 𝐸 π‘Ÿ, we can use the replicable learner from Bun et al. (2023) to learn the realizable distribution π’Ÿπ‘Ÿ. Run the algorithm from Proposition 5.1 with parameters 𝛿/10, πœ–/10, 𝜌/10.3 Since |β„‹| = (𝐢/𝜏)𝜏 2 for some absolute constant 𝐢, we see that we need 𝑛= 𝑂 ( πœ– 1𝜏 4𝜌 2 log(1/𝛿) ) samples. Replicability of Algorithm 3. Let us condition on the events πΈπ‘Ÿ, 𝐸 π‘Ÿacross two executions of the algorithm. Since, the matrix 𝐴is the same, due to shared randomness, under these events the samples that the learner of Bun et al. (2023) receives are i.i.d. from the same realizable distribution. Then, the replicability guarantee follows immediately from the guarantees of Proposition 5.1 and the fact that all the events we have conditioned on occur with probability at least 1 𝜌/2. Correctness of Algorithm 3. Here we examine the population error of the output of Proposition 5.1. Note that this 3We assume that 𝛿< 𝜌, πœ–, otherwise we normalize 𝛿to get the bound. 
error is analyzed with respect to the original distribution π’Ÿ, which is not necessarily realizable, rather than π’Ÿπ‘Ÿ, which is the distribution on which we run the learning algorithm. Let us again condition on the events πΈπ‘Ÿ, 𝐸 π‘Ÿ, and the event that the output of Proposition 5.1 satisfies the generalization bound it states. We assume without loss of generality that 𝛿 πœ–/2, otherwise we set 𝛿= πœ–/2 without affecting the overall sample complexity of our algorithm. Notice that these events occur with probability at least 1 𝛿. Let 𝑏be the output of the algorithm. Then we have Pr (π‘₯,𝑦) π’Ÿ[𝑦(𝑏 π‘₯) < 0] = (1 𝑝𝑏) Pr (π‘₯,𝑦) π’Ÿπ‘Ÿ[𝑦(𝑏 π‘₯) < 0] + 𝑝𝑏 Pr (π‘₯,𝑦) π’Ÿπ‘[𝑦(𝑏 π‘₯) < 0] (1 𝑝𝑏)πœ–/10 + 𝑝𝑏< πœ–. This concludes the proof. Sample Complexity & Running Time of Algorithm 3. As noted before, we need 𝑛= 𝑂 ( πœ– 1𝜏 4𝜌 2 log(1/𝛿) ) samples. Since we apply Proposition 5.1, we incur a running time of poly(𝑑) poly(1/πœ–, 1/𝜌, 1/𝜏, 1/𝛿)1/𝜏2. 6. Conclusion In this work, we have developed new algorithms for replicably learning large-margin halfspaces. Our results vastly improve upon prior work on this problem. We believe that many immediate questions for future research arise from our work. First, it is natural to ask whether there are efficient algorithms that can achieve the 𝑂(πœ– 1𝜏 4𝜌 2 log(1/𝛿)) sample complexity bound of Algorithm 3. Also, it would be interesting to see if there are any (not necessarily efficient) replicable algorithms whose sample complexity scales as 𝑂(πœ– 1𝜏 2𝜌 2) or if there is some inherent barrier to pushing the dependence on 𝜏below 𝜏 4. Finally, our analysis of Algorithm 1 is pessimistic in the sense that it uses a pigeonhole principle argument to establish that the fraction of the population where the aggregate vector does not have margin Ω(𝜏) is 𝑂(1/𝜏3𝑛). It would be interesting to see whether this bound can be improved to 𝑂(1/𝜏2𝑛) using a different analysis, which would reduce the overall dependence of the algorithm on 𝜏. Acknowledgements Kasper Green Larsen is supported by Independent Research Fund Denmark (DFF) Sapere Aude Research Leader grant No 9064-00068B. Amin Karbasi acknowledges funding in direct support of this work from NSF (IIS-1845032), ONR (N0001419-1-2406), and the AI Institute for Learning-Enabled Optimization at Scale (TILOS). Grigoris Velegkas is supported by TILOS, the Onas- Replicable Learning of Large-Margin Halfspaces sis Foundation, and the Bodossaki Foundation. Felix Zhou is supported by TILOS. Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here. Dimitris Achlioptas. Database-friendly random projections. In Proceedings of the twentieth ACM SIGMODSIGACT-SIGART symposium on Principles of database systems, pages 274 281, 2001. 14 Noga Alon and Bo az Klartag. Optimal compression of approximate inner products and dimension reduction. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 639 650. IEEE, 2017. 3, 5 Monya Baker. 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 2016. 1 Maria-Florina Balcan and Vitaly Feldman. Statistical active learning algorithms for noise tolerance and differential privacy. Algorithmica, 72(1):282 315, 2015. doi: 10.1007/S00453-014-9954-9. URL https://doi. org/10.1007/s00453-014-9954-9. 2 Philip Ball. Is ai leading to a reproducibility crisis in science? 
Nature, 624(7990):22 25, 2023. 1 Raef Bassily, Mehryar Mohri, and Ananda Theertha Suresh. Differentially private learning with margin guarantees. Advances in Neural Information Processing Systems, 35:32127 32141, 2022a. 5 Raef Bassily, Mehryar Mohri, and Ananda Theertha Suresh. Open problem: Better differentially private learning algorithms with margin guarantees. In Conference on Learning Theory, pages 5638 5643. PMLR, 2022b. 5 Amos Beimel, Shay Moran, Kobbi Nissim, and Uri Stemmer. Private center points and learning of halfspaces. In Conference on Learning Theory, pages 269 282. PMLR, 2019. 5 Avrim Blum, Adam Kalai, and Hal Wasserman. Noisetolerant learning, the parity problem, and the statistical query model. Journal of the ACM (JACM), 50(4):506 519, 2003. 5 Avrim Blum, Cynthia Dwork, Frank Mc Sherry, and Kobbi Nissim. Practical privacy: the sulq framework. In Proceedings of the twenty-fourth ACM SIGMOD-SIGACTSIGART symposium on Principles of database systems, pages 128 138, 2005. 5 SΓ©bastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231 357, 2015. 7 Mark Bun, Marco Leandro Carmosino, and Jessica Sorrell. Efficient, noise-tolerant, and private learning via boosting. In Conference on Learning Theory, pages 1031 1077. PMLR, 2020. 5 Mark Bun, Marco Gaboardi, Max Hopkins, Russell Impagliazzo, Rex Lei, Toniann Pitassi, Satchit Sivakumar, and Jessica Sorrell. Stability is stable: Connections between replicability, privacy, and adaptive generalization. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, pages 520 527, 2023. 1, 2, 4, 5, 8, 9, 18 Zachary Chase, Bogdan Chornomaz, Shay Moran, and Amir Yehudayoff. Local borsuk-ulam, stability, and replicability. ar Xiv preprint ar Xiv:2311.01599, 2023a. 1, 5 Zachary Chase, Shay Moran, and Amir Yehudayoff. Stability and replicability in learning. In 64th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2023, Santa Cruz, CA, USA, November 6-9, 2023, pages 2430 2439. IEEE, 2023b. doi: 10.1109/FOCS57990. 2023.00148. URL https://doi.org/10.1109/ FOCS57990.2023.00148. 1, 5 Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20:273 297, 1995. 1, 3 Ilias Diakonikolas, Daniel Kane, and Pasin Manurangsi. Nearly tight bounds for robust proper learning of halfspaces with a margin. Advances in Neural Information Processing Systems, 32, 2019. 5 Ilias Diakonikolas, Jelena Diakonikolas, Daniel M Kane, Puqian Wang, and Nikos Zarifis. Informationcomputation tradeoffs for learning margin halfspaces with random classification noise. In The Thirty Sixth Annual Conference on Learning Theory, pages 2211 2239. PMLR, 2023. 5 Peter Dixon, Aduri Pavan, Jason Vander Woude, and N. V. Vinodchandran. List and certificate complexities in replicable learning. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Replicable Learning of Large-Margin Halfspaces Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, Neur IPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. 1, 5 John Dunagan and Santosh S. Vempala. A simple polynomial-time rescaling algorithm for solving linear programs. In LΓ‘szlΓ³ Babai, editor, Proceedings of the 36th Annual ACM Symposium on Theory of Computing, Chicago, IL, USA, June 13-16, 2004, pages 315 320. ACM, 2004. doi: 10.1145/ 1007352.1007404. URL https://doi.org/10. 1145/1007352.1007404. 
2 Eric Eaton, Marcel Hussing, Michael Kearns, and Jessica Sorrell. Replicable reinforcement learning. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, Neur IPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. 1, 5 Hossein Esfandiari, Alkis Kalavasis, Amin Karbasi, Andreas Krause, Vahab Mirrokni, and Grigoris Velegkas. Replicable bandits. In The Eleventh International Conference on Learning Representations, 2023a. 1, 5 Hossein Esfandiari, Amin Karbasi, Vahab Mirrokni, Grigoris Velegkas, and Felix Zhou. Replicable clustering. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, Neur IPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023b. 1, 5 Dimitris Fotakis, Alkis Kalavasis, Vasilis Kontonis, and Christos Tzamos. Efficient algorithms for learning from coarse labels. In Conference on Learning Theory, pages 2060 2079. PMLR, 2021. 5 Casper Benjamin Freksen. An introduction to johnson-lindenstrauss transforms. ar Xiv preprint ar Xiv:2103.00564, 2021. 14 Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55 (1):119 139, 1997. 1, 5 Yoav Freund and Robert E Schapire. Large margin classification using the perceptron algorithm. In Proceedings of the eleventh annual conference on Computational learning theory, pages 209 217, 1998. 5 Badih Ghazi, Ravi Kumar, and Pasin Manurangsi. Userlevel differentially private learning via correlated sampling. In Marc Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, Neur IPS 2021, December 6-14, 2021, virtual, pages 20172 20184, 2021. 1 Surbhi Goel, Aravind Gollakota, and Adam Klivans. Statistical-query lower bounds via functional gradients. Advances in Neural Information Processing Systems, 33: 2147 2158, 2020. 5 Allan GrΓΈnlund, Lior Kamma, and Kasper Green Larsen. Near-tight margin-based generalization bounds for support vector machines. In International Conference on Machine Learning, pages 3779 3788. PMLR, 2020. 3, 5, 6, 13 Anupam Gupta, Moritz Hardt, Aaron Roth, and Jonathan Ullman. Privately releasing conjunctions and the statistical query barrier. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pages 803 812, 2011. 5 Russell Impagliazzo, Rex Lei, Toniann Pitassi, and Jessica Sorrell. Reproducibility in learning. In Stefano Leonardi and Anupam Gupta, editors, STOC 22: 54th Annual ACM SIGACT Symposium on Theory of Computing, Rome, Italy, June 20 - 24, 2022, pages 818 831. ACM, 2022. doi: 10.1145/3519935.3519973. URL https: //doi.org/10.1145/3519935.3519973. 1, 2, 3, 4, 5 Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pages 604 613, 1998. 14 William B Johnson. Extensions of lipshitz mapping into hilbert space. In Conference modern analysis and probability, 1984, pages 189 206, 1984. 14 Alkis Kalavasis, Amin Karbasi, Shay Moran, and Grigoris Velegkas. 
Statistical indistinguishability of learning algorithms. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 15586 15622. PMLR, 2023. URL https://proceedings.mlr. press/v202/kalavasis23a.html. 1, 2, 4, 5 Replicable Learning of Large-Margin Halfspaces Alkis Kalavasis, Amin Karbasi, Grigoris Velegkas, and Felix Zhou. On the computational landscape of replicable learning. ar Xiv preprint ar Xiv:2405.15599, 2024. 5 Haim Kaplan, Yishay Mansour, Uri Stemmer, and Eliad Tsfadia. Private learning of halfspaces: Simplifying the construction and reducing the sample complexity. Advances in Neural Information Processing Systems, 33: 13976 13985, 2020. 5 Amin Karbasi, Grigoris Velegkas, Lin Yang, and Felix Zhou. Replicability in reinforcement learning. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, Neur IPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. 1, 5 Guy Kindler, Anup Rao, Ryan O Donnell, and Avi Wigderson. Spherical cubes: optimal foams from computational hardness amplification. Communications of the ACM, 55 (10):90 97, 2012. 2 Jonas Moritz Kohler and Aurelien Lucchi. Sub-sampled cubic regularization for non-convex optimization. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1895 1904. PMLR, 06 11 Aug 2017. URL https://proceedings.mlr. press/v70/kohler17a.html. 13 Huy LΓͺ Nguyen, Jonathan Ullman, and Lydia Zakynthinou. Efficient private algorithms for learning large-margin halfspaces. In Algorithmic Learning Theory, pages 704 724. PMLR, 2020. 4, 5, 7, 8, 14, 18 Maryanthe Malliaris and Shay Moran. The unstable formula theorem revisited. ar Xiv preprint ar Xiv:2212.05050, 2022. 5 Shay Moran, Hilla Schefler, and Jonathan Shafer. The bayesian stability zoo. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, Neur IPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. 5 Kobbi Nissim, Uri Stemmer, and Salil P. Vadhan. Locating a small cluster privately. In Tova Milo and Wang-Chiew Tan, editors, Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pages 413 427. ACM, 2016. doi: 10.1145/2902251.2902296. URL https: //doi.org/10.1145/2902251.2902296. 4 Joelle Pineau, Koustuv Sinha, Genevieve Fried, Rosemary Nan Ke, and Hugo Larochelle. Iclr reproducibility challenge 2019. Re Science C, 5(2):5, 2019. 1 Frank Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review, 65(6):386, 1958. 1, 5 Leslie G Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134 1142, 1984. 2 Vladimir Vapnik. The nature of statistical learning theory. Springer science & business media, 1999. 5 Vladimir Vapnik. Estimation of dependences based on empirical data. Springer Science & Business Media, 2006. 3 Replicable Learning of Large-Margin Halfspaces A. 
Comparison of Results Table A.1. A comparison of prior work and our work. We denote by 𝑑the dimension, πœ–the accuracy, 𝜌the replicability, and 𝜏the margin of the halfspace. We omit the logarithmic factors on the sample complexity and the runtime. Replicable Algorithms for Large-Margin Halfspaces Algorithms Sample Complexity Running Time Proper [ILPS22] with foams rounding (π‘‘πœ– 3𝜏 8𝜌 2)1.01 2𝑑 poly(1/πœ–, 1/𝜌, 1/𝜏) No [ILPS22] with box rounding (𝑑3πœ– 4𝜏 10𝜌 2)1.01 poly(𝑑, 1/πœ–, 1/𝜌, 1/𝜏) No Algorithm 1 (Theorem 1.3) πœ– 1𝜏 7𝜌 2 poly(𝑑, 1/πœ–, 1/𝜌, 1/𝜏) Yes Algorithm 2 (Theorem 1.4) πœ– 2𝜏 6𝜌 2 poly(𝑑, 1/πœ–, 1/𝜌, 1/𝜏) Yes [LNUZ20] via DP-to-Replicability reduction [BGH+23] (Proposition 1.5) πœ– 2𝜏 4𝜌 2 poly(𝑑) exp ( (1/𝜏) log(1/(πœ–πœŒ)) Algorithm 3 (Theorem 1.6) πœ– 1𝜏 4𝜌 2 poly(𝑑) poly(1/πœ–, 1/𝜌, 1/𝜏)1/𝜏2 Yes B. Deferred Tools B.1. SVM guarantees The following result is a restatement of Theorem 2 from GrΓΈnlund et al. (2020) Lemma B.1 (SVM Generalization Guarantee (GrΓΈnlund et al., 2020)). Let π’Ÿbe a distribution over R𝑑 { 1, 1}. Let 𝑛 N, 𝑆= (π‘₯1, 𝑦1), . . . , (π‘₯𝑛, 𝑦𝑛) π’Ÿπ‘›, and 𝑀 R𝑑be a unit vector such that 𝑦𝑖 ( 𝑀 π‘₯𝑖 π‘₯𝑖 ) 𝜏, 𝑖 [𝑛] Then, for every 𝛿> 0 with probability at least 1 𝛿over the random draw of 𝑆, it holds that Pr (π‘₯,𝑦) π’Ÿ[𝑦 (𝑀 π‘₯/ π‘₯ ) < 𝜏/2] 𝑂 ( log 𝑛+ log(1/𝛿) We remark that even though the result of GrΓΈnlund et al. (2020) is stated for the generalization error with respect to the misclassification probability, i.e., Pr(π‘₯,𝑦) π’Ÿ[𝑦 (𝑀 π‘₯/ π‘₯ ) 0], their argument also applies to the 𝜏/2-margin loss, i.e., Pr(π‘₯,𝑦) π’Ÿ[𝑦 (𝑀 π‘₯/ π‘₯ ) < 𝜏/2], via a straightforward modification of some constants. In more detail, the only needed change is in the proof of part 1 of Claim 10 in GrΓΈnlund et al. (2020). Here the first condition says 𝑦 (π‘₯ 𝑀) 0 and could be made e.g. 𝑦 (π‘₯ 𝑀) 𝜏/4. Their result then holds with 𝜏/4 instead of 𝜏/2. B.2. Vector-Valued Bernstein Concentration Inequality We will use the following concentration inequality for the norm of random vectors. Lemma B.2 (Vector Bernstein; (Kohler and Lucchi, 2017)). Let 𝑋1, . . . , 𝑋𝐡be independent random vectors with common dimension 𝑑satisfying the following for all 𝑖 [𝐡]: (i) E[𝑋𝑖] = 0 (iii) E[ 𝑋𝑖 2] 𝜎2 Replicable Learning of Large-Margin Halfspaces 𝐡 𝐡 𝑖=1 𝑋𝑖. Then for any 𝑑 (0, 𝜎2/πœ‡), Pr[ 𝑍 𝑑] exp ( 𝑑2𝐡 B.3. Johnson-Lindenstrauss Lemma The remarkable result of Johnson (1984) states that an appropriately scaled random orthogonal projection matrix preserves the norm of a unit vector with high probability. Indyk and Motwani (1998) showed that it suffices to independently sample each entry of the matrix from the standard normal distribution. Achlioptas (2001) further simplified the construction to independently sample each entry from the Rademacher distribution. See Freksen (2021) for a detailed survey of the development. From hereonforth, we say that a random matrix 𝐴 Rπ‘˜ 𝑑is a JL-matrix if either 𝐴𝑖𝑗 𝑖.𝑖.𝑑. 𝒩(0, 1/π‘˜) or 𝐴𝑖𝑗 𝑖.𝑖.𝑑. π‘ˆ{ 1/ We first state the standard distributional formulation of JL. Lemma B.3 (Distributional JL; (Johnson, 1984; Indyk and Motwani, 1998; Achlioptas, 2001)). Fix πœ–, 𝛿𝐽𝐿 (0, 1). Let 𝐴 Rπ‘˜ 𝑑be a JL-matrix for π‘˜= Ω(πœ– 2 log(1/𝛿𝐽𝐿)). Then for any π‘₯ R𝑑, Pr 𝐴[ 𝐴π‘₯ 2 π‘₯ 2 > πœ– π‘₯ 2] 𝛿𝐽𝐿. Let 𝑇be a set of vectors. By applying Lemma B.3 to the 𝑂(|𝑇|2) vectors 𝑒 𝑣for 𝑒, 𝑣 𝑇and taking a union bound, we immediately deduce the following result. 
Let 𝑇 be a set of vectors. By applying Lemma B.3 to the 𝑂(|𝑇|^2) vectors 𝑒 - 𝑣 for 𝑒, 𝑣 ∈ 𝑇 and taking a union bound, we immediately deduce the following result.

Lemma B.4 (JL Projection; (Johnson, 1984; Indyk and Motwani, 1998; Achlioptas, 2001)). Fix πœ–, 𝛿_JL ∈ (0, 1). Consider a set 𝑇 of 𝑑-dimensional vectors and a JL-matrix 𝐴 ∈ R^{π‘˜Γ—π‘‘} for π‘˜ = Ξ©(πœ–^{-2} log(|𝑇|/𝛿_JL)). Then, Pr_𝐴[βˆƒ 𝑒, 𝑣 ∈ 𝑇 : |‖𝐴(𝑒 - 𝑣)β€–^2 - ‖𝑒 - 𝑣‖^2| > πœ– ‖𝑒 - 𝑣‖^2] ≀ 𝛿_JL.

An application of Lemma B.4 together with the polarization identity for 𝑧, π‘₯ ∈ R^𝑑, namely 4π‘§βŠ€π‘₯ = ‖𝑧 + π‘₯β€–^2 - ‖𝑧 - π‘₯β€–^2, yields the following inner product preservation guarantee.

Corollary B.5 (JL Inner Product Preservation). Fix πœ–, 𝛿_JL ∈ (0, 1). Let 𝐴 ∈ R^{π‘˜Γ—π‘‘} be a JL-matrix for π‘˜ = Ξ©(πœ–^{-2} log(1/𝛿_JL)). Then, for any π‘₯, 𝑧 ∈ R^𝑑, Pr_𝐴[|π‘§βŠ€π‘₯ - (𝐴𝑧)^⊀𝐴π‘₯| > πœ– ‖𝑧‖ β€–π‘₯β€–] ≀ 𝛿_JL.

The next lemma is another simple implication of the distributional JL.

Lemma B.6 (Lemma 5 in LΓͺ Nguyen et al. (2020)). Let π’Ÿ satisfy Definition 1.2 with margin 𝜏. For a JL-matrix 𝐴 as stated in Lemma B.3, let 𝐺_𝐴 βŠ† R^𝑑 Γ— {-1, 1} be the set of points (π‘₯, 𝑦) of the population that satisfy |‖𝐴π‘₯β€–^2 - β€–π‘₯β€–^2| ≀ 𝜏 β€–π‘₯β€–^2/100 and 𝑦 Β· (𝐴𝑀^*/‖𝐴𝑀^*β€–)^⊀(𝐴π‘₯/‖𝐴π‘₯β€–) β‰₯ 96𝜏/100. Then it holds that Pr_{𝐴,(π‘₯,𝑦)βˆΌπ’Ÿ}[(π‘₯, 𝑦) ∈ 𝐺_𝐴] β‰₯ 1 - 𝛿_JL.

B.3.1. The Proof of Lemma 5.2

We finally prove Lemma 5.2, whose statement we repeat below for convenience.

Lemma 5.2. Fix πœ–', 𝛿_JL ∈ (0, 1). Let π’Ÿ satisfy Definition 1.2 with margin 𝜏 and suppose 𝑀^* satisfies 𝑦 Β· ((𝑀^*)^⊀π‘₯/β€–π‘₯β€–) β‰₯ 𝜏 for every (π‘₯, 𝑦) ∈ supp(π’Ÿ). For a JL-matrix 𝐴 ∈ R^{π‘˜Γ—π‘‘}, as stated in Lemma B.3 with π‘˜ = Ξ©(𝜏^{-2} log(1/𝛿_JL)), let 𝐺_𝐴 βŠ† R^𝑑 Γ— {-1, 1} be the set of points (π‘₯, 𝑦) of supp(π’Ÿ) that satisfy |‖𝐴π‘₯β€–^2 - β€–π‘₯β€–^2| ≀ 𝜏 β€–π‘₯β€–^2/100, and 𝑦 Β· (𝐴𝑀^*/‖𝐴𝑀^*β€–)^⊀(𝐴π‘₯/‖𝐴π‘₯β€–) β‰₯ 96𝜏/100. Let 𝐸_1 be the event (over 𝐴) that Pr_{(π‘₯,𝑦)βˆΌπ’Ÿ}[(π‘₯, 𝑦) ∈ 𝐺_𝐴] β‰₯ 1 - πœ–' and 𝐸_2 be the event (over 𝐴) that |‖𝐴𝑀^*β€–^2 - ‖𝑀^*β€–^2| ≀ 𝜏 ‖𝑀^*β€–^2/100. Then it holds that Pr_𝐴[𝐸_1 ∩ 𝐸_2] β‰₯ 1 - 2𝛿_JL/πœ–'.

Proof of Lemma 5.2. We know from Lemma B.6 that E_𝐴[Pr_{(π‘₯,𝑦)βˆΌπ’Ÿ}[(π‘₯, 𝑦) βˆ‰ 𝐺_𝐴]] ≀ 𝛿_JL, so Markov's inequality gives that Pr_𝐴[Pr_{(π‘₯,𝑦)βˆΌπ’Ÿ}[(π‘₯, 𝑦) βˆ‰ 𝐺_𝐴] β‰₯ πœ–'] ≀ E_𝐴[Pr_{(π‘₯,𝑦)βˆΌπ’Ÿ}[(π‘₯, 𝑦) βˆ‰ 𝐺_𝐴]]/πœ–' ≀ 𝛿_JL/πœ–'. Similarly, the guarantee of the JL projection immediately yields Pr_𝐴[|‖𝐴𝑀^*β€–^2 - ‖𝑀^*β€–^2| > 𝜏 ‖𝑀^*β€–^2/100] ≀ 𝛿_JL. A union bound over the two failure events completes the proof, since 𝛿_JL ≀ 𝛿_JL/πœ–'.

C. Details for Section 2

Here we fill in the missing proofs from Section 2. First, we restate and prove Lemma 2.1.

Lemma 2.1 (Stability of Rounding). Let 𝑧, 𝑧' ∈ R^π‘˜. Then for independent uniform offsets π‘œ_1, . . . , π‘œ_π‘˜ ∈ [0, 𝛽] and thresholds 𝑒_1, . . . , 𝑒_π‘˜ ∈ [0, 1], we have Pr_{π‘œ,𝑒}[𝑓_{π‘œ,𝑒}(𝑧) β‰  𝑓_{π‘œ,𝑒}(𝑧')] ≀ 2𝛽^{-1} ‖𝑧 - 𝑧'β€–_1.

Proof of Lemma 2.1. Fix a coordinate 𝑖 ∈ [π‘˜]. The probability that π‘œ(𝑧)[𝑖] β‰  π‘œ(𝑧')[𝑖] is at most |𝑧[𝑖] - 𝑧'[𝑖]| 𝛽^{-1}. Assume π‘œ(𝑧)[𝑖] = π‘œ(𝑧')[𝑖]. Then, by the definition of 𝑝(𝑧), 𝑝(𝑧'), we have that |𝑧[𝑖] - 𝑧'[𝑖]| = |(𝑝(𝑧)[𝑖] - 𝑝(𝑧')[𝑖]) π‘œ(𝑧)[𝑖] + (𝑝(𝑧')[𝑖] - 𝑝(𝑧)[𝑖]) (π‘œ(𝑧)[𝑖] + 𝛽)| = |(𝑝(𝑧')[𝑖] - 𝑝(𝑧)[𝑖]) 𝛽|. Note then that, conditioned on this event, the probability of 𝑓_{π‘œ,𝑒}(𝑧)[𝑖] β‰  𝑓_{π‘œ,𝑒}(𝑧')[𝑖] is |𝑝(𝑧')[𝑖] - 𝑝(𝑧)[𝑖]| = |𝑧[𝑖] - 𝑧'[𝑖]| 𝛽^{-1}, by the uniform choice of 𝑒_𝑖. We thus conclude that Pr_{π‘œ,𝑒}[𝑓_{π‘œ,𝑒}(𝑧)[𝑖] β‰  𝑓_{π‘œ,𝑒}(𝑧')[𝑖]] ≀ 2𝛽^{-1} |𝑧[𝑖] - 𝑧'[𝑖]|. A union bound over all π‘˜ coordinates implies Pr_{π‘œ,𝑒}[𝑓_{π‘œ,𝑒}(𝑧) β‰  𝑓_{π‘œ,𝑒}(𝑧')] ≀ 2𝛽^{-1} ‖𝑧 - 𝑧'β€–_1.

Next, we prove Lemma 2.2, whose statement is repeated below.

Lemma 2.2 (Rounding Preserves Inner Products). Let 𝑧, π‘₯ ∈ R^π‘˜ be such that β€–π‘₯β€– ≀ 1. For uniform offsets π‘œ_1, . . . , π‘œ_π‘˜ ∈ [0, 𝛽] and thresholds 𝑒_1, . . . , 𝑒_π‘˜ ∈ [0, 1], we have Pr_{π‘œ,𝑒}[|𝑓_{π‘œ,𝑒}(𝑧)^⊀π‘₯ - π‘§βŠ€π‘₯| > 𝛼] ≀ 2 exp(-2𝛼^2 𝛽^{-2}).

Proof of Lemma 2.2. Each of the random variables 𝑓_{π‘œ,𝑒}(𝑧)[𝑖] lies in an interval of length 𝛽, is independent of the others, and has expectation 𝑧[𝑖]. By linearity, 𝑓_{π‘œ,𝑒}(𝑧)^⊀π‘₯ - π‘§βŠ€π‘₯ = (𝑓_{π‘œ,𝑒}(𝑧) - 𝑧)^⊀π‘₯. Let 𝑣 = 𝑓_{π‘œ,𝑒}(𝑧) - 𝑧. Then E[π‘₯[𝑖] 𝑣[𝑖]] = 0 and π‘₯[𝑖] 𝑣[𝑖] lies in a range of length 𝛽 |π‘₯[𝑖]|. By Hoeffding's inequality, we have Pr[|π‘£βŠ€π‘₯| > 𝛼] < 2 exp(-2𝛼^2/(Ξ£_𝑖 𝛽^2 π‘₯[𝑖]^2)) ≀ 2 exp(-2𝛼^2 𝛽^{-2}), since β€–π‘₯β€– ≀ 1.
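For intuition, the following sketch gives one plausible instantiation of the shared-randomness rounding 𝑓_{π‘œ,𝑒} analyzed above (the function name ak_round and every constant are ours, not the paper's; the exact definition of the scheme appears in Section 2). Each coordinate is snapped to a grid of spacing 𝛽 shifted by a shared offset, rounding up with probability equal to its fractional position inside the cell, as decided by a shared threshold, which makes the output unbiased coordinate-wise.

```python
import numpy as np

def ak_round(z, beta, offsets, thresholds):
    """One plausible instantiation of the shared-randomness rounding f_{o,u}
    from Section 2 (illustrative sketch; the paper's exact definition may differ).
    Coordinate i is mapped to a grid point of spacing beta shifted by offsets[i];
    it rounds up exactly when thresholds[i] falls below its fractional position."""
    z = np.asarray(z, dtype=float)
    lower = offsets + beta * np.floor((z - offsets) / beta)   # o(z)[i]: lower grid point
    frac = (z - lower) / beta                                  # p(z)[i] in [0, 1)
    return lower + beta * (thresholds < frac)                  # unbiased: E[f(z)[i]] = z[i]

# Shared randomness: both executions derive (offsets, thresholds) from the same seed.
rng = np.random.default_rng(seed=0)
k, beta = 32, 0.05
offsets = rng.uniform(0.0, beta, size=k)
thresholds = rng.uniform(0.0, 1.0, size=k)

z = rng.normal(size=k)
z_prime = z + 1e-4 * rng.normal(size=k)   # a nearby vector, e.g. from a second sample
b = ak_round(z, beta, offsets, thresholds)
b_prime = ak_round(z_prime, beta, offsets, thresholds)
print(np.array_equal(b, b_prime))  # equal with prob >= 1 - 2 * ||z - z'||_1 / beta (Lemma 2.1)
```

Under shared (offsets, thresholds), the two executions disagree on coordinate 𝑖 only when 𝑧[𝑖] and 𝑧'[𝑖] are separated by a random grid point or a random threshold, which are exactly the events bounded in the proof of Lemma 2.1.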
Let 𝑣= π‘“π‘œ,𝑒(𝑧) 𝑧. Then E[π‘₯[𝑖] 𝑣[𝑖]] = 0 and π‘₯[𝑖] 𝑣[𝑖] lies in a range of length 𝛽π‘₯[𝑖]. By Hoeffding s inequality, we have Pr[|𝑣 π‘₯| > 𝛼] < 2 exp ( 2𝛼2/( 𝑖𝛽2π‘₯[𝑖]2) ) 2 exp ( 2𝛼2𝛽 2) . Replicable Learning of Large-Margin Halfspaces D. Details for Algorithm 1 Here we fill in the missing proofs from Section 3. We begin with Lemma 3.1, whose statement we repeat below for convenience. Lemma 3.1. Fix πœ–, 𝜏, 𝛿 (0, 1) and let π’Ÿbe a distribution over R𝑑 { 1} that admits a linear classifier with 𝜏-margin. Suppose 𝑧is a random unit vector satisfying Pr(π‘₯,𝑦) π’Ÿ[𝑦(𝑧 π‘₯/ π‘₯ ) < 𝜏/2] πœ–with probability at least 1 𝛿over the draw of 𝑧. Define π‘˜= Ω(𝜏 2 log(1/πœ–π›Ώ)) and 𝛽= 𝑂(𝜏/ log(1/πœ–π›Ώ)). If 𝐴 Rπ‘˜ 𝑑is a JL-matrix (cf. Appendix B.3) and 𝑏= AKround(𝐴𝑧, 𝛽) (cf. Section 2), then ˆ𝑀= 𝐴 𝑏/ 𝐴 𝑏 satisfies Pr(π‘₯,𝑦) π’Ÿ[𝑦( ˆ𝑀 π‘₯/ π‘₯ ) 0] 2πœ–with probability at least 1 2𝛿. Proof of Lemma 3.1. Let π’Ÿπ‘§:= π’Ÿ| {𝑦(𝑧 π‘₯/ π‘₯ ) 𝜏/2} be the distribution obtained from π’Ÿby conditioning on 𝑧having large margin and let π’Ÿπ‘denote the distribution π’Ÿconditioned on the complement. We can decompose π’Ÿ= (1 𝑝𝑏) π’Ÿπ‘§+ 𝑝𝑏 π’Ÿπ‘for 𝑝𝑏 πœ–. To complete the correctness argument, we need to show that with high probability, the projection step of 𝑧, π‘₯onto the low-dimensional space 𝐴𝑧, 𝐴π‘₯and the rounding 𝑏= AKround(𝐴𝑧, 𝛽) approximately preserves the inner product 𝑦(𝑏 𝐴π‘₯/ π‘₯ ) for a 1 πœ–fraction of π’Ÿπ‘§. Then the final classifier 𝐴 𝑏still has low population error. In particular, it suffices to show that Pr(π‘₯,𝑦) π’Ÿπ‘§[𝑦(𝑏 𝐴π‘₯/ π‘₯ ) < 𝜏/4] πœ–with probability at least 1 𝛿over the random choice of 𝐴, 𝑏. Then, the total population error is at most πœ–+ 𝑝𝑏 2πœ–with probability at least 1 2𝛿. Fix (π‘₯, 𝑦) supp(π’Ÿπ‘§) and remark that = 𝑦 (𝐴𝑧) 𝐴 ( π‘₯ ) 𝑦 (𝐴π‘₯ 𝑏) 𝐴π‘₯ By the choices of π‘˜ Ω(𝜏 2 log(1/πœ–π›Ώ)), 𝛽 𝑂(𝜏/ log(1/πœ–π›Ώ)), the JL lemma (cf. Corollary B.5) and Alon-Kartag rounding scheme (cf. Lemma 2.2) ensures that [ Pr (π‘₯,𝑦) π’Ÿπ‘§[𝑦(𝑏 𝐴π‘₯/ π‘₯ ) < 𝜏/4] ] πœ–π›Ώ. An application of Markov s inequality yields [ Pr (π‘₯,𝑦) π’Ÿπ‘§[𝑦(𝑏 𝐴π‘₯/ π‘₯ ) < 𝜏/4] > πœ– ] 𝛿. All in all, with probability at least 1 𝛿over 𝐴, 𝑏, the population error is at most Pr (π‘₯,𝑦) π’Ÿ[𝑦(𝐴 𝑏) π‘₯ 0] Pr (π‘₯,𝑦) π’Ÿ[𝑦(𝑏 𝐴π‘₯/ π‘₯ ) < Θ(𝜏)] < 2πœ–, concluding the correctness argument of our algorithm. Next, we prove Lemma 3.2, whose statement we again repeat. Lemma 3.2. Fix πœ–, 𝜏, 𝜌, 𝛿 (0, 1). Suppose 𝑀1, . . . , 𝑀𝐡and 𝑀 1, . . . , 𝑀 𝐡are i.i.d. random unit vectors for 𝐡= Ω(𝜏 4𝜌 2 log(1/πœ–πœπœŒπ›Ώ)) and 𝑧= (1/𝐡) 𝑖 [𝐡] 𝑀𝑖, 𝑧 = (1/𝐡) 𝑖 [𝐡] 𝑀 𝑖are their averages. Define π‘˜= Θ(𝜏 2 log(1/πœ–πœπœŒπ›Ώ)) and 𝛽= Θ(𝜏/ log(1/πœ–πœπœŒπ›Ώ). If 𝐴 Rπ‘˜ 𝑑is a JL-matrix (cf. Appendix B.3) and 𝑏= AKround(𝐴𝑧, 𝛽), 𝑏 = AKround(𝐴𝑧 , 𝛽) (cf. Section 2), then 𝑏= 𝑏 with probability at least 1 𝜌over the draw of the 𝑀 𝑖, 𝑀𝑖 s, 𝐴, and AKround. Replicable Learning of Large-Margin Halfspaces Proof of Lemma 3.2. An application of the vector Bernstein concentration inequality (cf. Lemma B.2) yields the following: let 𝑆1 = { 𝑀(1) 𝑖 } 𝑖 [𝐡] and 𝑆2 = { 𝑀(2) 𝑖 } 𝑖 [𝐡] be two sets of independent and identically distributed random vectors in R𝑑 with 𝑀(𝑗) 𝑖 1 for all 𝑖 [𝐡], 𝑗 {1, 2}. Then we can set 𝑋𝑖= 𝑀(1) 𝑖 E 𝑀. Notice that 𝑋𝑖 2 and 𝑋𝑖 2 4. Hence, an application of Lemma B.2 yields for any 𝑑 (0, 2) 𝑀(1) 𝑖 E[𝑀] 𝑑 ] ) = exp ( Ω( 𝑑2𝐡) ) 𝑖 [𝐡] 𝑀(1) 𝑖 , 𝑧 = 1 𝐡 𝑖 [𝐡] 𝑀(2) 𝑖 . Choosing 𝐡 Ω(𝑑 2 log(1/𝜌)) ensures that Pr𝑆1,𝑆2 [ 𝑧 𝑧 𝑑] 𝜌/10. 
Next, we prove Lemma 3.2, whose statement we again repeat.

Lemma 3.2. Fix πœ–, 𝜏, 𝜌, 𝛿 ∈ (0, 1). Suppose 𝑀_1, . . . , 𝑀_𝐡 and 𝑀'_1, . . . , 𝑀'_𝐡 are i.i.d. random unit vectors for 𝐡 = Ξ©(𝜏^{-4} 𝜌^{-2} log(1/(πœ–πœπœŒπ›Ώ))) and 𝑧 = (1/𝐡) Ξ£_{π‘–βˆˆ[𝐡]} 𝑀_𝑖, 𝑧' = (1/𝐡) Ξ£_{π‘–βˆˆ[𝐡]} 𝑀'_𝑖 are their averages. Define π‘˜ = Θ(𝜏^{-2} log(1/(πœ–πœπœŒπ›Ώ))) and 𝛽 = Θ(𝜏/log(1/(πœ–πœπœŒπ›Ώ))). If 𝐴 ∈ R^{π‘˜Γ—π‘‘} is a JL-matrix (cf. Appendix B.3) and 𝑏 = AKround(𝐴𝑧, 𝛽), 𝑏' = AKround(𝐴𝑧', 𝛽) (cf. Section 2), then 𝑏 = 𝑏' with probability at least 1 - 𝜌 over the draw of the 𝑀_𝑖, 𝑀'_𝑖's, 𝐴, and AKround.

Proof of Lemma 3.2. An application of the vector Bernstein concentration inequality (cf. Lemma B.2) yields the following: let 𝑆_1 = {𝑀_𝑖^{(1)}}_{π‘–βˆˆ[𝐡]} and 𝑆_2 = {𝑀_𝑖^{(2)}}_{π‘–βˆˆ[𝐡]} be two sets of independent and identically distributed random vectors in R^𝑑 with ‖𝑀_𝑖^{(𝑗)}β€– ≀ 1 for all 𝑖 ∈ [𝐡], 𝑗 ∈ {1, 2}. Then we can set 𝑋_𝑖 = 𝑀_𝑖^{(1)} - E[𝑀]. Notice that ‖𝑋_𝑖‖ ≀ 2 and E[‖𝑋_𝑖‖^2] ≀ 4. Hence, an application of Lemma B.2 yields, for any 𝑑 ∈ (0, 2), Pr[β€–(1/𝐡) Ξ£_{π‘–βˆˆ[𝐡]} 𝑀_𝑖^{(1)} - E[𝑀]β€– β‰₯ 𝑑] ≀ exp(-Ξ©(𝑑^2 𝐡)). Write 𝑧 = (1/𝐡) Ξ£_{π‘–βˆˆ[𝐡]} 𝑀_𝑖^{(1)} and 𝑧' = (1/𝐡) Ξ£_{π‘–βˆˆ[𝐡]} 𝑀_𝑖^{(2)}. Choosing 𝐡 β‰₯ Ξ©(𝑑^{-2} log(1/𝜌)), applying the above bound to each of 𝑧, 𝑧', and using the triangle inequality ensures that Pr_{𝑆_1,𝑆_2}[‖𝑧 - 𝑧'β€– β‰₯ 𝑑] ≀ 𝜌/10.

We condition on the event that ‖𝑧 - 𝑧'β€– ≀ 𝑑. Since we are sharing the randomness across the two executions, we use the same JL projection matrix 𝐴. Thus our choice of π‘˜ β‰₯ Ξ©(𝜏^{-2} log(1/𝜌)) gives that ‖𝐴𝑧 - 𝐴𝑧'β€– ≀ (1 + 𝜏) 𝑑 ≀ 2𝑑 with probability at least 1 - 𝜌/10 (cf. Lemma B.4). It remains to show that after the rounding step, with probability at least 1 - 𝜌/10, the two rounded vectors will be the same. The size of our rounding grid is 𝛽 = Θ(𝜏) and the target dimension of our JL-matrix is π‘˜ = Θ(𝜏^{-2}), up to logarithmic factors. Thus by Lemma 2.1, the probability that the two points round to different vectors is at most Θ(𝛽^{-1}) ‖𝐴𝑧 - 𝐴𝑧'β€–_1 ≀ Θ(𝛽^{-1} βˆšπ‘˜) ‖𝐴𝑧 - 𝐴𝑧'β€–_2 = Θ(𝑑 βˆšπ‘˜/𝛽). Thus, if we pick

𝐡 β‰₯ Ξ©(π‘˜/(𝛽^2 𝜌^2)) = Ξ©(log(1/(πœ–πœπœŒπ›Ώ))/(𝜏^4 𝜌^2)), (D.1)

we complete the argument.

E. Details for Algorithm 2

Here we fill in the missing proofs from Section 4. We now prove Lemma 4.2, whose statement we repeat below for convenience.

Lemma 4.2. Let 𝐡_1^𝑑 be the unit β„“_2-ball in 𝑑 dimensions. Fix πœ–, 𝜏, 𝛿 ∈ (0, 1) and let π’Ÿ be a distribution over 𝐡_1^𝑑 Γ— {-1, 1} that admits a linear 𝜏-margin classifier. There is an algorithm boostSGD(π’Ÿ, πœ–, 𝜏, 𝛿) that outputs a unit vector 𝑀̂ ∈ R^𝑑 such that 𝑓_π’Ÿ(𝑀̂) ≀ min_{π‘€βˆˆπ΅_1^𝑑} 𝑓_π’Ÿ(𝑀) + πœ– with probability at least 1 - 𝛿. Moreover, boostSGD has sample complexity 𝑂(πœ–^{-2} 𝜏^{-2} log(1/𝛿)) and running time poly(1/πœ–, 1/𝜏, log(1/𝛿), 𝑑).

Proof of Lemma 4.2. Let 𝑇 = Ξ©(𝜏^{-2} πœ–^{-2}). Run SGD(π’Ÿ, 𝑇) for 𝑛 = 𝑂(log(1/𝛿)) times to obtain solutions 𝑀_1, . . . , 𝑀_𝑛. Theorem 4.1 ensures that we attain an πœ–/10-optimal solution in expectation for each 𝑀_𝑖, 𝑖 ∈ [𝑛]. Markov's inequality then guarantees that we attain an πœ–/5-optimal solution with probability at least 1/2 in each repetition. But then with probability at least 1 - 𝛿/10, at least one of the 𝑛 solutions is πœ–/5-optimal. By a Hoeffding bound, we can estimate each 𝑓_π’Ÿ(𝑀_𝑖) ∈ [0, 𝑂(1/𝜏)] up to an additive πœ–/10 error with probability at least 1 - 𝛿/10 using 𝑂(𝑛 πœ–^{-2} 𝜏^{-2} log(𝑛/𝛿)) samples. Outputting the classifier with the lowest estimated objective yields a 3πœ–/10-optimal solution with probability at least 1 - 𝛿/5. Normalizing the selected classifier to a unit vector can only decrease the surrogate-loss term of the objective, since the feasible region is 𝐡_1^𝑑, and incurs an additional loss of at most πœ–/10 due to the regularizer. Thus we have a 4πœ–/10-optimal solution with probability at least 1 - 𝛿/5. The total sample complexity is 𝑂(𝑛𝑇) + 𝑂(𝑛 πœ–^{-2} 𝜏^{-2} log(𝑛/𝛿)) = 𝑂(πœ–^{-2} 𝜏^{-2} log(1/𝛿)).

E.1. Proof of Theorem 1.4

Correctness of Algorithm 2. From the choice of 𝑛 β‰₯ Ξ©(πœ–^{-2} 𝜏^{-2} log(𝐡/𝛿)), Lemma 4.2 ensures that each unit vector 𝑀_𝑖 produced in Algorithm 2 is πœ–/10-optimal with probability at least 1 - 𝛿/(10𝐡). Hence with probability at least 1 - 𝛿/10, every 𝑀_𝑖 is πœ–/10-optimal with respect to 𝑓_π’Ÿ. By Jensen's inequality, 𝑧 = (1/𝐡) Ξ£_{𝑖=1}^{𝐡} 𝑀_𝑖 is πœ–/10-optimal with probability at least 1 - 𝛿/10. But then, by the choice of 𝑓_π’Ÿ as a convex surrogate loss for 1{𝑦 Β· (π‘₯βŠ€π‘€) < 𝜏/2}, 𝑧 satisfies Pr_{(π‘₯,𝑦)βˆΌπ’Ÿ}[𝑦 Β· (π‘§βŠ€π‘₯/β€–π‘₯β€–) < 𝜏/2] ≀ 2πœ–/10 with probability at least 1 - 𝛿/10. In other words, the unit vector 𝑧 has a population 𝜏/2-loss of at most 2πœ–/10 with probability at least 1 - 𝛿/10. But then, similarly to the correctness argument for Algorithm 1, an application of Lemma 3.1 concludes the proof of correctness.

Replicability of Algorithm 2. Similarly to the replicability argument for Algorithm 1, we note that the outputs 𝑀_𝑖 of the executions of boostSGD are i.i.d. unit vectors. Thus an application of Lemma 3.2 yields the replicability guarantee of Algorithm 2.

Sample Complexity & Running Time of Algorithm 2. The sample complexity is 𝑛𝐡 = 𝑂(πœ–^{-2} 𝜏^{-6} 𝜌^{-2} log(1/𝛿)) as required. Once again, we see by inspection that Algorithm 2 terminates in time poly(𝑑, 1/πœ–, 1/𝜏, 1/𝜌, log(1/𝛿)).
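For concreteness, the confidence-boosting wrapper used in the proof of Lemma 4.2 can be sketched as follows. The names sgd_run, loss_fn, and sample_oracle, as well as every constant, are placeholders of ours rather than the paper's pseudocode; the sketch only illustrates the run-many-times-and-validate pattern of the proof.

```python
import numpy as np

def boost_sgd(sample_oracle, sgd_run, loss_fn, eps, delta, tau):
    """Sketch of the confidence-boosting wrapper in Lemma 4.2 (illustrative only).
    Run SGD O(log(1/delta)) times, estimate each candidate's surrogate loss on a
    fresh holdout set, and return the normalized candidate with the smallest estimate."""
    n_runs = max(1, int(np.ceil(10 * np.log(1.0 / delta))))
    T = int(np.ceil(1.0 / (tau ** 2 * eps ** 2)))                       # SGD steps per run
    m = int(np.ceil(np.log(n_runs / delta) / (eps ** 2 * tau ** 2)))    # holdout size

    candidates = [sgd_run(sample_oracle, T) for _ in range(n_runs)]     # n_runs SGD solutions
    holdout = [sample_oracle() for _ in range(m)]                       # fresh validation samples

    def estimate(w):
        # empirical surrogate loss of w on the holdout set
        return np.mean([loss_fn(w, x, y) for (x, y) in holdout])

    best = min(candidates, key=estimate)
    return best / max(np.linalg.norm(best), 1e-12)                      # project back to the unit sphere
```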
F. Details for Proposition 1.5

F.1. Differential Privacy

For π‘Ž, 𝑏, 𝛼, 𝛽 ∈ [0, 1], let π‘Ž β‰ˆ_{𝛼,𝛽} 𝑏 denote the statement π‘Ž ≀ 𝑒^𝛼 𝑏 + 𝛽 and 𝑏 ≀ 𝑒^𝛼 π‘Ž + 𝛽. We say that two probability distributions 𝑃, 𝑄 are (𝛼, 𝛽)-indistinguishable if 𝑃(𝐸) β‰ˆ_{𝛼,𝛽} 𝑄(𝐸) for any measurable event 𝐸.⁴

Definition F.1 (Approximate Differential Privacy). A learning rule 𝐴 is 𝑛-sample (𝛼, 𝛽)-differentially private if for any pair of datasets 𝑆, 𝑆' ∈ (𝒳 Γ— {0, 1})^𝑛 that differ on a single example, the induced posterior distributions 𝐴(𝑆) and 𝐴(𝑆') are (𝛼, 𝛽)-indistinguishable.

F.2. The Results of LΓͺ Nguyen et al. (2020) and Bun et al. (2023)

Proposition 1.5 is based on combining the following two results. The first is a differentially private learner for large-margin halfspaces from LΓͺ Nguyen et al. (2020).

Proposition F.2 (Theorem 6 in LΓͺ Nguyen et al. (2020)). Let 𝛼, 𝜏, πœ–, 𝛿 > 0. Let π’Ÿ be a distribution over R^𝑑 Γ— {-1, 1} that has linear margin 𝜏 as in Definition 1.2. There is an algorithm that is (𝛼, 0)-differentially private and, given π‘š = 𝑂(𝛼^{-1} πœ–^{-1} 𝜏^{-2}) i.i.d. samples (π‘₯, 𝑦) ∼ π’Ÿ, computes in time exp(1/𝜏^2) Β· poly(𝑑, 1/𝛼, 1/πœ–, log(1/𝛿)) a normal vector 𝑀 ∈ R^𝑑 such that Pr_{(π‘₯,𝑦)βˆΌπ’Ÿ}[sgn(π‘€βŠ€π‘₯) β‰  𝑦] ≀ πœ–, with probability at least 1 - 𝛿.

We also use the DP-to-Replicability reduction appearing in Bun et al. (2023).

Proposition F.3 (Corollary 3.18 in Bun et al. (2023)). Fix 𝑛 ∈ N, sufficiently small 𝜌 ∈ (0, 1), πœ–, 𝛿 ∈ (0, 1), and 𝛼, 𝛽 > 0. Let π’œ : 𝒳^𝑛 β†’ 𝒴 be an 𝑛-sample (𝛼, 𝛽)-differentially private algorithm with finite output space solving a statistical task with accuracy πœ– and failure probability 𝛿. Then there is an algorithm π’œ' : 𝒳^π‘š β†’ 𝒴 that is 𝜌-replicable and solves the same statistical task with π‘š = 𝑂(𝜌^{-2} 𝑛^2) samples with accuracy πœ– and failure probability 𝛿.

⁴We use the notation (𝛼, 𝛽)-DP instead of the more common (πœ–, 𝛿)-DP to be consistent with the notation of the rest of the paper regarding the accuracy and probability of failure of the learning algorithms.
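To make the (𝛼, 𝛽)-indistinguishability condition above concrete, the following small helper (our illustration, not part of the reduction) brute-forces the condition for two distributions on a tiny finite output space; it is only meant to unpack Definition F.1, since real posteriors are not enumerated this way.

```python
import numpy as np
from itertools import combinations

def indistinguishable(p, q, alpha, beta):
    """Brute-force check of (alpha, beta)-indistinguishability for two probability
    vectors p, q over the same tiny finite outcome space. Illustrative only: it
    enumerates every event E and tests P(E) <= e^alpha * Q(E) + beta and vice versa."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    n = len(p)
    for r in range(n + 1):
        for event in combinations(range(n), r):
            idx = list(event)
            pe, qe = p[idx].sum(), q[idx].sum()
            if pe > np.exp(alpha) * qe + beta or qe > np.exp(alpha) * pe + beta:
                return False
    return True

# Two nearby posteriors of a learning rule on adjacent datasets (toy numbers):
print(indistinguishable([0.5, 0.3, 0.2], [0.45, 0.35, 0.2], alpha=0.2, beta=0.01))
```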