The Space Complexity of Approximating Logistic Loss

Gregory Dexter, LinkedIn Corporation, gdexter@linkedin.com
Petros Drineas, Department of Computer Science, Purdue University, pdrineas@purdue.edu
Rajiv Khanna, Department of Computer Science, Purdue University, rajivak@purdue.edu

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Abstract

We provide space complexity lower bounds for data structures that approximate logistic loss up to $\epsilon$-relative error on a logistic regression problem with data $X \in \mathbb{R}^{n \times d}$ and labels $y \in \{-1, 1\}^n$. The space complexity of existing coreset constructions depends on a natural complexity measure $\mu_y(X)$, first defined in [10]. We give an $\Omega(\frac{d}{\epsilon^2})$ space complexity lower bound in the regime $\mu_y(X) = O(1)$ that shows existing coresets are optimal in this regime up to lower order factors. We also prove a general $\Omega(d \cdot \mu_y(X))$ space lower bound when $\epsilon$ is constant, showing that the dependency on $\mu_y(X)$ is not an artifact of mergeable coresets. Finally, we refute a prior conjecture that $\mu_y(X)$ is hard to compute by providing an efficient linear programming formulation, and we empirically compare our algorithm to prior approximate methods.

1 Introduction

Logistic regression is an indispensable tool in data analysis, dating back to at least the early 19th century. It was originally used to make predictions in social science and chemistry applications [14, 15, 3], but over the past 200 years it has been applied to all data-driven scientific domains, from economics and the social sciences to physics, medicine, and biology. At a high level, the (binary) logistic regression model is a statistical abstraction that models the probability of one of two alternatives or classes by expressing the log-odds (the logarithm of the odds) for the class as a linear combination of one or more predictor variables. Formally, logistic regression aims to find a parameter vector $\beta \in \mathbb{R}^d$ that minimizes the logistic loss $L(\beta)$, defined as

$$L(\beta) = \sum_{i=1}^{n} \log\left(1 + e^{-y_i x_i^T \beta}\right), \qquad (1)$$

where $X \in \mathbb{R}^{n \times d}$ is the data matrix ($n$ points in $\mathbb{R}^d$, with $x_i^T$ being the rows of $X$) and $y \in \{-1, 1\}^n$ is the vector of labels whose entries are the $y_i$. Due to the central importance of logistic regression, algorithms and methods to improve the efficiency of minimizing the logistic loss are always of interest [7, 10, 9].

The prior study of linear regression, a much simpler problem that admits a closed-form solution, has provided ample guidance on how we may expect to improve the efficiency of logistic regression. Let us first consider how a data structure that approximates the $\ell_2$-regression loss may be leveraged to design efficient algorithms for linear regression. Let $D(\cdot): \mathbb{R}^d \rightarrow \mathbb{R}$ be a data structure such that
$$(1 - \epsilon)\|Ax - b\|_2 \leq D(x) \leq (1 + \epsilon)\|Ax - b\|_2.$$
Then, for, say, any $\epsilon \in (0, 1/3)$ and $\hat{x} = \mathrm{argmin}_{x \in \mathbb{R}^d} D(x)$,
$$\|A\hat{x} - b\|_2 \leq \frac{1 + \epsilon}{1 - \epsilon} \min_{x \in \mathbb{R}^d} \|Ax - b\|_2 \leq (1 + 3\epsilon) \min_{x \in \mathbb{R}^d} \|Ax - b\|_2.$$
That is, a data structure that approximates the $\ell_2$-regression loss up to $\epsilon$-relative error may be used to solve the original regression problem up to $3\epsilon$-relative error. This is particularly interesting when $D(\cdot)$ has lower space complexity than the original problem or can be minimized more efficiently. Practically efficient data structures satisfying these criteria for linear regression have been instantiated through matrix sketching and leverage score sampling [16].
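As a concrete illustration of the data structure $D(\cdot)$ discussed above, the following minimal sketch (our own illustration, not code from the paper) instantiates $D(x) = \|S(Ax - b)\|_2$ with a dense Gaussian sketching matrix $S$ and compares the sketched solution to the exact least-squares solution; the sizes and the choice of a Gaussian sketch are assumptions made for the example only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, s = 10_000, 20, 400          # s = number of rows of the sketch S

A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Dense Gaussian sketching matrix; practical methods use faster sketches
# or leverage score sampling (see [16]).
S = rng.standard_normal((s, n)) / np.sqrt(s)

x_exact, *_ = np.linalg.lstsq(A, b, rcond=None)          # argmin ||Ax - b||_2
x_sketch, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)  # argmin ||S(Ax - b)||_2

def residual(x):
    return np.linalg.norm(A @ x - b)

# Close to 1 for moderate sketch sizes, mirroring the (1 + 3*eps) guarantee.
print(residual(x_sketch) / residual(x_exact))
```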
There is extensive work exploring constructions of a matrix $S \in \mathbb{R}^{s \times n}$ such that, given a data matrix $A \in \mathbb{R}^{n \times d}$ and a vector of labels $b \in \mathbb{R}^n$, we may solve the lower dimensional problem $\tilde{x} = \mathrm{argmin}_{x \in \mathbb{R}^d} \|SAx - Sb\|_2$ to achieve the guarantee $\|A\tilde{x} - b\|_2 \leq \frac{1+\epsilon}{1-\epsilon} \min_{x \in \mathbb{R}^d} \|Ax - b\|_2$ for a chosen $\epsilon > 0$. Under conditions that guarantee $s \ll n$, we can achieve significant savings in computational time and space by following such an approach. An important class of matrices $S \in \mathbb{R}^{s \times n}$ that guarantee the above approximation are the so-called $\ell_2$-subspace embeddings, which satisfy
$$(1 - \epsilon)\|Ax - b\|_2 \leq \|S(Ax - b)\|_2 \leq (1 + \epsilon)\|Ax - b\|_2 \quad \text{for all } x \in \mathbb{R}^d.$$

Despite the central importance of logistic regression in statistics and machine learning, relatively little is known about how the method behaves when matrix sketching and sampling are applied to its input. Munteanu et al. [10, 11] initiated the study of coresets for logistic regression, while Munteanu and Omlor [9] provide the current state-of-the-art bounds on the size of a coreset for logistic regression. Analogously to linear regression, these works present an efficient data structure $\tilde{L}(\cdot)$ such that
$$(1 - \epsilon)L(\beta) \leq \tilde{L}(\beta) \leq (1 + \epsilon)L(\beta). \qquad (2)$$
We call $\tilde{L}(\cdot)$ an $\epsilon$-relative error approximation to the logistic loss.

Prior work on coreset construction for logistic regression critically depends on the data complexity measure $\mu_y(X)$, which was first introduced in [10] and is defined as follows.

Definition 1 (Classification Complexity Measure [10]). For any $X \in \mathbb{R}^{n \times d}$ and $y \in \{-1, 1\}^n$, let
$$\mu_y(X) = \sup_{\beta \neq 0} \frac{\|(D_y X \beta)^+\|_1}{\|(D_y X \beta)^-\|_1},$$
where $D_y$ is a diagonal matrix with $y$ as its diagonal, and $(D_y X \beta)^+$ and $(D_y X \beta)^-$ denote the positive and the negative entries of $D_y X \beta$, respectively.

Specifically, these methods construct a coreset by storing a subset of the rows indexed by $S \subseteq \{1, \ldots, n\}$ with $|S| = O(\frac{d \cdot \mu_y(X)}{\epsilon^2})$ and generating a set of weights $\{w_i\}_{i \in S}$, where each $w_i$ is specified by $O(\log nd)$ bits [9]. The approximate logistic loss is then computed as
$$\tilde{L}(\beta) = \sum_{i \in S} w_i \log\left(1 + e^{-y_i x_i^T \beta}\right), \qquad (3)$$
which is an $\epsilon$-relative error approximation to $L(\beta)$. We see that $\mu_y(X)$ is important in determining how compressible a logistic regression problem is through coresets, and prior work has proven that this dependency is necessary for coresets [17]. Our work further shows that this dependency is fundamental to the space complexity of approximating logistic loss by any data structure.

Our work advances the understanding of data structures that approximate logistic loss in order to reduce its space and time complexity. Our results provide guidance on how existing coreset constructions may be improved, as well as on their fundamental limitations.
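The ratio in Definition 1 can be probed empirically: any single nonzero $\beta$ certifies a lower bound on $\mu_y(X)$. The following minimal sketch (our own illustration, not the algorithm of Section 3, which computes the measure exactly) evaluates the ratio over random directions; all names and sizes are our own choices.

```python
import numpy as np

def mu_ratio(X, y, beta, eps=1e-12):
    """Ratio ||(D_y X beta)^+||_1 / ||(D_y X beta)^-||_1 from Definition 1."""
    z = y * (X @ beta)                # entries of D_y X beta
    pos = np.sum(z[z > 0])
    neg = -np.sum(z[z < 0])
    return pos / max(neg, eps)        # guard against a zero denominator

def mu_lower_bound(X, y, n_trials=10_000, seed=0):
    """Crude lower bound on mu_y(X): best ratio over random directions."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    return max(mu_ratio(X, y, rng.standard_normal(d)) for _ in range(n_trials))

# Example usage on a small random dataset.
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 10))
y = np.sign(X @ rng.standard_normal(10))
print(mu_lower_bound(X, y))           # <= mu_y(X); the supremum may be much larger
```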
1.1 Our contributions

We briefly summarize our contributions in this work; see Section 1.3 for notation.

- We prove that any data structure that approximates logistic loss up to $\epsilon$-relative error must use $\Omega(\frac{d}{\epsilon^2})$ space in the worst case on a dataset with constant $\mu$-complexity (Theorem 1).
- We show that any coreset that provides an $\epsilon$-relative error approximation to logistic loss requires storing $\Omega(\frac{d}{\epsilon^2})$ rows of the original data matrix $X$ (Corollary 2). Thereby, we prove that analyses of existing coreset constructions are optimal up to logarithmic factors in the regime where $\mu_y(X) = O(1)$.
- We prove that any data structure that approximates logistic loss to relative error must take $\Omega(d \cdot \mu_y(X))$ space, thereby showing that the dependency on the $\mu$-complexity measure is not an artifact of using mergeable coresets, which prior work [17] had relied on (Theorem 3).
- We provide experiments that demonstrate how prior methods that only approximate $\mu_y(X)$ can be substantially inaccurate, despite being more complicated to implement than our method (see Section 4).
- Finally, we prove that low-rank approximations can provide a simple but weak additive error approximation to logistic loss, and that these guarantees are tight in the worst case (see Appendix D).

1.2 Related Work

Prior work has explored the space complexity of data structures that preserve $\|X\beta\|_p$ for $\beta \in \mathbb{R}^d$, particularly in the important case where $p = 2$. Lower bounds for this problem are analogous to our work and motivate our inquiry. An early example of such work is [12], which lower bounds the minimum dimension of an oblivious subspace embedding, a particular type of data structure construction that preserves $\|X\beta\|_2$. A more recent example in this line of work is [6], which provides space complexity lower bounds for general data structures that preserve $\|X\beta\|_p$ for general $p \in \mathbb{N}$.

Recent work on the development of coresets for logistic regression motivates our focus on this problem. This line of work was initiated by Munteanu et al. [10]. Mai et al. [7] used Lewis weight sampling to achieve a coreset of size $O(d\,\mu_y(X)^2\,\epsilon^{-2})$. The work of Woodruff and Yasuda [17] later provided an online coreset construction containing $O(d\,\mu_y(X)^2\,\epsilon^{-2})$ points, as well as a coreset construction using $O(d^2\,\mu_y(X)\,\epsilon^{-2})$ points. Finally, Munteanu and Omlor [9] recently proved an $O(d\,\mu_y(X)\,\epsilon^{-2})$ size coreset construction. Our work is complementary to the above works, since it highlights the limits of possible compression of the logistic regression problem while maintaining approximation guarantees with respect to the original problem.

1.3 Notation

We assume, without loss of generality, that $y_i = -1$ for all $i = 1, \ldots, n$. Any logistic regression problem with $(X, y)$ defined above can be transformed to this standard form by multiplying both $X$ and $y$ by the matrix $-D_y$, where $D_y \in \mathbb{R}^{n \times n}$ is a diagonal matrix with $i$-th diagonal entry $y_i$. The logistic loss of the original problem is equal to that of the transformed problem for all $\beta \in \mathbb{R}^d$. Let $M_i$ denote the $i$-th row of a matrix $M$. We denote by $X$ the matrix formed by stacking the $x_i^T$ as rows. We will use $\tilde{O}(\cdot)$ and $\tilde{\Omega}(\cdot)$ to suppress logarithmic factors of $d$, $n$, $1/\epsilon$, and $\mu_y(X)$. Finally, let $[n] = \{1, 2, \ldots, n\}$.

2 Space complexity lower bounds

In this section, we provide space complexity lower bounds for a data structure $\tilde{L}(\cdot)$ that satisfies the relative error guarantee in eqn. (2). We use the notation specified in Section 1.3. Additionally, we require throughout this section that the entries of $X$ can be specified in $O(\log nd)$ bits. Our first main result is a general lower bound on the space complexity of any data structure which approximates logistic loss to $\epsilon$-relative error for every parameter vector $\beta \in \mathbb{R}^d$ on a data set whose complexity measure $\mu_y(X)$ is bounded by a constant. As a corollary to this result, we show that existing coreset constructions are optimal up to lower order terms in the regime where $\mu_y(X) = O(1)$. Our second main result shows that any data structure providing an $\epsilon_0$-constant factor approximation to the logistic loss requires $\Omega(\mu_y(X))$ space, where $\epsilon_0 > 0$ is constant. Both of these results proceed by reduction to the INDEX problem [6, 8] (see Section A.2), where we use the fact that an approximation to the logistic loss can approximate ReLU loss under appropriate conditions. Consider the ReLU loss:
$$R(\beta; X) = \sum_{i=1}^{n} \max\{x_i^T\beta, 0\}. \qquad (4)$$
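The reduction used below exploits the elementary fact that $\frac{1}{t}\log(1 + e^{tr}) \to \max\{r, 0\}$ as $t \to \infty$. The following minimal sketch (our own numerical illustration of the scaling trick that Lemma 2.1 below formalizes, written in the $y_i = -1$ standard form) compares the rescaled logistic loss $\frac{1}{t}L(t\beta)$ to the ReLU loss $R(\beta)$; the data and sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))
beta = rng.standard_normal(5)

def relu_loss(X, beta):
    return np.sum(np.maximum(X @ beta, 0.0))

def logistic_loss(X, beta):
    # Per-point loss log(1 + exp(x_i^T beta)), i.e., the y_i = -1 standard form;
    # logaddexp avoids overflow for large margins.
    return np.sum(np.logaddexp(0.0, X @ beta))

R = relu_loss(X, beta)
for t in [1, 10, 100, 1000]:
    approx = logistic_loss(X, t * beta) / t
    print(t, R, approx)   # the rescaled logistic loss approaches R as t grows
```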
Our lower bound reductions hinge on the fact that a relative error approximation to logistic loss can be used to simulate a relative error approximation to ReLU loss under appropriate conditions. We capture this in the following lemma. We include all proofs omitted from the main text in Appendix B.

Lemma 2.1. Given a data set $X \in \mathbb{R}^{n\times d}$ and $B \subseteq \mathbb{R}^d$ such that $\inf_{\beta \in B} R(\beta; X) > 1$, if there exists a data structure $\tilde{L}(\cdot)$ that satisfies
$$(1 - \epsilon)L(\beta) \leq \tilde{L}(\beta) \leq (1 + \epsilon)L(\beta) \quad \text{for all } \beta \in B,$$
then there exists a data structure $\tilde{R}(\cdot)$, taking $O(1)$ extra space, such that
$$(1 - 3\epsilon)R(\beta) \leq \tilde{R}(\beta) \leq (1 + 3\epsilon)R(\beta) \quad \text{for all } \beta \in B.$$

2.1 Lower bounds for the $\mu_y(X) = \Theta(1)$ regime

We lower bound the space complexity of any data structure that approximates logistic loss to $\epsilon$-relative error. Recall that the running time of the computation compressing the data to a small number of bits and evaluating $\tilde{L}(\beta)$ for a given query $\beta$ is unbounded in this model. Hence, Theorem 1 provides a strong lower bound on the space needed for any compression of the data that can be used to compute an $\epsilon$-relative error approximation to the logistic loss, including, but not limited to, coresets. At a high level, our proof operates by showing that a relative error approximation to logistic loss can be used to obtain a relative error approximation to ReLU regression, which in turn can be used to construct a relative error $\ell_1$-subspace embedding. Previously, Li et al. [6] lower bounded the worst case space complexity of any data structure that maintains an $\ell_1$-subspace embedding by reducing the problem to the well-known INDEX problem in communication complexity. Notably, the construction of $X$ has complexity measure bounded by a constant, i.e., $\mu_y(X) = O(1)$.

Lemma 2.2. There exists a matrix $X \in \mathbb{R}^{n\times d}$ such that $\mu_y(X) \leq 4$ and any data structure that, with at least 2/3 probability, approximates $R(\beta; X)$ up to $\epsilon$-relative error requires $\Omega(\frac{d}{\epsilon^2})$ space, provided that $d = \Omega(\log 1/\epsilon)$, $n = \Omega(d/\epsilon^2)$, and $0 < \epsilon \leq \epsilon_0$ for some small constant $\epsilon_0$. Furthermore, $R(\beta; X) \geq 3\|\beta\|_1$ for all $\beta \in \mathbb{R}^d$.

Using Lemma 2.1, we can extend this lower bound on the space complexity of approximating ReLU loss to logistic loss.

Theorem 1. Any data structure $\tilde{L}(\cdot)$ that, with at least 2/3 probability, is an $\epsilon$-relative error approximation to logistic loss for some input $(X, y)$ requires $\Omega(\frac{d}{\epsilon^2})$ space in the worst case, assuming that $d = \Omega(\log 1/\epsilon)$, $n = \Omega(d\epsilon^{-2})$, and $0 < \epsilon \leq \epsilon_0$ for some sufficiently small constant $\epsilon_0$.

Proof. By Lemma 2.2, there exists a matrix $X$ such that any data structure that approximates the ReLU loss up to $\epsilon$-relative error requires the stated space complexity. Let $B = \{\beta \in \mathbb{R}^d \mid \|\beta\|_1 = 1\}$. Then, by Lemma 2.2, $\inf_{\beta \in B} R(\beta; X) \geq 3$. Therefore, by Lemma 2.1, any $(1 + \epsilon)$-factor approximation to the logistic loss for the matrix $X$ provides a $(1 + 3\epsilon)$-factor approximation to $R(\beta; X)$ for $\beta \in B$. Since $R(\beta) = \|\beta\|_1 \cdot R(\beta / \|\beta\|_1)$, we can extend this guarantee to all $\beta \in \mathbb{R}^d$. By Lemma 2.2, any data structure that provides such a guarantee requires the stated space complexity, and hence $\tilde{L}(\cdot)$ requires the stated space.

From the above theorem, we can derive a lower bound on the number of rows needed by a coreset that achieves an $\epsilon$-relative error approximation to the logistic loss (see eqn. (3) for the specification of a coreset).

Corollary 2. Any coreset construction that, with at least 2/3 probability, satisfies the relative error guarantee in eqn. (2) must store $\Omega(\frac{d}{\epsilon^2})$ rows of some input matrix $X$, where $\mu_y(X) = O(1)$.

Proof.
Using the previous theorem, there exists a matrix $X \in \mathbb{R}^{n \times O(\log n)}$ such that, assuming that $n = \Omega(\epsilon^{-2})$ and $\epsilon$ is sufficiently small, any data structure that approximates the logistic loss up to relative error on $X$ must use $\tilde{\Omega}(1/\epsilon^2)$ bits in the worst case. (Recall that $\tilde{\Omega}(\cdot)$ suppresses $\log n$ factors.) If the data structure stores entire rows of $X$ while storing a total of $\tilde{\Omega}(\frac{1}{\epsilon^2})$ bits, then it must store at least $\tilde{\Omega}(\frac{1}{\epsilon^2})$ rows of $X$ in total, since each row can be specified with polylogarithmically many bits.

Recall that the proof of Theorem 1 proceeds by showing that a relative error approximation to the logistic loss can be used to solve the INDEX problem. If we have $d$ independent instances of the matrix $X$, i.e., $X^{(1)}, X^{(2)}, \ldots, X^{(d)}$, then we may create the block-diagonal matrix
$$Y = \begin{bmatrix} X^{(1)} & 0 & \cdots & 0 \\ 0 & X^{(2)} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & X^{(d)} \end{bmatrix}.$$
Note that any relative error approximation to the logistic loss on $Y$ would allow a relative error approximation to the logistic loss on each of the individual $X^{(i)}$, $i = 1, \ldots, d$, matrices, thus allowing one to solve $d$ instances of the INDEX problem simultaneously. This implies that we can query any bit in each of the $d$ INDEX problems, which solves an INDEX problem of size $\Omega(d/\epsilon^2)$. If the data structure is restricted to store entire rows of $X^{(i)}$, then recall that we must store $\tilde{\Omega}(1/\epsilon^2)$ rows of each $X^{(i)}$. Therefore, we conclude that any coreset that achieves a relative error approximation to the logistic loss on $Y$ with at least 2/3 probability must store $\Omega(d/\epsilon^2)$ rows of $Y$.

The work of Munteanu and Omlor [9] showed that sampling $O(\frac{d\,\mu_y(X)}{\epsilon^2})$ rows of $X$ yields an $\epsilon$-relative error coreset for logistic loss with high probability. Hence, Corollary 2 implies that the coreset construction of Mai et al. [7] is of optimal size in the regime where $\mu_y(X) = O(1)$. However, Theorem 1 only guarantees that coresets are optimal up to a $d$ factor in terms of bit complexity. An interesting future direction would be closing this gap, either by proving that coresets have optimal bit complexity or by showing that other approaches, like matrix sparsification, could achieve even greater space savings.

2.2 An $\Omega(\mu_y(X) \cdot d)$ lower bound

In this section, we provide a space complexity lower bound for a data structure achieving a constant $\epsilon_0$-relative error approximation to logistic loss on an input $X$ with variable $\mu_y(X)$. We again assume $y_i = -1$ for all $i \in [n]$ without loss of generality. The proof depends on the existence of a matrix $M \in \{-1, 1\}^{n \times \log^4 n}$ that has nearly orthogonal rows. From $M$, we can construct a matrix $X$ such that $\mu_y(X) = O(n)$ and a 2-factor approximation to ReLU loss on $X$ can solve the size-$n$ INDEX problem. By Lemma 2.1 and the space complexity lower bound for solving the INDEX problem, we prove the described lower bound for approximating logistic loss.

We begin by proving the existence of the matrix $M$. Recall that for any matrix $M$, we use $M_i$ to denote the $i$-th row of $M$.

Lemma 2.3. Let $n = 2^p$ for some $p \in \mathbb{N}$. There exists a matrix $M \in \{-1, 1\}^{n\times k}$ such that $k = \log^4 n$ and $|\langle M_i, M_j\rangle| \leq 4\log^2 n$ for all $i \neq j$.

We now use the previous lemma to construct a matrix $X$ such that a 2-factor approximation to ReLU loss on $X$ requires $\Omega(d \cdot \mu_y(X))$ space.

Lemma 2.4. Let $n = 2^p$ for some $p \in \mathbb{N}$ and assume that $\log^4(n/2) > 16\log^2 n$. Then, there exists a matrix $X \in \mathbb{R}^{n\times k}$ with $k = O(\log^4 n)$ and $\mu_y(X) = O(n)$ such that any data structure $\tilde{R}(\cdot)$ that, with at least 2/3 probability, satisfies (for a fixed $\beta \in \mathbb{R}^d$)
$$R(\beta) \leq \tilde{R}(\beta) \leq 2R(\beta)$$
requires at least $\Omega(n)$ bits of space.

Proof. Our proof will proceed by reduction to the INDEX problem. Let $y_i \in \{0, 1\}$, for all $i = 1, \ldots, n/2$, represent an arbitrary sequence of $n/2$ bits.
We will show how to encode the state of the $n/2$ bits in a relative error approximation to ReLU loss on some data set $X$. First, let us start with the matrix $M \in \{-1, 1\}^{n/2\times k}$ specified in Lemma 2.3, where $k = \log^4(n/2)$. Let $\tilde{M} \in \mathbb{R}^{n/2\times k}$ be such that $\tilde{M}_i = M_i$ if $y_i = 1$ and $\tilde{M}_i = \frac{1}{2}M_i$ otherwise. In words, we multiply the $i$-th row of $M$ by one if $y_i = 1$ and by $1/2$ if $y_i = 0$. Next, let us construct the matrix $X \in \mathbb{R}^{n\times(k+1)}$:
$$X = \begin{bmatrix} \tilde{M} & \mathbf{1} \\ -\mu\tilde{M} & -\mu\mathbf{1} \end{bmatrix}, \qquad (5)$$
where $\mu > 0$ will be specified later. Suppose we want to query $y_i$. Let $\beta = [\tilde{M}_i, -4\log^2 n]^T$. We will show that $X_j\beta < 8\mu\log^2 n$ for all $j \neq i$ by considering three cases:

Case 1 ($j \leq n/2$; $j \neq i$): In this case, $X_j\beta = \langle \tilde{M}_j, \tilde{M}_i\rangle - 4\log^2 n$. By Lemma 2.3, $\langle \tilde{M}_j, \tilde{M}_i\rangle \leq |\langle M_j, M_i\rangle| \leq 4\log^2(n/2)$, hence we can conclude this case.

Case 2 ($j > n/2$; $j \neq n/2 + i$): Here, $X_j\beta = -\mu\langle \tilde{M}_{j-n/2}, \tilde{M}_i\rangle + 4\mu\log^2 n$. Since $|\langle \tilde{M}_{j-n/2}, \tilde{M}_i\rangle| \leq |\langle M_{j-n/2}, M_i\rangle| < 4\log^2 n$, we have $X_j\beta < 8\mu\log^2 n$, so we conclude the case.

Case 3 ($j = n/2 + i$): In this case, $X_j\beta = -\mu\langle \tilde{M}_i, \tilde{M}_i\rangle + 4\mu\log^2 n$. Since $-\mu\langle \tilde{M}_i, \tilde{M}_i\rangle$ is negative, we conclude the case.

The above cases show that $X_j\beta < 8\mu\log^2 n$ when $j \neq i$. Therefore, $R(\beta) \leq 8n\mu\log^2 n + \mathrm{ReLU}(X_i\beta)$ and $R(\beta) \geq \mathrm{ReLU}(X_i\beta)$.

We next show that the bit $y_i$ has a large effect on $R(\beta)$. If $y_i = 0$, then
$$X_i\beta = \langle \tilde{M}_i, \tilde{M}_i\rangle - 4\log^2 n = \frac{1}{4}\|M_i\|_2^2 - 4\log^2 n < \frac{1}{4}\log^4(n/2),$$
since $M_i \in \mathbb{R}^{\log^4(n/2)}$. On the other hand, if $y_i = 1$, then
$$X_i\beta = \|M_i\|_2^2 - 4\log^2 n = \log^4(n/2) - 4\log^2 n > \frac{3}{4}\log^4(n/2),$$
where we used our assumption that $\log^4(n/2) > 16\log^2 n$. Therefore, if $y_i = 0$, then $R(\beta) \leq \frac{1}{4}\log^4(n/2) + 8n\mu\log^2 n$. By setting $\mu = \frac{\log^2 n}{2^6 n}$, we find that $R(\beta) < \frac{3}{8}\log^4(n/2)$. On the other hand, if $y_i = 1$, then $R(\beta) > \frac{3}{4}\log^4(n/2)$. Therefore, a 2-factor approximation to $R(\beta)$ is able to decide whether $y_i$ equals zero or one. By reduction to the INDEX problem, this implies that any 2-factor approximation to $R(\beta)$ must take at least $\Omega(n)$ space (see Theorem 6).

Now we must just prove that $\mu_y(X) = O(n)$. We will use the following inequality [13]: for any two length-$n$ sequences of positive numbers $a_r$ and $b_r$, $r = 1, \ldots, n$,
$$\frac{\sum_{r=1}^n a_r}{\sum_{r=1}^n b_r} \leq \max_{r=1,\ldots,n} \frac{a_r}{b_r},$$
where the maximum is taken over an arbitrary fixed ordering of the sequences. Let us define these two sequences as follows for a fixed $\beta \in \mathbb{R}^d$. For $i \in \{1, \ldots, n/2\}$: if $X_i\beta > 0$, let $a_i = X_i\beta$ and $b_i = |X_{n/2+i}\beta|$; if $X_i\beta < 0$, let $a_i = X_{n/2+i}\beta$ and $b_i = |X_i\beta|$. We can disregard the case where $X_i\beta = 0$, since it does not affect the sums of the sequences. Given such sequences, we get
$$\frac{\|(X\beta)^+\|_1}{\|(X\beta)^-\|_1} = \frac{\sum_r a_r}{\sum_r b_r} \leq \max_r \frac{a_r}{b_r} \leq \frac{1}{\mu}.$$
The last inequality follows since $X_{n/2+i}\beta = -\mu X_i\beta$ for $i = 1, \ldots, n/2$. Hence, we conclude that $\mu_y(X) \leq \mu^{-1} = O(n)$.

The above lemma proves that a constant factor approximation to $R(\cdot)$ requires $\Omega(\mu_y(X))$ space. We now extend this result to logistic loss.

Theorem 3. Let $n \geq n_0$ (for some constant $n_0 \in \mathbb{N}$) be a positive integer. There exists a global constant $\epsilon_0 > 0$ and a matrix $X \in \mathbb{R}^{n\times k}$ such that any data structure $\tilde{L}(\cdot)$ that, with at least 2/3 probability, satisfies (for a fixed $\beta \in \mathbb{R}^d$)
$$(1 - \epsilon_0)L(\beta; X) \leq \tilde{L}(\beta; X) \leq (1 + \epsilon_0)L(\beta; X)$$
requires at least $\Omega(d \cdot \mu_y(X))$ bits of space.

Proof. The space complexity lower bound of Lemma 2.4 holds even if the derived data structure approximates $R(\cdot)$ only on the values of $\beta$ used to query the data structure in the proof of Lemma 2.4. Define this set as $B = \{[\tilde{M}_i, -4\log^2 n]^T \in \mathbb{R}^{k+1} \mid i = 1, \ldots, n/2\}$, where $\tilde{M}$ is used to construct $X$ in eqn. (5). Since $R(\beta) \geq X_i\beta$, we get $R(\beta) \geq \frac{1}{4}\|M_i\|_2^2 - 4\log^2 n = \frac{1}{4}\log^4(n/2) - 4\log^2 n$. Therefore, $R(\beta) \geq 1$ for all $n \geq n_0$, where $n_0$ is some constant in $\mathbb{N}$.
By Lemma 2.1, the space complexity of a data structure that achieves $(1 - \epsilon)L(\beta) \leq \tilde{L}(\beta) \leq (1 + \epsilon)L(\beta)$ for a fixed $\beta \in B$ must be at least the space complexity of a data structure achieving $(1 - 3\epsilon)R(\beta) \leq \tilde{R}(\beta) \leq (1 + 3\epsilon)R(\beta)$ for $\beta \in B$. We can now solve for $\epsilon$ by setting $(1 + 3\epsilon)/(1 - 3\epsilon) = 2$. Therefore, from Lemma 2.4, we conclude that there exists a constant $\epsilon_0 > 0$ such that any data structure providing an $\epsilon_0$-relative error approximation to the logistic loss requires at least $\Omega(n) = \Omega(\mu_y(X))$ space. Finally, applying the argument used in Corollary 2 of constructing a block-diagonal matrix $Y$, we achieve an $\Omega(d \cdot \mu_y(X))$ lower bound.

While prior work has shown that mergeable coresets must include $\Omega(d \cdot \mu_y(X))$ points to attain a constant factor guarantee for logistic loss [17], our lower bound holds for general data structures and applies even to data structures providing the weaker for-each guarantee, where the guarantee must hold for an arbitrary but fixed $\beta \in \mathbb{R}^d$ with a specified probability. The proof of the lower bound in [17] relies on constructing a matrix $A$ that encodes $n$ bits such that the $i$-th bit can be recovered by adding some points to $A$ and performing logistic regression on the new matrix. Hence, a mergeable coreset that compresses $A$ can be used to solve the INDEX problem of size $n$. Meanwhile, our proof does not require constructing a regression problem, but rather allows recovering the $i$-th bit by only observing an approximate value of the ReLU loss at a single fixed input vector for a fixed matrix $A$. In addition to an arguably simpler proof, our approach needs weaker assumptions on the data structure. Therefore, our lower bound applies in more general settings, e.g., when sparsification is applied to $X$.

3 Computing the complexity measure $\mu_y(X)$ in polynomial time

We present an efficient algorithm to compute the data complexity measure $\mu_y(X)$ of Definition 1 on real data sets, refuting an earlier conjecture that this measure is hard to compute [10]. The importance of this measure for logistic regression has been well documented in prior work, and further understanding its behavior on real-world data sets would help guide further improvements to coreset constructions for logistic regression. Prior work conjectured that $\mu_y(X)$ is hard to compute and presented a polynomial time algorithm to approximate the measure to within a $\mathrm{poly}(d)$-factor (see Theorem 3 of Munteanu et al. [10]). We refute this conjecture by showing that the complexity measure $\mu_y(X)$ can in fact be computed efficiently via linear programming, as shown in the following theorem. The specific LP formulation for computing a vector $\beta^*$ such that $\mu_y(X) = \frac{\|(D_y X\beta^*)^-\|_1}{\|(D_y X\beta^*)^+\|_1}$ is given in eqn. (9) in the Appendix.

Theorem 4. If the complexity measure $\mu_y(X)$ of Definition 1 is finite, it can be computed exactly by solving a linear program with $2n$ variables and $4n$ constraints.

We conclude the section by noting that prior experimental evaluations of coreset constructions in [10, 7] relied on estimates of $\mu_y(X)$ obtained using the method provided by Munteanu et al. [10]. We empirically compare how prior methods of estimating the complexity measure compare to our exact method in Section 4.

4 Experiments

We provide empirical evidence that the algorithm of Section 3 exactly computes the classification complexity measure $\mu_y(X)$ of Definition 1, and we compare our results with the approximate sketching-based calculation of Munteanu et al. [10]. In order to estimate $\mu_y(X)$ for a dataset using the sketching-based approach of Munteanu et al.
[10], we create several smaller sketched datasets from a given full dataset of size $n \times d$ ($n$ rows and $d$ columns). We then use a modified linear program along the lines of Munteanu et al. [10]. These new datasets are created so that they have $n' = O(d\log(d/\delta))$ rows, for various values of $\delta \in [0, 1]$, so that with probability at least $1 - \delta$, the estimated $\mu_y(X)$ will lie between a lower bound (given by $t$, the optimum value of the aforementioned linear program) and an upper bound (given by $t \cdot O(d\log d)$). In order to solve the modified linear programs, we make use of the OR-Tools library (https://developers.google.com/optimization).

Synthetic data: We create the synthetic dataset as follows. First, we construct the full data matrix $X \in \mathbb{R}^{n\times d}$ by drawing $n = 10{,}000$ samples from the $d$-dimensional Gaussian distribution $N(0, I_d)$ with $d = 100$. We generate an arbitrary $\beta \in \mathbb{R}^d$ with $\|\beta\|_2 = 1$ and generate the posterior $p(y_i \mid x_i) = 1/(1 + \exp(-\beta^T x_i + N(0, 1)))$. Finally, we create the labels $y_i$ for all $i = 1, \ldots, n$ by setting $y_i$ to one if $p(y_i \mid x_i) > 0.5$ and to $-1$ otherwise. Using the full data matrix $A = D_y X$, we create several sketched data matrices having a number of rows equal to $n' = O(d\log(d/\delta))$. We choose $\delta$ so that $n' \in \{512, 1024, 2048, 4096, 8192\}$. These values of $n'$ are chosen to be powers of two so that we can employ the Fast Cauchy Transform algorithm (FastL1Basis [1]) for sketching. The algorithm is meant to ensure that the produced sketch identifies $\ell_1$ well-conditioned bases for $A$, which is a prerequisite for using the subsequent linear program to estimate $\mu_y(X)$ (see Section C for details).

The results are presented in Figure 1a. For various values of $n'$, including $n' = n = 10{,}000$, which we deem to be the original data size, we show the exact computation of $\mu_y(X)$ on the sketched matrix using the linear program of Theorem 4. We also show the corresponding upper and lower bounds on $\mu_y(X)$ of the full data set as estimated by the well-conditioned basis hunting approximation proposed by Munteanu et al. [10]. For the lower bound, we use the optimum value of the modified linear program, as proposed in Munteanu et al. [10] and detailed in Section C. We set the upper bound by multiplying the lower bound by $d\log(d/\delta)$. Note that this upper bound is conservative, and the actual upper bound could be a constant factor higher, since the guarantees of Munteanu et al. [10] are only up to a constant factor. The presented results are an average over 20 runs of different sketches.

[Figure 1: two panels, (a) Simulated data and (b) KDDCup dataset; x-axis: number of sketched rows; curves: Exact Full, Exact Sketched, Appr Sketched Upper, Appr Sketched Lower.]

Figure 1: Simulated data results for exact computation of $\mu_y(X)$ (Theorem 4) using the full data (Exact Full) and sketched data (Exact Sketched) vs. the approximate upper (Appr Sketched Upper) and lower (Appr Sketched Lower) bounds suggested by Munteanu et al. [10] (see Section C). The results clearly show that the upper and lower bounds can be very loose compared to our exact calculation of the complexity measure $\mu_y(X)$ for each value of $n'$.

Figure 1a shows that the exact computations on sketched matrices are close to the actual $\mu_y(X)$ of the full data matrix, while the upper and lower bounds as proposed by Munteanu et al. [10] can be fairly loose.
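For concreteness, the following minimal sketch (our own reconstruction of the synthetic setup described above, not the authors' released code; the random seed and the exact noise handling are assumptions) generates a dataset of the stated form.

```python
import numpy as np

rng = np.random.default_rng(0)     # assumption: no seed is specified in the paper
n, d = 10_000, 100

# Full data matrix: n samples from N(0, I_d).
X = rng.standard_normal((n, d))

# Arbitrary unit-norm parameter vector.
beta = rng.standard_normal(d)
beta /= np.linalg.norm(beta)

# Posterior p(y_i | x_i) = 1 / (1 + exp(-beta^T x_i + noise)), noise ~ N(0, 1).
logits = X @ beta - rng.standard_normal(n)
p = 1.0 / (1.0 + np.exp(-logits))

# Label +1 when the posterior exceeds 1/2, and -1 otherwise.
y = np.where(p > 0.5, 1, -1)

A = y[:, None] * X                 # A = D_y X, the matrix fed to the sketches / LPs
print(A.shape, y.mean())
```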
Real data experiments: To test our setup on real data, we make use of the sklearn KDDCup dataset (accessible via sklearn.datasets.fetch_kddcup99()). The dataset consists of 100,654 data points and over 50 features. We only use continuous features, which reduces the feature count to 33. The dataset contains 3,377 positive data points, while the rest are negative. To create subsets of various sizes for the exact calculation, we subsample from the positives and negatives so that they are in roughly equal proportion. For larger subsamples, we retain all the positives and subsample the rest from the negative data points. Since the calculation of $\mu$ for the full dataset is intractable, as it would require solving an optimization problem with over 400k constraints, we subsample 16,384 data points and use its $\mu$ as a proxy for that of the full dataset (referred to as Exact Full). For such a large subsample, the error bars are small. We compare against sketching and exact $\mu$ calculations for smaller subsets (see Figure 1b for the results).

5 Future work

Our work shows that existing coresets are optimal up to lower order factors in the regime where $\mu_y(X) = O(1)$. A clear open direction would be proving a space complexity lower bound that holds for all valid values of $\epsilon$ and $\mu_y(X)$. Additionally, there is still a $d$ factor gap between existing upper bounds [9] and our lower bounds (Theorems 1 and 3) in the regimes where $\epsilon$ is constant and the complexity measure varies, or where the complexity measure is constant and $\epsilon$ varies, respectively. Another interesting direction would be to explore whether additional techniques can further reduce the space complexity of approximating logistic loss compared to coresets alone. While Theorem 1 shows that the size of coreset constructions is essentially optimal, it does not preclude reducing the space by a $d$ factor by using other methods. In particular, the construction of $X$ used in the proof of Theorem 1 is sparse, and so existing matrix sparsification methods would save this $d$ factor here. We also note that our first lower bound (Theorem 1) only applies to data structures providing the for-all guarantee on the logistic loss, i.e., with a given probability, the error guarantee in eqn. (2) holds for all $\beta \in \mathbb{R}^d$. It would be interesting to know whether it could be strengthened to apply in the for-each setting, as Theorem 3 does, where $\beta \in \mathbb{R}^d$ is arbitrary but fixed. Finally, it would be useful to gain a better understanding of the complexity measure $\mu_y(X)$ in real data through more comprehensive experiments using the method provided in Theorem 4. In particular, subsampling points of a data set may introduce bias when computing $\mu_y(X)$.

Acknowledgments

We thank the anonymous reviewers for useful feedback on improving the presentation of this work. Petros Drineas and Gregory Dexter were partially supported by NSF AF 1814041, NSF FRG 1760353, and DOE-SC0022085. Rajiv Khanna was supported by the Central Indiana Corporate Partnership AnalytiXIN Initiative.

References

[1] Kenneth L. Clarkson, Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, Xiangrui Meng, and David P. Woodruff. The Fast Cauchy Transform and faster robust linear regression. In Proceedings of the 24th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 466–477, 2013.

[2] K.L. Clarkson, P. Drineas, M. Magdon-Ismail, M.W. Mahoney, X. Meng, and D.P. Woodruff. The fast Cauchy transform and faster robust linear regression. SIAM Journal on Computing, 45(3), 2016.

[3] J.S. Cramer. The Origins of Logistic Regression.
SSRN Electronic Journal, 2003.

[4] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963. URL: http://www.jstor.org/stable/2282952.

[5] Kishore Kumar and Jan Schneider. Literature survey on low rank approximation of matrices. Linear and Multilinear Algebra, 65(11):2212–2244, 2017.

[6] Yi Li, Ruosong Wang, and David P. Woodruff. Tight bounds for the subspace sketch problem with applications. SIAM Journal on Computing, 50(4):1287–1335, 2021.

[7] Tung Mai, Cameron Musco, and Anup Rao. Coresets for classification simplified and strengthened. Advances in Neural Information Processing Systems, 34, 2021.

[8] Peter Bro Miltersen, Noam Nisan, Shmuel Safra, and Avi Wigderson. On data structures and asymmetric communication complexity. In Proceedings of the Twenty-Seventh Annual ACM Symposium on Theory of Computing, pages 103–111, 1995.

[9] Alexander Munteanu and Simon Omlor. Optimal bounds for $\ell_p$ sensitivity sampling via $\ell_2$ augmentation. arXiv preprint arXiv:2406.00328, 2024.

[10] Alexander Munteanu, Chris Schwiegelshohn, Christian Sohler, and David Woodruff. On coresets for logistic regression. Advances in Neural Information Processing Systems, 31, 2018.

[11] Alexander Munteanu, Simon Omlor, and David Woodruff. Oblivious sketching for logistic regression. In Proceedings of the 38th International Conference on Machine Learning, pages 7861–7871, 2021.

[12] Jelani Nelson and Huy L. Nguyên. Lower bounds for oblivious subspace embeddings. In International Colloquium on Automata, Languages, and Programming, pages 883–894. Springer, 2014.

[13] Daniel A. Spielman. Spectral graph theory: Dan's favorite inequality, September 2018. URL: https://www.cs.yale.edu/homes/spielman/561/lect03b-18.pdf.

[14] Pierre-François Verhulst. Notice sur la loi que la population poursuit dans son accroissement. Correspondance Mathématique et Physique, 10:113–121, 1838.

[15] Pierre-François Verhulst. Recherches mathématiques sur la loi d'accroissement de la population. Nouveaux Mémoires de l'Académie Royale des Sciences et Belles-Lettres de Bruxelles, 18:14–54, 1845.

[16] David P. Woodruff. Sketching as a Tool for Numerical Linear Algebra. Foundations and Trends in Theoretical Computer Science, 10(1–2):1–157, 2014.

[17] David P. Woodruff and Taisuke Yasuda. Online Lewis weight sampling. In Proceedings of the 2023 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 4622–4666. SIAM, 2023.

A Preliminaries

In this section, we list some useful results from prior work.

A.1 Hoeffding's inequality

We use the following formulation of Hoeffding's inequality.

Theorem 5 (Theorem 2 in [4]). Let $X_1, \ldots, X_n$ be independent random variables with $m_i \leq X_i \leq M_i$ for $1 \leq i \leq n$. Then, for any $t > 0$,
$$\Pr\left(\left|\sum_{i=1}^{n} \left(X_i - \mathbb{E}[X_i]\right)\right| \geq t\right) \leq 2\exp\left(\frac{-2t^2}{\sum_{i=1}^{n}(M_i - m_i)^2}\right).$$

A.2 INDEX problem

Both Theorem 1 and Theorem 3 rely on a reduction to the randomized INDEX problem. We define the INDEX problem as a data structure, as done in [6].

Definition 2 (INDEX problem). The INDEX data structure stores an input string $y \in \{0, 1\}^n$ and supports a query function, which receives an input $i \in [n]$ and outputs $y_i \in \{0, 1\}$, the $i$-th bit of the underlying string.

The following theorem provides a space complexity lower bound for the INDEX problem.

Theorem 6 (see [8]). In the INDEX problem, suppose that the underlying string $y$ is drawn uniformly from $\{0, 1\}^n$ and the input $i$ of the query function is drawn uniformly from $[n]$.
Any (randomized) data structure for INDEX that succeeds with probability at least 2/3 requires $\Omega(n)$ bits of space, where the randomness is taken over both the randomness in the data structure and the randomness of $y$ and $i$.

B Omitted proofs

Proof of Lemma 2.1

Proof. Let $R_{\min} = \inf_{\beta \in B} R(\beta)$. We now prove that a data structure $\tilde{L}(\cdot)$ which approximates logistic loss to relative error can be used to construct an approximation $\tilde{R}(\cdot)$ to the ReLU loss $R(\cdot)$ by setting $\tilde{R}(\beta) = \frac{1}{t}\tilde{L}(t\beta)$ for a large enough constant $t > 0$.

To show this, we start by bounding the approximation error of the logistic loss for a single point. First, we derive the following inequality when $r > 0$:
$$\left|\frac{1}{t}\log\left(1 + e^{tr}\right) - r\right| = \frac{1}{t}\left|\log\left(e^{rt}\right) + \log\left(1 + e^{-rt}\right) - rt\right| = \frac{1}{t}\log\left(1 + e^{-rt}\right) \leq \frac{1}{t}e^{-rt}.$$
Therefore, if $x_i^T\beta > 0$, then $\left|\frac{1}{t}\log(1 + e^{t x_i^T\beta}) - x_i^T\beta\right| < \frac{1}{t}e^{-t x_i^T\beta}$. Next, we consider the case where $r \leq 0$, in which case $\mathrm{ReLU}(r) = 0$. It directly follows that $0 \leq \frac{1}{t}\log(1 + e^{tr}) \leq \frac{e^{tr}}{t}$. Therefore, in both cases,
$$\left|\frac{1}{t}\log\left(1 + e^{tr}\right) - \max\{0, r\}\right| \leq \frac{1}{t}e^{-t|r|} \leq \frac{1}{t}.$$
We use these inequalities to bound the difference between the transformed logistic loss and the ReLU loss as follows:
$$\left|\frac{1}{t}L(t\beta) - R(\beta)\right| = \left|\sum_{i=1}^{n}\left(\frac{1}{t}\log\left(1 + e^{t x_i^T\beta}\right) - \max\{0, x_i^T\beta\}\right)\right| \leq \sum_{i=1}^{n}\left|\frac{1}{t}\log\left(1 + e^{t x_i^T\beta}\right) - \max\{0, x_i^T\beta\}\right| \leq \frac{n}{t}.$$
Therefore, if we set $t = \frac{n}{\epsilon R_{\min}}$, then
$$\left|\frac{1}{t}L(t\beta) - R(\beta)\right| \leq \epsilon R_{\min} \leq \epsilon R(\beta), \qquad (7)$$
for all $\beta \in B$. We can then show that $\tilde{L}(\cdot)$ can be used to approximate $R(\cdot)$ by applying the triangle inequality, the error guarantee of $\tilde{L}(\cdot)$, and eqn. (7). For all $\beta \in B$:
$$\left|\frac{1}{t}\tilde{L}(t\beta) - R(\beta)\right| \leq \left|\frac{1}{t}\tilde{L}(t\beta) - \frac{1}{t}L(t\beta)\right| + \left|\frac{1}{t}L(t\beta) - R(\beta)\right| \leq \frac{\epsilon}{t}L(t\beta) + \epsilon R(\beta) \leq \epsilon\left(R(\beta) + \epsilon R(\beta)\right) + \epsilon R(\beta) \leq 3\epsilon R(\beta).$$
Note that we can set $R_{\min}$ to one and achieve the same error guarantee, since $\inf_{\beta \in B} R(\beta) > 1$. Therefore, storing $t$ takes $O(\log(n/\epsilon))$ bits of extra space.

Proof of Lemma 2.2

Proof. We will first lower bound the space needed by any data structure which approximates ReLU loss to $\epsilon$-relative error. Later, we will show that this implies a lower bound on the space complexity of any data structure for approximating logistic loss. Let $\tilde{R}(\cdot)$ approximate $R(\cdot)$ such that $R(\beta) \leq \tilde{R}(\beta) \leq (1 + \epsilon)R(\beta)$ for all $\beta \in \mathbb{R}^d$. We can rewrite $R(\beta)$ as follows:
$$R(\beta) = \sum_{i=1}^{n}\max\{0, x_i^T\beta\} = \sum_{i=1}^{n}\left(\frac{1}{2}x_i^T\beta + \frac{1}{2}\left|x_i^T\beta\right|\right) = \frac{1}{2}\mathbf{1}^T X\beta + \frac{1}{2}\|X\beta\|_1.$$
We next use the fact that $R(\beta) \leq \|X\beta\|_1$ to get
$$\left|\frac{1}{2}\mathbf{1}^T X\beta + \frac{1}{2}\|X\beta\|_1 - \tilde{R}(\beta)\right| = \left|R(\beta) - \tilde{R}(\beta)\right| \leq \epsilon R(\beta) \leq \epsilon\|X\beta\|_1.$$
We can store $\mathbf{1}^T X$ exactly in $O(d)$ space as a length-$d$ vector. We define a new data structure $H(\cdot)$ such that $H(\beta) = 2\tilde{R}(\beta) - \mathbf{1}^T X\beta$; using the above inequality, $H(\beta)$ satisfies
$$\left|\|X\beta\|_1 - H(\beta)\right| \leq 2\epsilon\|X\beta\|_1 \quad \text{for all } \beta \in \mathbb{R}^d.$$
Therefore, $H(\beta)$ is an $\epsilon$-relative error approximation to $\|X\beta\|_1$, after adjusting for constants, and solves the $\ell_1$-subspace sketch problem (see Definition 1.1 in [6]). By Corollary 3.13 in [6], the data structure $H(\cdot)$ requires $\Omega(\frac{d}{\epsilon^2})$ bits of space if $d = \Omega(\log 1/\epsilon)$ and $n = \Omega(d\epsilon^{-2})$. Therefore, we conclude that any data structure which approximates $R(\beta)$ to $\epsilon$-relative error for all $\beta \in \mathbb{R}^d$ with at least 2/3 probability must use $\Omega(\frac{d}{\epsilon^2})$ bits in the worst case.

The proof in [6] that leads to Corollary 3.13 proceeds by constructing a matrix $A$ ($X$ in our notation) and showing that $\epsilon$-relative error approximations to $\|A\beta\|_1$ require the stated space complexity. We next show that $\mu_y(A) \leq 4$. The construction of $A$ is described directly above Lemma 3.10 in [6]. This matrix $A$ is first set to contain all $2^k$ unique length-$k$ vectors in $\{-1, 1\}^k$ for some value of $k$. Each row of $A$ is then re-weighted: specifically, the $i$-th row of $A$ is re-weighted by some $y_i \in [3, 4]$. For any vector $\beta \in \mathbb{R}^d$, $\|(A\beta)^+\|_1 \leq 4\|(A\beta)^-\|_1$ by the following argument. For an arbitrary $i \in [n]$ with $A_i\beta > 0$, let $A_i = y_i \cdot v$, where $v \in \{-1, 1\}^k$. Then there exists a unique $j \in [n]$ such that $A_j = -y_j \cdot v$, and so $A_j\beta < 0$.
Furthermore, since $y_i \leq 4$ and $y_j \geq 3$, we have $|A_i\beta| = \frac{y_i}{y_j}|A_j\beta| \leq \frac{4}{3}|A_j\beta| \leq 4|A_j\beta|$. By applying this argument to each row of $A$ where $A_i\beta > 0$, we conclude that $\|(A\beta)^+\|_1 \leq 4\|(A\beta)^-\|_1$, and hence $\mu_y(X) \leq 4$.

Finally, we show that $R(\beta; X)$ is lower bounded in this construction. Given any vector $\beta \in \mathbb{R}^d$, there exists a row of $X$ such that $X_{ij}\beta_j \geq 0$ for all $j \in \{1, \ldots, d\}$, that is, the $j$-th entry of $X_i$ has the same sign as $\beta_j$ on all non-zero entries of $\beta$. Therefore, $X_i\beta = y_i\|\beta\|_1$. Since $y_i \geq 3$, we conclude that $R(\beta; X) \geq 3\|\beta\|_1$.

Proof of Lemma 2.3

Proof. Let $A \in \{-1, 1\}^{n\times k}$ be a random matrix where each entry is uniformly sampled from $\{-1, 1\}$ in independent identical trials. Let $\{R_r\}_{r\in[k]}$ be independent Rademacher random variables. Then,
$$\Pr\left(\left|\langle A_i, A_j\rangle\right| \geq t\right) = \Pr\left(\left|\sum_{r=1}^{k} R_r\right| \geq t\right) \leq 2\exp\left(-\frac{t^2}{2k}\right),$$
where the last step follows from applying Hoeffding's inequality (Theorem 5). There are fewer than $n^2$ tuples $(i, j) \in \{1, \ldots, n\}\times\{1, \ldots, n\}$ such that $i \neq j$. Therefore, by an application of the union bound,
$$\Pr\left(\left|\langle A_i, A_j\rangle\right| \geq t \text{ for some } i \neq j\right) \leq n^2 \cdot 2\exp\left(-\frac{t^2}{2k}\right).$$
Setting $t = 4\sqrt{k\log n}$, we see that the right-hand side of the inequality is less than one whenever $n > 1$. Therefore, there must exist a matrix $M$ satisfying the specified criteria.

Proof of Theorem 4

Proof. We now derive a linear programming formulation to compute the complexity measure of Definition 1. Note that we flip the numerator and denominator from Definition 1 without loss of generality. Let $\beta^* \in \mathbb{R}^d$ be (for notational simplicity, we assume $\beta^*$ is unique, but this is not necessary)
$$\beta^* = \underset{\|\beta\|_2 = 1}{\mathrm{argmax}}\ \frac{\|(D_y X\beta)^-\|_1}{\|(D_y X\beta)^+\|_1}, \qquad \text{so that} \qquad \mu_y(X) = \frac{\|(D_y X\beta^*)^-\|_1}{\|(D_y X\beta^*)^+\|_1}.$$
The second equality above uses the fact that the definition of $\mu_y(X)$ does not depend on the scaling of $\beta$. If $C$ is an arbitrary positive constant, then there exists a constant $c > 0$ such that
$$c\,\beta^* = \underset{\beta \in \mathbb{R}^d}{\mathrm{argmax}}\ \|(D_y X\beta)^-\|_1 \ \text{ s.t. } \ \|(D_y X\beta)^+\|_1 \leq C = \underset{\beta \in \mathbb{R}^d}{\mathrm{argmax}}\ \|D_y X\beta\|_1 \ \text{ s.t. } \ \|(D_y X\beta)^+\|_1 \leq C.$$
Again, $\|(D_y X\beta^*)^-\|_1 / \|(D_y X\beta^*)^+\|_1$ is invariant to rescaling of $\beta^*$, so we may assume that $c$ is equal to one without loss of generality. We now reformulate the last constraint as follows:
$$\|(D_y X\beta)^+\|_1 = \sum_{i=1}^{n}\max\{[D_y X\beta]_i, 0\} = \sum_{i=1}^{n}\left(\frac{1}{2}[D_y X\beta]_i + \frac{1}{2}\left|[D_y X\beta]_i\right|\right) = \frac{1}{2}\mathbf{1}^T D_y X\beta + \frac{1}{2}\|D_y X\beta\|_1.$$
Therefore, the above formulation is equivalent to
$$\beta^* = \underset{\beta \in \mathbb{R}^d}{\mathrm{argmax}}\ \|D_y X\beta\|_1 \ \text{ s.t. } \ \frac{1}{2}\mathbf{1}^T D_y X\beta + \frac{1}{2}\|D_y X\beta\|_1 \leq C = \underset{\beta \in \mathbb{R}^d}{\mathrm{argmin}}\ \mathbf{1}^T D_y X\beta \ \text{ s.t. } \ \|D_y X\beta\|_1 \leq C.$$
Next, we replace $D_y X\beta$ with a single vector $z \in \mathbb{R}^n$ and a linear constraint to guarantee that $z \in \mathrm{Range}(D_y X)$. Let $P_R \in \mathbb{R}^{n\times n}$ be the orthogonal projection onto $\mathrm{Range}(D_y X)$. Then,
$$z^* = \underset{z \in \mathbb{R}^n}{\mathrm{argmin}}\ \mathbf{1}^T z \quad \text{s.t.} \quad \|z\|_1 \leq C \ \text{ and } \ (I - P_R)z = 0. \qquad (8)$$
Next, we solve this formulation by constructing a linear program in which $[z^+, z^-] \in \mathbb{R}^{2n}$ corresponds to the absolute values of the positive and negative elements of $z$, namely
$$[z^{+*}, z^{-*}] = \underset{z^+, z^- \in \mathbb{R}^{n}}{\mathrm{argmin}}\ \mathbf{1}_n^T(z^+ - z^-) \quad \text{s.t.} \quad \mathbf{1}_n^T(z^+ + z^-) \leq C, \quad (I - P_R)(z^+ - z^-) = 0, \quad z^+, z^- \geq 0. \qquad (9)$$
Observe that this is a linear program with $2n$ variables and $4n$ constraints. After solving this program for $z^{+*}$ and $z^{-*}$, we can compute $z^* = z^{+*} - z^{-*}$. From this, we can compute $\beta^*$ by solving the linear system $z^* = D_y X\beta^*$, which is guaranteed to have a solution by the linear constraint $(I - P_R)z^* = 0$. After solving for $\beta^*$, we can compute $\mu_y(X) = \frac{\|(D_y X\beta^*)^-\|_1}{\|(D_y X\beta^*)^+\|_1}$, thus completing the proof.
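The program in eqn. (9) can be handed to any off-the-shelf LP solver. The following minimal sketch (our own illustration, not the authors' released code) sets it up with scipy.optimize.linprog; the choice $C = 1$, the SVD-based projector, and the use of the HiGHS backend are implementation assumptions rather than prescriptions from the paper.

```python
import numpy as np
from scipy.optimize import linprog

def mu_exact(X, y, C=1.0):
    """Compute mu_y(X) via the LP in eqn. (9); assumes the measure is finite."""
    A = y[:, None] * X                       # A = D_y X
    n = A.shape[0]

    # Orthogonal projector onto Range(D_y X), built from an SVD.
    U = np.linalg.svd(A, full_matrices=False)[0]
    I_minus_P = np.eye(n) - U @ U.T

    # Variables u = [z_plus, z_minus] >= 0.
    c = np.concatenate([np.ones(n), -np.ones(n)])              # 1^T z_plus - 1^T z_minus
    A_ub = np.concatenate([np.ones(n), np.ones(n)])[None, :]   # 1^T (z_plus + z_minus) <= C
    b_ub = np.array([C])
    A_eq = np.hstack([I_minus_P, -I_minus_P])                  # (I - P)(z_plus - z_minus) = 0
    b_eq = np.zeros(n)

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * (2 * n), method="highs")
    z = res.x[:n] - res.x[n:]

    # Recover beta* from z* = D_y X beta* and evaluate the flipped ratio,
    # mu_y(X) = ||(D_y X beta*)^-||_1 / ||(D_y X beta*)^+||_1 (Theorem 4).
    beta = np.linalg.lstsq(A, z, rcond=None)[0]
    w = A @ beta
    return -np.sum(w[w < 0]) / np.sum(w[w > 0])

# Example usage on a small random (non-separable) instance.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = np.sign(X @ rng.standard_normal(5) + 0.5 * rng.standard_normal(200))
print(mu_exact(X, y))
```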
C Modified linear program

For completeness, we reproduce the linear program of Munteanu et al. [10, Section A] used to estimate the complexity measure $\mu_y(X)$:
$$\begin{aligned} \min\ & \sum_{i=1}^{n}(a_i + b_i) \\ \text{s.t. }\ & (U\beta)_i = a_i - b_i \quad \forall i \in [n] \\ & \beta_i = c_i - d_i \quad \forall i \in [d] \\ & \textstyle\sum_{i=1}^{d}(c_i + d_i) \leq 1 \\ & a_i, b_i \geq 0 \quad \forall i \in [n] \\ & c_i, d_i \geq 0 \quad \forall i \in [d]. \end{aligned}$$
Here, $U$ is an $\ell_1$ well-conditioned basis [2] of the matrix $D_y X$. Because of the well-conditioned property, $\mu_y(X)$ can be estimated to lie within the bounds $t \leq \mu_y(X) \leq t \cdot \mathrm{poly}(d)$, where $t = \min_{\|\beta\|_1 = 1}\|U\beta\|_1$, and recall that $d$ is the number of columns of $X$.

Munteanu et al. [10] designed the above linear program to solve for $t$. However, note that the LP as written can be trivially solved by setting $c_i = d_i = \beta_i = a_i = b_i = 0$ for all $i$, which unfortunately always gives $t = 0$. To get around this problem, we modify the above program as follows:
$$\begin{aligned} \min\ & \sum_{i=1}^{n}(a_i + b_i) \\ \text{s.t. }\ & (U\beta)_i = a_i - b_i \quad \forall i \in [n] \\ & h_i = c_i - \beta_i \quad \forall i \in [d] \\ & h_i = d_i + \beta_i \quad \forall i \in [d] \\ & c_i \leq M v_i \quad \forall i \in [d] \\ & d_i \leq M(1 - v_i) \quad \forall i \in [d] \\ & \textstyle\sum_{i=1}^{d} h_i \geq 1 \\ & a_i, b_i \geq 0 \quad \forall i \in [n] \\ & c_i, d_i, h_i \geq 0 \quad \forall i \in [d] \\ & v_i \in \{0, 1\} \quad \forall i \in [d]. \end{aligned}$$
Here, $M$ is a sufficiently large value, as is often used for big-M constraints in linear programs (see, e.g., Linear and Nonlinear Optimization, 2nd ed., Society for Industrial and Applied Mathematics). The variable $h_i$ simulates $|\beta_i|$ and is set according to the binary variable $v_i$, which decides which one of $c_i$ or $d_i$ is zero. Further, note that the constraint $\sum_i h_i \geq 1$ is equivalent to $\sum_i h_i = 1$ here, since scaling down the norm of $\beta$ can only bring the optimization cost down. To solve the optimization problem, we use Google OR-Tools and the wrapper pywraplp with the SAT solver, which can handle integer programs, since the above program is no longer a pure linear program because of the binary variables $v_i$.

D Low rank approximation to logistic loss

Here, we provide a very simple data structure that yields an additive error approximation to the logistic loss. While the method is straightforward, we are unaware of this approximation being specified in prior work, and it may be useful in the natural setting where the input matrix $X$ has low stable rank. We show that any low-rank approximation $\tilde{X}$ of the data matrix $X$ can be used to approximate the logistic loss function $L(\beta)$ up to a $\sqrt{n}\,\|X - \tilde{X}\|_2\,\|\beta\|_2$ additive error. The factor $\|X - \tilde{X}\|_2$ is the spectral norm (or two-norm) error of the low-rank approximation, and we also prove that this bound is tight in the worst case. Low-rank approximations are commonly used to reduce the time and space complexity of numerical algorithms, especially in settings where the data matrix $X$ is numerically low-rank or has a decaying spectrum of singular values. Using low-rank approximations of $X$ to estimate the logistic loss is appealing due to the extensive body of work on fast constructions of low-rank approximations via sketching, sampling, and direct methods [5]. We show that a spectral approximation provides an additive error guarantee for the logistic loss and that this guarantee is tight on worst-case inputs.

Theorem 7. If $X, \tilde{X} \in \mathbb{R}^{n\times d}$, then for all $\beta \in \mathbb{R}^d$,
$$\left|L(\beta; X) - L(\beta; \tilde{X})\right| \leq \sqrt{n}\,\|X - \tilde{X}\|_2\,\|\beta\|_2.$$

Proof. To simplify the notation, let $s = X\beta$ and $d = (\tilde{X} - X)\beta$. We can then write the difference in the logistic loss as
$$\left|L(\beta; X) - L(\beta; \tilde{X})\right| = \left|\sum_{i=1}^{n}\log\left(1 + e^{s_i}\right) - \sum_{i=1}^{n}\log\left(1 + e^{s_i + d_i}\right)\right| \leq \sum_{i=1}^{n}\left(\log\left(1 + e^{s_i + |d_i|}\right) - \log\left(1 + e^{s_i}\right)\right) = \sum_{i=1}^{n}\log\frac{1 + e^{|d_i|}e^{s_i}}{1 + e^{s_i}} = \sum_{i=1}^{n}\log\left(e^{|d_i|}\cdot\frac{e^{-|d_i|} + e^{s_i}}{1 + e^{s_i}}\right) \leq \sum_{i=1}^{n}|d_i| = \|d\|_1,$$
where the first inequality uses the monotonicity and convexity of $u \mapsto \log(1 + e^u)$. Therefore, we can conclude that
$$\left|L(\beta; X) - L(\beta; \tilde{X})\right| \leq \|d\|_1 \leq \sqrt{n}\,\|d\|_2 \leq \sqrt{n}\,\|X - \tilde{X}\|_2\,\|\beta\|_2.$$
We note that Theorem 7 holds for any matrix $\tilde{X} \in \mathbb{R}^{n\times d}$ that approximates $X$ with respect to the spectral norm, and does not necessitate that $\tilde{X}$ be of low rank.
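The following minimal sketch (our own numerical illustration, not code from the paper) forms a rank-$r$ truncated SVD of a synthetic matrix with a decaying spectrum and checks the additive bound of Theorem 7; the sizes, rank, and seed are arbitrary choices, and the loss is written in the $y_i = -1$ standard form.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 2000, 50, 10

# A matrix with a decaying spectrum, where a low-rank approximation is natural.
U = np.linalg.qr(rng.standard_normal((n, d)))[0]
V = np.linalg.qr(rng.standard_normal((d, d)))[0]
sigma = 1.0 / (np.arange(1, d + 1) ** 2)
X = (U * sigma) @ V.T

# Rank-r truncated SVD X_r, the best rank-r approximation in spectral norm.
U2, s2, Vt2 = np.linalg.svd(X, full_matrices=False)
X_r = (U2[:, :r] * s2[:r]) @ Vt2[:r]

def logistic_loss(M, beta):
    # Per-point loss log(1 + exp(m_i^T beta)).
    return np.sum(np.logaddexp(0.0, M @ beta))

beta = rng.standard_normal(d)
lhs = abs(logistic_loss(X, beta) - logistic_loss(X_r, beta))
rhs = np.sqrt(n) * np.linalg.norm(X - X_r, 2) * np.linalg.norm(beta)
print(lhs, rhs, lhs <= rhs)   # the additive bound of Theorem 7 holds
```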
We now provide a matching lower bound for the logistic loss function in the same setting.

Theorem 8. For every $d, n \in \mathbb{N}$ with $d \geq n$ and every $\delta > 0$, there exists a data matrix $X \in \mathbb{R}^{n\times d}$, label vector $y \in \{-1, 1\}^n$, parameter vector $\beta \in \mathbb{R}^d$, and spectral approximation $\tilde{X} \in \mathbb{R}^{n\times d}$ such that
$$\left|L(\beta; X) - L(\beta; \tilde{X})\right| \geq (1 - \delta)\,\sqrt{n}\,\|X - \tilde{X}\|_2\,\|\beta\|_2.$$
Hence, the guarantee of Theorem 7 is tight in the worst case.

Proof. To prove the theorem's statement, we first consider the case of square matrices ($d = n$). In particular, first consider the case where $d = n = 1$, where $X = [x]$ and $\tilde{X} = [x + s]$, in which case $\|X - \tilde{X}\|_2 = s$. Then,
$$\lim_{x\to\infty}\left(L([1]; [x + s]) - L([1]; [x])\right) = \lim_{x\to\infty}\left(\log\left(1 + e^{x+s}\right) - \log\left(1 + e^{x}\right)\right) = s,$$
which shows that, for $\beta = [1]$ and $x$ of large enough magnitude, $L(\beta; \tilde{X}) - L(\beta; X) \geq (1 - \delta)\,\|X - \tilde{X}\|_2$.

Next, let $X = x\,I_n$, $\tilde{X} = (x + s)\,I_n$, and $\beta = \mathbf{1}_n$. Then, for all $i \in [n]$, $x_i^T\beta = x$ and $\tilde{x}_i^T\beta = x + s$. Therefore,
$$\lim_{x\to\infty}\left(L(\beta; \tilde{X}) - L(\beta; X)\right) = \lim_{x\to\infty} n\left(\log\left(1 + e^{x+s}\right) - \log\left(1 + e^{x}\right)\right) = sn.$$
Since $\|X - \tilde{X}\|_2 = \|s\,I_n\|_2 = s$ and $\|\beta\|_2 = \sqrt{n}$,
$$\lim_{x\to\infty}\left(L(\beta; \tilde{X}) - L(\beta; X)\right) = sn = \sqrt{n}\,\|X - \tilde{X}\|_2\,\|\beta\|_2.$$
Hence, we conclude the statement of the theorem for the case where $d = n$. To conclude the case where $d \geq n$, note that $\sqrt{n}\,\|X - \tilde{X}\|_2\,\|\beta\|_2$ does not change if we extend $X$ and $\tilde{X}$ with columns of zeroes and extend $\beta$ with entries of zero until $X, \tilde{X} \in \mathbb{R}^{n\times d}$ and $\beta \in \mathbb{R}^d$. This procedure also does not change the loss at $\beta$, hence we conclude the statement of the theorem.

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: We state our contributions in Section 1.1 and link to the results in the paper.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: We mention that we do not focus on lower order (logarithmic) factors in our theoretical analysis. Additionally, we point out more comprehensive experiments in computing $\mu_y(X)$ for real datasets as an open direction.
If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: We have rigorously derived all theoretical results and provided clear references to the results we depend on in prior work. Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: Code and instructions to reproduce our results are provided in the supplementary material. Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. 
For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: Code and instructions to reproduce our results are provided in the supplementary material. Guidelines: The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. Experimental Setting/Details Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: We do not provide a learning algorithm, so there are no data splits or training. Code and instructions to reproduce our results are provided in the supplementary material. Guidelines: The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material. 7. 
Experiment Statistical Significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [Yes] Justification: We provide error bars in are figures that are sufficient to show statistical significance of the observed behavior. Guidelines: The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text. 8. Experiments Compute Resources Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [No] Justification: Our focus is primarily theoretical, and our experiments do not require significant compute resources. Guidelines: The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper). 9. Code Of Ethics Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines? Answer: [Yes] Justification: We follow all of the relevant ethical guidelines. Guidelines: The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction). 10. Broader Impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? 
Answer: [NA]
Justification: While we expect our work to benefit the development of methods for logistic regression, our work is far upstream from any applications, and hence it is not possible to productively consider broader impact.
Guidelines:
- The answer NA means that there is no societal impact of the work performed.
- If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
- Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: Our work is not related to any of these mentioned areas.
Guidelines:
- The answer NA means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: We give proper attribution for the code we use from prior work.
Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [Yes]
Justification: We provide code with documentation in the supplementary material.
Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification:
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification:
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.