Truthful High Dimensional Sparse Linear Regression

Liyang Zhu¹, Amina Manseur¹, Meng Ding², Jinyan Liu³, Jinhui Xu², Di Wang¹
¹PRADA Lab, King Abdullah University of Science and Technology
²State University of New York at Buffalo
³Beijing Institute of Technology
{liyang.zhu, amina.manseur, di.wang}@kaust.edu.sa, {mengding, jinhu}@buffalo.edu, jyliu@bit.edu.cn

Abstract

We study the problem of fitting the high dimensional sparse linear regression model with sub-Gaussian covariates and responses, where the data are provided by strategic or self-interested agents (individuals) who prioritize the privacy of their data disclosure. In contrast to the classical setting, our focus is on designing mechanisms that can effectively incentivize most agents to truthfully report their data while preserving the privacy of individual reports. Simultaneously, we seek an estimator that is close to the underlying parameter. We attempt to solve the problem by deriving a novel private estimator that has a closed-form expression. Based on this estimator, we propose a mechanism which has the following properties via an appropriate design of the computation and payment scheme: (1) the mechanism is (o(1), O(n^{-Ω(1)}))-jointly differentially private, where n is the number of agents; (2) it is an o(1/n)-approximate Bayes Nash equilibrium for a (1 - o(1))-fraction of agents to truthfully report their data; (3) the output achieves an error of o(1) to the underlying parameter; (4) it is individually rational for a (1 - o(1))-fraction of agents in the mechanism; (5) the payment budget required from the analyst to run the mechanism is o(1). To the best of our knowledge, this is the first study on designing truthful (and privacy-preserving) mechanisms for high dimensional sparse linear regression.

1 Introduction

One fundamental learning task is estimating the linear regression model, with a wide array of applications ranging from statistics to experimental sciences like medicine [7, 23] and sociology [36]. These studies typically assume that analysts have high-quality data, which is essential to the success of the model. However, real-world data, such as medical records and census surveys, often contain sensitive information sourced from strategic, privacy-concerned individuals. In this case, data providers (agents)¹ may be disinclined to reveal their data truthfully, potentially jeopardizing the accuracy of model estimation. Therefore, in contrast to conventional statistical settings, it becomes imperative to model the utility functions of individuals and engineer mechanisms that can concurrently yield precise estimators, safeguard the privacy of individual reports, and encourage the majority of individuals to candidly disclose their data to the analyst. The problem involves two intertwined components: data acquisition and privacy-preserving data analysis. The analyst must strategically compensate agents for potential privacy violations, considering the alignment of their reported data with both the statistical model and their peers' contributions, while minimizing the payment budget. Additionally, the analyst must perform privacy-preserving computations to accurately learn the underlying model. As agents cannot refine their reports after seeing their payment, the interaction is designed to be completed in one round, creating a trade-off between estimator accuracy and the total payment budget needed for participant cooperation.
¹In this paper, individuals, data providers, and participants are the same and all represent agents.

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

In recent years, there has been a line of work studying truthful and privacy-preserving linear models, such as [2, 37, 14, 38]. However, due to the complex nature of the problem, all of them only consider the low dimensional case, where the dimension of the feature vector (covariate) is much lower than the sample size, i.e., they need to assume the dimension is a constant. In practice, we often encounter the high dimensional sparse case, where the dimension of the feature vector is far greater than the sample size but the underlying parameter has an additional sparse structure. While the (high dimensional sparse) linear model has been widely studied in statistics and privacy-preserving machine learning [39, 24, 3, 52, 61, 53, 49, 28], to the best of our knowledge, there is no previous study on truthfully and privately estimating high dimensional sparse linear models, due to some intrinsic challenges of the problem (see Section C in the Appendix for details). Therefore, a natural question to ask is: Can we fit the high dimensional sparse linear regression model and design a mechanism that incentivizes most agents to truthfully report their data while preserving the privacy of individuals?

In this paper, we answer the question in the affirmative by providing the first study on the trade-off between privacy, the accuracy of the estimator, and the total payment budget for high dimensional sparse linear models, where the dimension d can be far greater than the sample size n, while the sparsity k and log d, which are far less than n, can be considered as constants. Specifically, for the privacy-preserving data analysis part, we adopt the definition of Joint Differential Privacy (JDP) [31] to protect individuals' data and develop a novel closed-form and JDP estimator for sparse linear regression, which is significantly different from the low dimensional case and can be applied to other problems. Moreover, we develop a general DP estimator for the ℓ_p-sparse covariance matrix with p ∈ [0, 1) as a by-product, which extends the previous results on the ℓ_0-sparse case. Based on our JDP estimator, via the peer prediction method, we then provide a payment mechanism that can incentivize almost all participants to truthfully report their data with a small total payment budget. In detail, our mechanism has the following properties. (1) The mechanism preserves privacy for individuals' reported data, i.e., the output of the mechanism is (o(1), O(n^{-Ω(1)}))-JDP, where n is the number of agents. (2) The private estimator of the mechanism is o(1)-accurate, i.e., when the number of agents increases, our private estimator will be sufficiently close to the underlying parameter. (3) The mechanism is asymptotically truthful, i.e., it is an o(1/n)-approximate Bayes Nash equilibrium for a (1 - o(1))-fraction of agents to truthfully report their data. (4) The mechanism is asymptotically individually rational, i.e., the utilities of a (1 - o(1))-fraction of agents are non-negative. (5) The mechanism only requires an o(1) payment budget, i.e., when the number of participants increases, the total payment tends to zero.

2 Preliminaries

Notations. Given a matrix X ∈ R^{n×d}, let x_i^T be its i-th row and x_{ij} (or [X]_{ij}) be its (i, j)-th entry (which is also the j-th element of the vector x_i).
For any p ∈ [1, ∞], ‖X‖_p is the p-norm, i.e., ‖X‖_p := sup_{y≠0} ‖Xy‖_p / ‖y‖_p, and ‖X‖_{∞,∞} = max_{i,j} |x_{ij}| is the max norm of the matrix X. For an event A, we let I[A] denote the indicator, i.e., I[A] = 1 if A occurs, and I[A] = 0 otherwise. The sign function of a real number x is a piece-wise function defined as sgn(x) = -1 if x < 0; sgn(x) = 1 if x > 0; and sgn(x) = 0 if x = 0. We also use λ_min(X) to denote the minimal eigenvalue of X. For a sub-Gaussian random variable X, its sub-Gaussian norm ‖X‖_{ψ2} is defined as ‖X‖_{ψ2} = inf{c > 0 : E[exp(X²/c²)] ≤ 2} (see Appendix E for more preliminaries).

2.1 Problem Setting

We consider the problem of sparse linear regression in the high dimensional setting where d ≫ n. Suppose that we have a data universe D = X × Y ⊆ R^d × R and n agents in the population, where each agent i has a feature vector x_i ∈ X and a response variable y_i ∈ Y (we denote D_i = (x_i, y_i) and D as the whole dataset). We assume that {(x_i, y_i)}_{i=1}^n are i.i.d. sampled from a sparse linear regression model, i.e., each (x_i, y_i) is a realization of the sparse linear regression model y = ⟨θ*, x⟩ + ζ, where the distribution of x has mean zero, ζ is some randomized noise that satisfies E[ζ | x] = 0, and θ* ∈ R^d is the underlying parameter with the sparsity assumption ‖θ*‖_0 ≤ k. In the following, we provide some assumptions related to the model. See Section A in the Appendix for a table of notations.

Assumption 1. We assume ‖θ*‖_2 ≤ 1. Moreover, for the covariance matrix Σ of x, there exist κ_∞ and κ_x such that ‖Σw‖_∞ ≥ κ_∞ ‖w‖_∞ for all w ≠ 0, and ‖Σ^{-1/2} x‖_{ψ2} ≤ κ_x.²

²To make our results comparable to the previous results, we assume κ_∞ and κ_x are constants for simplicity.

These assumptions have been commonly adopted in some relevant studies such as [52, 14, 3]. Next, we present the assumptions made on the covariate vectors x_i and the response variables y_i.

Assumption 2. We assume that the covariates (feature vectors) x_1, x_2, ..., x_n ∈ R^d are i.i.d. (zero-mean) sub-Gaussian random vectors with variance σ²/d, where σ = O(1).³ The responses y_i are i.i.d. (zero-mean) sub-Gaussian random variables with variance σ², with σ = O(1).

³In order to make our result comparable to previous work on linear regression with bounded covariates, we assume the variance proxy is σ²/d so that with high probability ‖x_i‖_2 is bounded by some constant.

In our setting, the agents are self-interested or strategic. They are concerned about their privacy and can misreport their responses {y_i}_{i=1}^n but not their features {x_i}_{i=1}^n. This means the features are directly observable, but the response (e.g., during a physical examination) is unverifiable. We denote by D̂ = {D̂_i = (x_i, ŷ_i)}_{i=1}^n the reported dataset, where ŷ_i = σ_i(D_i) is the reporting strategy adopted by agent i. Specifically, each agent is characterized by a privacy cost coefficient c_i ∈ R_+. A higher c_i indicates that agent i is more concerned about the privacy violation. Apart from the agents, there is an analyst who seeks an accurate estimator θ̃ of θ* based on the reported data and needs to construct a payment rule π : D^n → Π^n that encourages truthful participation, i.e., agents revealing their data truthfully to the analyst. Misreporting the response y_i will result in a decrease in the received payment π_i. The analyst will thus design a mechanism M that takes the reported dataset D̂ = {D̂_i = (x_i, ŷ_i)}_{i=1}^n as input and outputs an estimator θ̃ of θ* and a set of non-negative payments {π_i}_{i=1}^n for the agents. This mechanism should in turn satisfy some privacy guarantee for the reports provided by agents. Moreover, the desired mechanism must constrain the payments to an asymptotically small budget. All of the above discussion depends upon the rationality of each agent.
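To make the sampling model concrete, the following is a minimal Python sketch that draws a synthetic dataset from the sparse linear regression model above, using Gaussian covariates with per-coordinate variance σ²/d as one particular instance of the zero-mean sub-Gaussian assumption; the function name, the noise level, and the way the sparse parameter is generated are illustrative choices rather than constructions from the paper.

```python
import numpy as np

def make_sparse_regression_data(n, d, k, sigma=1.0, noise_std=0.1, seed=0):
    """Draw (x_i, y_i) pairs from y = <theta*, x> + zeta with a k-sparse theta*."""
    rng = np.random.default_rng(seed)
    # k-sparse ground-truth parameter, rescaled so that ||theta*||_2 <= 1 (Assumption 1).
    theta_star = np.zeros(d)
    support = rng.choice(d, size=k, replace=False)
    theta_star[support] = rng.standard_normal(k)
    theta_star /= max(1.0, np.linalg.norm(theta_star))
    # Covariates: zero-mean Gaussian with per-coordinate variance sigma^2/d, so that
    # ||x_i||_2 is bounded by a constant with high probability (Assumption 2).
    X = rng.standard_normal((n, d)) * (sigma / np.sqrt(d))
    # Responses: linear signal plus zero-mean noise, so E[zeta | x] = 0.
    y = X @ theta_star + noise_std * rng.standard_normal(n)
    return X, y, theta_star

X, y, theta_star = make_sparse_regression_data(n=200, d=1000, k=5)
```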
2.2 Differential Privacy

Due to the space limit, we postpone all the relevant definitions of DP to the appendix; readers can refer to Section A.1. In our case, DP requires that all outputs, including the payments allocated to agents, are insensitive to each agent's input. This requirement is quite strict, as the payment to each agent is not shared publicly or with other agents. Therefore, instead of using the original DP, we consider a relaxation known as joint differential privacy (JDP) [31].

2.3 Utility of Agents

Based on the definition of JDP, we introduce in this section our model of agent utilities. We assume that each agent i is characterized by her privacy cost parameter c_i ∈ R_+, representing how concerned she is about the privacy violation in the case she truthfully reports y_i to the analyst. We also introduce her privacy cost function f_i(c_i, ε, δ), which measures the cost she incurs when her response y_i is used in an (ε, δ)-jointly differentially private mechanism. Considering the payment π_i of agent i and her privacy cost function f_i(c_i, ε, δ), we denote by u_i = π_i - f_i(c_i, ε, δ) her utility function, which represents the utility she gets when she reports her response y_i truthfully to the analyst. Following the previous works, we assume that all functions f_i are bounded by a function of c_i and ε, increasing with ε.

Assumption 3. The privacy cost function of each agent i satisfies f_i(c_i, ε, δ) ≤ c_i(1 + δ)ε³.

Larger values of ε and δ imply weaker privacy guarantees, which means the privacy cost of an agent becomes larger. Thus, it is natural to let f_i be bounded by a component-wise increasing function, which we denote by F(c_i, ε, δ). Here F(c_i, ε, δ) = c_i(1 + δ)ε³. In [14], the authors consider the case where the response is bounded and δ = 0; they assume f_i(c_i, ε, δ) ≤ c_i ε², i.e., F(c_i, ε, δ) = c_i ε². However, since our Assumption 2 on the distribution of the response is more relaxed, we need stronger assumptions regarding the privacy cost function and the distribution of privacy cost coefficients.

We assume that the privacy cost parameters are also random variables, sampled from a distribution C. It is intuitive to believe that the privacy cost c_i for each individual does not reflect the privacy costs incurred by other individuals. Also, we allow c_i to be correlated with its corresponding data sample D_i. Therefore, we make the following assumption.

Assumption 4. Given D_i, (D_{-i}, c_{-i}) is conditionally independent of c_i, for each i ∈ [n]:
p(D_{-i}, c_{-i} | D_i, c_i) = p(D_{-i}, c_{-i} | D_i),
where c_{-i} represents the set of privacy costs excluding the privacy cost of agent i.

In addition, we make the same assumption on the tail of C as in [37], namely that the probability distribution of c_i has exponential decay. This assumption is essential for providing a bound on the threshold value τ_{α,β} of truthful reporting, which will be explained in the following section.

Assumption 5. There exists some constant λ > 0 such that the conditional distribution of the privacy cost coefficient satisfies inf_{D_i} P_{c_i ∼ p(c_i|D_i)}(c_i ≤ τ) ≥ 1 - e^{-λτ}.
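The following short sketch illustrates the utility model u_i = π_i - f_i(c_i, ε, δ), evaluated at the upper bound of Assumption 3, together with privacy cost coefficients drawn from an exponential distribution, which is one simple distribution satisfying the tail condition of Assumption 5. Function names and numeric values are illustrative assumptions, not part of the mechanism itself.

```python
import numpy as np

def privacy_cost_upper_bound(c_i, eps, delta):
    """Upper bound F(c_i, eps, delta) = c_i * (1 + delta) * eps^3 from Assumption 3."""
    return c_i * (1.0 + delta) * eps ** 3

def agent_utility(payment, c_i, eps, delta):
    """Worst-case utility u_i = pi_i - f_i(c_i, eps, delta), evaluated at the bound F."""
    return payment - privacy_cost_upper_bound(c_i, eps, delta)

# Privacy cost coefficients with an exponentially decaying tail (Assumption 5):
# an Exponential(lambda) distribution satisfies P(c_i <= tau) = 1 - exp(-lambda * tau).
rng = np.random.default_rng(0)
lam, n = 2.0, 1000
costs = rng.exponential(scale=1.0 / lam, size=n)
print(agent_utility(payment=0.05, c_i=costs[0], eps=0.1, delta=1e-4))
```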
2.4 Truthful Mechanism Properties

Following the previous work, we propose to design mechanisms that satisfy the following properties: (1) truthful reporting is an equilibrium; (2) the private estimator output should be accurate; (3) the utilities of almost all agents are ensured to be non-negative; and (4) the total payment budget required from the analyst to run the mechanism is small. These concepts will be measured and evaluated using the framework of a multi-agent, one-shot, simultaneous-move symmetric Bayesian game, in which the behavior of the agents is modeled. Due to the space limit, we defer the rigorous definitions of the truthful mechanism properties to the Appendix. Readers can refer to Section A.2 for details.

3 Main Results

In this section, we design truthful and private mechanisms for our problem. Generally speaking, our method consists of two components: a closed-form JDP estimator for high dimensional sparse linear regression and a payment mechanism. In the following, we first introduce our private estimator for the original dataset D.

3.1 Novel Efficient Private Estimator

As mentioned in Section 1, the one-round communication constraint necessitates a closed-form estimator. For this reason, in the low dimensional setting, previous methods did not favour the LASSO, as it has no closed-form solution and thus cannot be used. [14, 37] are motivated by the closed-form solution of the ordinary least squares (OLS) or ridge regression (see Section C for a detailed discussion). However, it is well known that for the high dimensional sparse setting these two estimators come with less favorable guarantees when compared to the LASSO. When employed as a regularization term, the ℓ2-norm does not encourage sparsity to the same extent as the ℓ1-norm. Thus, all previous methods for truthful linear regression cannot be applied to our problem.

In the following, we derive a new estimator based on a variant of the classical OLS estimator, tailored for the high dimensional sparse setting. Our aim is twofold: first, to achieve convergence rates similar to the LASSO, and second, to exploit the inherent sparsity of the model. We initiate our analysis with an initial, albeit unrealistic, "scaffolding" assumption that the inverse of the empirical covariance matrix (Σ̂_{XX})^{-1} exists, with Σ̂_{XX} = (1/n) ∑_{i=1}^n x_i x_i^T, which will be removed in the course of our analysis. Intuitively, our goal is to find a sparse estimator that is close to OLS, i.e., argmin_θ ‖θ - Σ̂_{XX}^{-1} Σ̂_{XY}‖_2², s.t. ‖θ‖_0 ≤ k, whose ℓ1 convex relaxation of the ℓ0 constraint is

argmin_θ ‖θ - Σ̂_{XX}^{-1} Σ̂_{XY}‖_2² + λ_n ‖θ‖_1    (1)

with some λ_n > 0. Although (1) is very similar to the LASSO, fortunately, the above minimizer is just the proximal operator of the ℓ1-norm applied to OLS, Prox_{λ_n‖·‖_1}(Σ̂_{XX}^{-1} Σ̂_{XY}), which has a closed-form expression. Since the proximal operator is separable with respect to both vectors, θ and Σ̂_{XX}^{-1} Σ̂_{XY}, we have

(Prox_{λ_n‖·‖_1}(Σ̂_{XX}^{-1} Σ̂_{XY}))_i = sgn((Σ̂_{XX}^{-1} Σ̂_{XY})_i) max{|(Σ̂_{XX}^{-1} Σ̂_{XY})_i| - λ_n, 0}.

And the optimal solution θ̂ satisfies:

θ̂ = S_{λ_n}(Σ̂_{XX}^{-1} Σ̂_{XY}),    (2)

where for a given thresholding parameter λ, the element-wise soft-thresholding operator S_λ : R^d → R^d is defined, for any u ∈ R^d, by [S_λ(u)]_i = sgn(u_i) max(|u_i| - λ, 0). A key insight is that this operation is equivalent to embedding the classical OLS estimator within the sparsity constraint. This secures ℓ2-norm consistency, which holds for neither the ridge regression nor the original OLS estimator.
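The closed-form estimator in (2) is straightforward to state in code. Below is a minimal sketch of the soft-thresholding operator S_λ and the non-private estimator S_{λ_n}(Σ̂_{XX}^{-1} Σ̂_{XY}); it relies on the scaffolding assumption that the empirical covariance matrix is invertible, which the private algorithm later removes. The function names are illustrative.

```python
import numpy as np

def soft_threshold(u, lam):
    """Element-wise soft-thresholding: S_lambda(u)_i = sgn(u_i) * max(|u_i| - lam, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def closed_form_sparse_estimator(X, y, lam):
    """Non-private estimator theta_hat = S_lambda(Sigma_XX^{-1} Sigma_XY) from Eq. (2).

    Assumes the empirical covariance matrix is invertible (the 'scaffolding' assumption);
    Algorithm 1 later replaces it with a thresholded, noisy private estimate.
    """
    n = X.shape[0]
    sigma_xx = X.T @ X / n                      # empirical covariance Sigma_XX
    sigma_xy = X.T @ y / n                      # empirical cross-covariance Sigma_XY
    ols = np.linalg.solve(sigma_xx, sigma_xy)   # OLS direction Sigma_XX^{-1} Sigma_XY
    return soft_threshold(ols, lam)
```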
The ℓ1 regularization serves to minimize the structural complexity of the parameter under constraints. Importantly, the estimator above is available in closed form. Our next focus centers on the privatization scheme. Prior work on truthful linear regression [14, 37] employed the output perturbation method, which adds some noise to the closed-form estimator. The noise should be calibrated by the ℓ2-sensitivities of their non-private estimators to ensure DP, which depend on Poly(√(d/n)). This is unacceptable when d ≫ n. Should we adopt the same strategy to analyze the sensitivity of S_{λ_n}(Σ̂_{XX}^{-1} Σ̂_{XY}), we inevitably encounter the estimation error of Σ̂_{XX}^{-1}. This error will cause the noise to have an unavoidable dependency on Poly(√(d/n)) [46].

Given the previous discussion, we propose to inject Gaussian noise separately into Σ̂_{XX} and Σ̂_{XY}. It is notable that while similar methods have been used in DP statistical estimation [40, 48], they have not been used in truthful linear regression. This is because in our problem the data may be misreported, which makes the utility analysis more difficult. We will discuss this in Section 4.1. For the term Σ̂_{XX}, we apply ℓ2-norm clipping, x̄_i = Π_r(x_i), so that the sensitivity of the clipped version Σ̂_{X̄X̄} is irrelevant to d. For the term Σ̂_{XY}, since the covariates and the responses y_i are sub-Gaussian and hence unbounded, a clipping operation also becomes necessary. We shrink each coordinate of x_i via the parameter τ_x, i.e., for j ∈ [d], x̃_{ij} = sgn(x_{ij}) min{|x_{ij}|, τ_x}. We also perform element-wise clipping on the response y_i by τ_y, obtaining ỹ_i. Then the server aggregates the terms x̄_i x̄_i^T and x̃_i ỹ_i and separately adds Gaussian noise to Σ̂_{X̄X̄} and Σ̂_{X̃Ỹ}. The noisy versions are denoted by Σ̃_{X̄X̄} and Σ̃_{X̃Ỹ}.

We now give a second thought to the estimation of the covariance matrix. As mentioned earlier, we used the scaffolding assumption on the invertibility of the matrix Σ̂_{XX}, which does not hold, as the matrix is rank-deficient if d ≫ n. To mitigate this issue, we need to impose additional assumptions, and here we switch to the sparsity assumption on the structure of the covariance matrix introduced below, which has been widely studied previously, such as in [4, 5].

Assumption 6. We assume that Σ ∈ G_q(s) for some 0 ≤ q < 1, where G_q(s) = {Σ = (σ_{ij})_{1≤i,j≤d} : max_i ∑_{j=1}^d |σ_{ij}|^q ≤ s} is the parameter space of s-approximately sparse covariance matrices.

It is notable that the sparse covariance assumption is commonly adopted in previous work on high dimensional estimation [60, 9]. To the best of our knowledge, the only work that considers private sparse covariance matrix estimation is [51]. Unfortunately, [51] only considers the case where q = 0, which is a special case of Assumption 6, and we cannot trivially apply their method to our setting. Also, as we will discuss in Remark 3, even if the assumption does not hold, all of our theoretical results still hold if the sample size is large enough.

Directly using the perturbed covariance matrix will be insufficient to exploit the sparsity structure. In fact, it can be readily seen that ‖Σ̃_{X̄X̄} - Σ‖_2 ≥ O(√d/(nε)), which is quite large under the high dimensional setting. To see why Σ̃_{X̄X̄} will introduce a large error under Assumption 6, we observe that some of its entries are quite large, which makes |σ̃_{ij} - σ_{ij}| large for some i, j. Thus, we need to develop a new private estimator for (approximately) sparse covariance matrices as a valuable by-product.
By Lemmas 24 and 11 in the Appendix, we can get that, with high probability, for all 1 ≤ i, j ≤ d, |σ̃_{ij} - σ_{ij}| ≤ O(√(log(1/δ)) log d/(nε)). However, under the sparsity assumption, there are two cases: (1) if σ_{ij} is small enough, then the zero entry could have a smaller error than σ̃_{ij}, since the noise is quite large; (2) if σ_{ij} is large, then the original σ̃_{ij} could be better. Motivated by this observation, we perform a hard-thresholding operation on each σ̃_{ij}. This method takes advantage of the sparsity assumption by first estimating the sample covariance matrix and then setting all entries with absolute values below a certain threshold Thres to 0. To be more specific, it filters out the σ̃_{ij} smaller than Thres and sets them to 0, while keeping those larger than Thres unchanged. This effectively shrinks the magnitude of the perturbed covariance matrix and thus lowers the error, since some small σ_{ij} correspond to large σ̃_{ij}. After the hard-thresholding operation, we denote the resulting matrix by Σ̄_{X̄X̄}; it is invertible with high probability, as shown below.

Algorithm 1 (ε, δ)-DP Algorithm for Sparse Linear Regression
1: Input: Private data {(x_i, y_i)}_{i=1}^n ⊆ R^d × R. Predefined parameters r, τ_x, τ_y, λ_n.
2: for each user i ∈ [n] do
3:   Clip x̄_i = x_i · min{1, r/‖x_i‖_2}, i.e., x̄_i = Π_r(x_i). Release x̄_i x̄_i^T to the server.
4:   Clip x̃_i := sgn(x_i) min{|x_i|, τ_x} and ỹ_i := sgn(y_i) min{|y_i|, τ_y} (element-wise). Release x̃_i ỹ_i to the server.
5: end for
6: The server aggregates Σ̂_{X̄X̄} = (1/n) ∑_{i=1}^n x̄_i x̄_i^T and Σ̂_{X̃Ỹ} = (1/n) ∑_{i=1}^n x̃_i ỹ_i.
7: Add noise Σ̃_{X̄X̄} = Σ̂_{X̄X̄} + N_1, where N_1 ∈ R^{d×d} is a symmetric matrix and each entry of its upper triangular part is sampled from N(0, 32 r⁴ log(2.5/δ)/(n²ε²)). Add noise Σ̃_{X̃Ỹ} = Σ̂_{X̃Ỹ} + N_2, where the vector N_2 ∈ R^d is sampled from N(0, 32 τ_x² τ_y² log(2.5/δ)/(n²ε²) · I_d).
8: Apply hard-thresholding to Σ̃_{X̄X̄} and obtain Σ̄_{X̄X̄}, where each entry is defined as σ̄_{ij} = σ̃_{ij} · I[|σ̃_{ij}| > Thres], with Thres = γ√(2 ln(1.25/δ)) log d/(nε) and γ some constant.
9: The server outputs θ̂^P(D) = S_{λ_n}([Σ̄_{X̄X̄}]^{-1} Σ̃_{X̃Ỹ}).

Lemma 1. The estimation error of the private estimator Σ̄_{X̄X̄} satisfies, with probability 1 - d^{-Ω(1)},
E‖Σ̄_{X̄X̄} - Σ‖_2 ≤ O( s² (r⁴ log d log(1/δ)/(n²ε²))^{1-q} + r⁴ log d log(1/δ)/(n²ε²) ),
where the expectation is taken over the randomness of the data records and the algorithm.

Remark 1. When q = 0 and r = O(1), Lemma 1 could achieve an error bound of O(s² log d/(nε²)), which matches the optimal rate of locally differentially private sparse covariance matrix estimation [51, 50].

Our private estimator is of the form θ̂^P(D) = S_{λ_n}(Σ̄_{X̄X̄}^{-1} Σ̃_{X̃Ỹ}). With appropriate r, τ_x, and τ_y, an upper bound of O(√k/(nε)) is achievable for θ̂^P(D). See Algorithm 1 for details. Notably, step 8 is the hard-thresholding step for the private covariance matrix estimator. Importantly, the threshold only depends on n, log d, and the privacy parameters, and is independent of the two sparsities s and k of our problem. Moreover, since we assume ‖θ*‖_2 ≤ τ_θ, we can also project the output θ̂^P(D) of Algorithm 1 onto an ℓ2-norm ball: θ̃^P(D) = Π_{τ_θ}(θ̂^P(D)), where Π_{τ_θ}(v) = argmin_{v' ∈ B(τ_θ)} ‖v - v'‖_2² and B(τ_θ) is the closed ℓ2-norm ball with radius τ_θ.

To sum up, high dimensionality gives rise to several consequences: (1) the regularization techniques used by [37, 14] are not applicable, (2) the invertibility of the covariance matrix estimate is not guaranteed, and (3) it also precludes the output perturbation method. Our proposed estimator properly overcomes the above challenges. To tackle (1) we embed the OLS estimator within the ℓ1 constraint to exploit the sparsity, along with a novel estimator of the covariance matrix to mitigate (2), and we privatize it by sufficient statistics perturbation to solve (3), which in turn makes our truthfulness analysis much more complicated.
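For concreteness, here is a minimal Python sketch of Algorithm 1 under the noise calibration stated above. It assumes the hard-thresholded covariance matrix is invertible (which the analysis guarantees with high probability for sufficiently large n), and the constant γ and all clipping and regularization parameters are left as user-supplied inputs; the function name is illustrative.

```python
import numpy as np

def dp_sparse_linear_regression(X, y, eps, delta, r, tau_x, tau_y, lam_n, gamma=1.0, seed=0):
    """Sketch of Algorithm 1: clip, perturb sufficient statistics, hard-threshold, soft-threshold."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Steps 3-4: per-user clipping.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X_bar = X * np.minimum(1.0, r / np.maximum(norms, 1e-12))      # l2 clipping Pi_r(x_i)
    X_tilde = np.sign(X) * np.minimum(np.abs(X), tau_x)            # coordinate clipping by tau_x
    y_tilde = np.sign(y) * np.minimum(np.abs(y), tau_y)            # coordinate clipping by tau_y
    # Step 6: aggregate the clipped sufficient statistics.
    sigma_xx = X_bar.T @ X_bar / n
    sigma_xy = X_tilde.T @ y_tilde / n
    # Step 7: Gaussian noise calibrated to the clipped sensitivities.
    std_xx = np.sqrt(32.0 * r**4 * np.log(2.5 / delta)) / (n * eps)
    noise = rng.normal(0.0, std_xx, size=(d, d))
    noise = np.triu(noise) + np.triu(noise, 1).T                   # symmetric noise matrix N_1
    sigma_xx_noisy = sigma_xx + noise
    std_xy = np.sqrt(32.0 * tau_x**2 * tau_y**2 * np.log(2.5 / delta)) / (n * eps)
    sigma_xy_noisy = sigma_xy + rng.normal(0.0, std_xy, size=d)    # noise vector N_2
    # Step 8: entry-wise hard thresholding of the noisy covariance.
    thres = gamma * np.sqrt(2.0 * np.log(1.25 / delta)) * np.log(d) / (n * eps)
    sigma_xx_hard = sigma_xx_noisy * (np.abs(sigma_xx_noisy) > thres)
    # Step 9: closed-form private estimator via soft thresholding.
    # The thresholded matrix is assumed invertible here (holds w.h.p. for large n).
    ols_dir = np.linalg.solve(sigma_xx_hard, sigma_xy_noisy)
    return np.sign(ols_dir) * np.maximum(np.abs(ols_dir) - lam_n, 0.0)
```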
3.2 Payment Rule

We now turn our attention to the payment rule. The analyst wants to pay agents according to the veracity of the data they have provided and needs a reference against which to compare each data item. As mentioned before, the response is unverifiable; hence we lack a ground-truth reference to validate the accuracy of the reported values. To circumvent this problem, we adopt the peer prediction method to determine a player's payment. In principle, a higher payment indicates higher consistency between the agent's reported value ŷ_i and the predicted value of y_i estimated using the collective input from their peers D̂_{-i}. To quantitatively measure the similarity between each agent's report and their peers' reports, we use the rescaled Brier score rule [21]. Let a_1 and a_2 be positive parameters to be specified. Consider that q represents the prediction of agent i's response based on her own reports, and p represents the prediction of agent i's response based on her feature vector and her peers' reports. The analyst uses the following payment rule:

B_{a_1,a_2}(p, q) = a_1 - a_2(p - 2pq + q²).    (3)

Algorithm 2 General Framework for Truthful and Private High Dimensional Linear Regression
1: Ask all agents to report their data D̂_1, ..., D̂_n;
2: Randomly partition the agents into two groups, with respective datasets D̂^0, D̂^1;
3: For each of the datasets D̂, D̂^0, and D̂^1, compute the private estimators θ̂^P(D̂) and θ̂^P(D̂^b) for b = 0, 1 according to Algorithm 1;
4: Compute the estimators θ̃^P(D̂) = Π_{τ_θ}(θ̂^P(D̂)) and θ̃^P(D̂^b) = Π_{τ_θ}(θ̂^P(D̂^b)) for b = 0, 1;
5: Set parameters a_1, a_2, and compute the payment to each agent i: if agent i is in group 1 - b, then she receives payment π_i = B_{a_1,a_2}(⟨x_i, θ̃^P(D̂^b)⟩, ⟨x_i, E_{θ ∼ p(θ|D̂_i)}[θ]⟩).

Observing that B_{a_1,a_2}(p, q) is a function that exhibits strict concavity with respect to q, we point out that its maximum is attained when q equals p. This implies a congruence between the prediction of agent i's response based on her own information and that derived from her peers' information. For agent i, given E[y_i | x_i, θ*] = ⟨x_i, θ*⟩, it is logical to set p = ⟨x_i, θ̃^P(D̂^b)⟩ and q = ⟨x_i, E_{θ ∼ p(θ|D̂_i)}[θ]⟩. Here, θ̃^P(D̂^b) denotes the private estimator for a dataset D̂^b excluding D̂_i, and p(θ|D̂_i) represents the posterior distribution of θ after the analyst's receipt of D̂_i.

Building upon this analysis, we structure our Algorithm 2. This algorithm utilizes reported data, which may contain manipulated responses, to generate estimators. To maintain data independence, we divide the dataset into two subsets, D̂^0 and D̂^1. For the purpose of calculating the payment for each agent i in group b ∈ {0, 1}, the estimator is derived using D̂^{1-b}. Finally, the algorithm applies this estimator in combination with the specific agent's feature vector x_i to forecast their response.
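The payment computation of Algorithm 2 can be sketched as follows. The rescaled Brier score (3) is implemented directly; the posterior mean E_{θ∼p(θ|D̂_i)}[θ] depends on a prior over θ and is therefore passed in as a precomputed argument rather than derived here. Function and variable names are illustrative assumptions.

```python
import numpy as np

def brier_payment(p, q, a1, a2):
    """Rescaled Brier score rule B_{a1,a2}(p, q) = a1 - a2 * (p - 2*p*q + q**2) from Eq. (3)."""
    return a1 - a2 * (p - 2.0 * p * q + q * q)

def compute_payments(X, theta_group0, theta_group1, posterior_means, group_of, a1, a2):
    """Pay each agent by comparing the peer prediction with the agent's own posterior prediction.

    theta_group0 / theta_group1 are the projected private estimators computed from groups 0 and 1;
    posterior_means[i] stands in for E_{theta ~ p(theta | D_hat_i)}[theta], which in general
    requires a prior over theta and is left abstract here.
    """
    n = X.shape[0]
    payments = np.zeros(n)
    for i in range(n):
        # An agent in group 1 - b is scored against the estimator built from group b.
        theta_peer = theta_group0 if group_of[i] == 1 else theta_group1
        p = X[i] @ theta_peer                 # prediction from peers' reports
        q = X[i] @ posterior_means[i]         # prediction from agent i's own report
        payments[i] = brier_payment(p, q, a1, a2)
    return payments
```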
4 Theoretical Results and Implementation

4.1 Accuracy and Privacy Analysis

Theorem 2 (Privacy). The output of Algorithm 2 satisfies (2ε, 3δ)-JDP.

Remark 2. By the selection of ε and δ, our mechanism can achieve (o(1/n), O(n^{-Ω(1)}))-JDP, which provides asymptotically the same good privacy guarantee as in [14] for the low dimensional setting with bounded covariates and responses. Specifically, [14] shows that it is possible to design an o(1/n)-JDP mechanism, while here we consider approximate JDP due to the Gaussian noise we add. [37] considers truthful (low dimensional) linear regression with sub-Gaussian/heavy-tailed covariates and responses; our Theorem 2 is stronger than theirs. In detail, [37] can only guarantee Random Joint Differential Privacy (RJDP), where pure ε-JDP holds on all but a small proportion of unlikely dataset pairs, while in this paper we can guarantee approximate JDP, which is more widely used in the DP literature. The main reason is that [37] used the output perturbation-based method to ensure DP. However, as the data distribution is sub-Gaussian rather than bounded as in [14], the sensitivity of the closed-form linear regression estimator could be extremely large (with some probability). Thus, the output perturbation-based method can fail with a small probability and can only ensure RJDP. Instead, in our method we use sufficient statistics perturbation; due to our clipping operators, the sensitivities of Σ̂_{X̄X̄} and Σ̂_{X̃Ỹ} are always bounded. One consequence of adopting our sufficient statistics perturbation method is that the utility analysis is harder than that of the output perturbation-based method, as we discuss in the following.

Theorem 3 (Accuracy). Fix a privacy parameter ε, a participation goal 1 - α, and a desired confidence parameter β in Definition 6. Then, under the symmetric threshold strategy σ_{τ_{α,β}}, when n is sufficiently large such that n ≥ Ω(s²r⁴ log d log(1/δ)/(ε²κ_∞)), the output θ̃^P(D̂) of Algorithm 2 satisfies, with probability at least 1 - β - O(n^{-Ω(1)}),
E[‖θ̃^P(D̂) - θ*‖_2²] ≤ O( α²n + k r⁴ log d log(1/δ)/(n ε²) ).

Remark 3. The above bound is independent of Poly(d). This outcome is primarily due to the use of the sufficient statistics perturbation method, which effectively reduces the size of the noise. It is notable that Assumption 6 only ensures that when n ≥ Ω(s²r⁴/(ε²κ_∞)) our private covariance estimator is invertible. Thus, even if the assumption does not hold, we can still get the same upper bound as in Theorem 3 as long as n ≥ Ω(d).

Our framework for analyzing the accuracy differs dramatically from that of the prior ones, since our privatization mechanism requires a finer and more delicate analysis of some specific error terms. The works of [37] and [14] both used the output perturbation method, as it provides an explicit characterization of the noise. Specifically, their estimator can be represented as θ̃^P(D̂) = Π_{τ_θ}(θ̂^P(D̂)), where θ̂^P(D̂) = θ̂(D̂) + v, with θ̂(D̂) a (non-private) closed-form estimator and v the added Gamma noise of scale O(√(d/n)). It can be shown that:

E[‖θ̃^P(D̂) - θ*‖_2²] ≤ E‖θ̂^P(D̂) - θ*‖_2² ≤ 2E‖θ̂^P(D̂) - θ̂(D)‖_2² + 2E‖θ̂(D) - θ*‖_2² = 2(E‖θ̂(D̂) - θ̂(D)‖_2² + E‖v‖_2² + E‖θ̂(D) - θ*‖_2²),    (4)

where D is the original data and D̂ is the reported data. Note that the second and the third terms in (4) are easy to upper bound. For the first term, the number of manipulated data points in D̂ is bounded by αn with high probability, indicating there are at most αn different samples between D̂ and D. Thus, it can be bounded by directly re-using the sensitivity analysis of θ̂(D) as a part of the privacy analysis. However, as we mentioned previously, the sensitivity of (2) is of scale O(√(d/n)), indicating the previous method cannot be applied in our setting. Hence our tactic is to privatize the terms Σ̂_{X̄X̄} and Σ̂_{X̃Ỹ} separately.
We shall take another route to estimate the error of the private estimator θ̃^P(D̂):
E[‖θ̃^P(D̂) - θ*‖_2²] ≤ E‖θ̂^P(D̂) - θ*‖_2² ≤ 2E‖θ̂^P(D̂) - θ̂^P(D)‖_2² + 2E‖θ̂^P(D) - θ*‖_2².
The above framework now estimates the difference in the private estimator θ̂^P (instead of the non-private θ̂). However, this nuance inevitably leads to a more complex analysis, i.e., the term E‖θ̂^P(D̂) - θ̂^P(D)‖_2² is much more complicated to deal with than the first term in (4). The bound on E‖θ̂^P(D̂) - θ̂^P(D)‖_2² cannot be obtained by simply applying existing results on covariance matrix estimation as in [37] or the assumption of strong convexity of the loss function as in [14]. Specifically, we tackle this issue by combining the analysis for s-approximately sparse covariance matrices with some technical tools, such as the decomposability of the ℓ1-norm term in (1), to give a non-trivial bound. The relevant lemma is given below and, as mentioned, this bound comprises a key component of the proof of Theorem 3.

Lemma 4. Let θ̂^P(D̂) and θ̂^P(D) be the private estimators on the original dataset D and the reported dataset D̂ in which at most one agent misreports. We set the constraint bound λ_n = O(r² √(log d log(1/δ)/n)/ε). Then, when n is sufficiently large such that n ≥ Ω(s²r⁴ log d log(1/δ)/(ε²κ_∞)), with probability at least 1 - O(d^{-Ω(1)}) we have the following error bounds:
‖θ̂^P(D̂) - θ̂^P(D)‖_∞ ≤ 4λ_n,  ‖θ̂^P(D̂) - θ̂^P(D)‖_2 ≤ 16√k λ_n.

4.2 Analysis for the Properties of the Truthful Mechanism

Theorem 5 (Truthfulness). Fix a privacy parameter ε, a participation goal 1 - α, and a desired confidence parameter β in Definition 6. Then, with probability at least 1 - β - O(n^{-Ω(1)}) - nd^{-9/2}, the symmetric threshold strategy σ_{τ_{α,β}} is an η-approximate Bayesian Nash equilibrium in Algorithm 2 with
η = O( a_2 r⁴ log d log(1/δ)/(n ε²) + τ_{α,β} δ ε³ ).
We can see that with some specified parameters the above bound can be sufficiently small. For example, when we set α = O(1/n), we have η → 0 as n → ∞, since δ = o(1/n).

Theorem 6 (Individual rationality). With probability at least 1 - β - O(n^{-Ω(1)}) - nd^{-9/2}, Algorithm 2 is individually rational for all agents with cost coefficients c_i ≤ τ_{α,β} as long as
a_1 ≥ a_2(rτ_θ + 3r²τ_θ²) + 8τ_{α,β}(1 + 3δ)ε³,
regardless of the reports from agents with cost coefficients above τ_{α,β}, where a_1, a_2 are as in (6).

Remark 4. Our mechanism may not be individually rational for all agents, since the privacy cost coefficients follow an unbounded distribution with exponential decay. This assumption is reasonable because for a relatively small subset of agents the privacy costs could be exceptionally high, indicating that certain individuals may persistently refrain from reporting truthfully, regardless of the compensation offered to them. The same situation happens in other truthful mechanisms as well.

Theorem 7 (Budget). With probability at least 1 - β - O(n^{-Ω(1)}) - nd^{-9/2}, the total expected budget B = ∑_{i=1}^n E(π_i) required by the analyst to run Algorithm 2 under the threshold equilibrium strategy σ_{τ_{α,β}} satisfies B ≤ n(a_1 + a_2(rτ_θ + r²τ_θ²)).

Remark 5. With some appropriate setting of α, it is reasonable to expect the overall expected budget to diminish toward zero as the sample size increases. This is because if the sample size were infinite, the ground-truth parameters could always be recovered by our estimator, which would allow the analyst to pay nothing to incentivize the agents.

4.3 Formal Statement of Main Result

Corollary 8. For any ξ ∈ (1/3, 1/2) and c > 0, we set ε = n^{-ξ}, δ = Θ(n^{-Ω(1)}), α = Θ(n^{-3ξ}), β = Θ(n^{-c}), a_2 = O(n^{-3ξ}), and a_1 = a_2(rτ_θ + 3r²τ_θ²) + 8τ_{α,β}(1 + 3δ)ε³. Then the output of Algorithm 2 satisfies (O(n^{-ξ}), O(n^{-Ω(1)}))-JDP.
Moreover, with probability at least 1 - O(n^{-Ω(1)}), it holds that:
1. The symmetric threshold strategy σ_{τ_{α,β}} is an Õ(n^{-ξ-1})-Bayesian Nash equilibrium for a 1 - O(n^{-3ξ}) fraction of agents to truthfully report their data;
2. The private estimator θ̃^P(D̂) is Õ(n^{2ξ-1})-accurate;
3. It is individually rational for a 1 - O(n^{-3ξ}) fraction of agents to take part in the mechanism;
4. The total expected budget required by the analyst is Õ(n^{1-3ξ}).

Remark 6. It is important to highlight the subtle balance between the precision of the estimator and the other attributes of the mechanism. Compromising on accuracy often results in several consequences: (1) a reduction in the total compensation paid to the agents, (2) an increase in the proportion of rational agents, and (3) a closer approximation to the Bayesian Nash equilibrium. However, we do not observe such a trade-off in [37]'s implementation for the linear regression case. Besides, our results slightly differ from those of [14] in the sense that they do not have a trade-off between rationality and accuracy. Such discrepancies may be caused by varying willingness to supply higher compensation to the agents, resulting in different settings of parameters.

5 Conclusions

In this paper, we fit the high dimensional sparse linear regression model privately, and we propose a truthful mechanism to incentivize the strategic users. On the one hand, the mechanism is (o(1), O(n^{-Ω(1)}))-jointly differentially private. This is achieved by using the sufficient statistics perturbation method, which adds much less noise than the output perturbation method employed by prior work. Leveraging the idea of soft-thresholding, we propose a private estimator of θ* that exploits the sparsity of the model, and it achieves an error of o(1). On the other hand, via some computation on the consistency between the agent's reported value and the predicted value using peers' reports, we design an appropriate payment scheme for each agent using the rescaled Brier score rule. This method ensures that our mechanism reaches an o(1/n)-approximate Bayes Nash equilibrium for a (1 - o(1))-fraction of agents to truthfully report their data, while a (1 - o(1))-fraction of agents receive non-negative utilities, thus establishing their individual rationality. Moreover, the total payment budget required from the analyst to run the mechanism is o(1).

Acknowledgments

Di Wang and Liyang Zhu were supported in part by the funding BAS/1/1689-01-01, URF/1/466301-01, REI/1/5232-01-01, REI/1/5332-01-01, and URF/1/5508-01-01 from KAUST, and funding from KAUST - Center of Excellence for Generative AI, under award number 5940. Jinyan Liu is supported by the National Natural Science Foundation of China (NSFC Grant No. 62102026).

References

[1] Raef Bassily, Cristóbal Guzmán, and Michael Menart. Differentially private stochastic optimization: New results in convex and non-convex settings. Advances in Neural Information Processing Systems, 34:9317-9329, 2021.
[2] Raffael Bild, Klaus A Kuhn, and Fabian Prasser. Safepub: A truthful data anonymization algorithm with strong privacy guarantees. Proc. Priv. Enhancing Technol., 2018(1):67-87, 2018.
[3] T. Tony Cai, Yichen Wang, and Linjun Zhang. The cost of privacy: Optimal rates of convergence for parameter estimation with differential privacy. The Annals of Statistics, 49(5):2825-2850, October 2021. Publisher: Institute of Mathematical Statistics.
[4] T. Tony Cai and Harrison H. Zhou. Optimal rates of convergence for sparse covariance matrix estimation. The Annals of Statistics, 40(5), October 2012.
arXiv:1302.3030 [math, stat].
[5] Tony Cai and Weidong Liu. Adaptive thresholding for sparse covariance matrix estimation. Journal of the American Statistical Association, 106(494):672-684, 2011.
[6] Yang Cai, Constantinos Daskalakis, and Christos Papadimitriou. Optimum statistical estimation with strategic data sources. In Conference on Learning Theory, pages 280-296. PMLR, 2015.
[7] Robert J Casson and Lachlan DM Farmer. Understanding and checking the assumptions of linear regression: a primer for medical researchers. Clinical & Experimental Ophthalmology, 42(6):590-596, 2014.
[8] Kamalika Chaudhuri, Claire Monteleoni, and Anand D Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(3), 2011.
[9] Junren Chen, Cheng-Long Wang, Michael K Ng, and Di Wang. High dimensional statistical estimation under uniformly dithered one-bit quantization. IEEE Transactions on Information Theory, 2023.
[10] Yiling Chen, Stephen Chong, Ian A Kash, Tal Moran, and Salil Vadhan. Truthful mechanisms for agents that value privacy. ACM Transactions on Economics and Computation (TEAC), 4(3):1-30, 2016.
[11] Yiling Chen, Chara Podimata, Ariel D Procaccia, and Nisarg Shah. Strategyproof linear regression in high dimensions. In Proceedings of the 2018 ACM Conference on Economics and Computation, pages 9-26, 2018.
[12] Yiling Chen, Yiheng Shen, and Shuran Zheng. Truthful data acquisition via peer prediction. Advances in Neural Information Processing Systems, 33:18194-18204, 2020.
[13] Rachel Cummings, Hadi Elzayn, Emmanouil Pountourakis, Vasilis Gkatzelis, and Juba Ziani. Optimal data acquisition with privacy-aware agents. In 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pages 210-224. IEEE, 2023.
[14] Rachel Cummings, Stratis Ioannidis, and Katrina Ligett. Truthful linear regression. In Conference on Learning Theory, pages 448-483. PMLR, 2015.
[15] Rachel Cummings, Katrina Ligett, Aaron Roth, Zhiwei Steven Wu, and Juba Ziani. Accuracy for sale: Aggregating data with a variance constraint. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, pages 317-324, 2015.
[16] Ofer Dekel, Felix Fischer, and Ariel D Procaccia. Incentive compatible regression learning. Journal of Computer and System Sciences, 76(8):759-777, 2010.
[17] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265-284. Springer, 2006.
[18] Alireza Fallah, Ali Makhdoumi, Azarakhsh Malekian, and Asuman Ozdaglar. Optimal and differentially private data acquisition: Central and local mechanisms. arXiv preprint arXiv:2201.03968, 2022.
[19] Alireza Fallah, Ali Makhdoumi, Asuman Ozdaglar, et al. Bridging central and local differential privacy in data acquisition mechanisms. Advances in Neural Information Processing Systems, 35:21628-21639, 2022.
[20] Lisa K Fleischer and Yu-Han Lyu. Approximately optimal auctions for selling privacy when costs are correlated with data. In Proceedings of the 13th ACM Conference on Electronic Commerce, pages 568-585, 2012.
[21] Arpita Ghosh, Katrina Ligett, Aaron Roth, and Grant Schoenebeck. Buying private data without verification. In Proceedings of the Fifteenth ACM Conference on Economics and Computation, pages 931-948, 2014.
[22] Arpita Ghosh and Aaron Roth. Selling privacy at auction. In Proceedings of the 12th ACM Conference on Electronic Commerce, pages 199-208, 2011.
[23] Katherine Godfrey.
Simple linear regression in medical research. New England Journal of Medicine, 313(26):1629-1636, 1985.
[24] Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical learning with sparsity: the lasso and generalizations. CRC Press, 2015.
[25] Thibaut Horel, Stratis Ioannidis, and S Muthukrishnan. Budget feasible mechanisms for experimental design. In Latin American Symposium on Theoretical Informatics, pages 719-730. Springer, 2014.
[26] Justin Hsu, Zhiyi Huang, Aaron Roth, Tim Roughgarden, and Zhiwei Steven Wu. Private matchings and allocations. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, pages 21-30, 2014.
[27] Lijie Hu, Ivan Habernal, Lei Shen, and Di Wang. Differentially private natural language models: Recent advances and future directions. In Findings of the Association for Computational Linguistics: EACL 2024, pages 478-499, 2024.
[28] Lijie Hu, Shuo Ni, Hanshen Xiao, and Di Wang. High dimensional differentially private stochastic optimization with heavy-tailed data. In Proceedings of the 41st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 227-236, 2022.
[29] Stratis Ioannidis and Patrick Loiseau. Linear regression as a non-cooperative game. In Web and Internet Economics: 9th International Conference, WINE 2013, Cambridge, MA, USA, December 11-14, 2013, Proceedings 9, pages 277-290. Springer, 2013.
[30] Prateek Jain, Ambuj Tewari, and Purushottam Kar. On iterative hard thresholding methods for high-dimensional m-estimation. Advances in Neural Information Processing Systems, 27, 2014.
[31] Michael Kearns, Mallesh Pai, Aaron Roth, and Jonathan Ullman. Mechanism design in large games: Incentives and privacy. In Proceedings of the 5th Conference on Innovations in Theoretical Computer Science, pages 403-410, 2014.
[32] Katrina Ligett and Aaron Roth. Take it or leave it: Running a survey when privacy comes at a cost. In International Workshop on Internet and Network Economics, pages 378-391. Springer, 2012.
[33] Frank D McSherry. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, pages 19-30, 2009.
[34] Kobbi Nissim, Claudio Orlandi, and Rann Smorodinsky. Privacy-aware mechanism design. In Proceedings of the 13th ACM Conference on Electronic Commerce, pages 774-789, 2012.
[35] Javier Perote Peña and Juan Perote Peña. Strategy-proof estimators for simple regression. Documento de trabajo, page 14, 2003.
[36] Dudley L Poston, Dudley L Poston Jr, Eugenia Conde, and Layton M Field. Applied Regression Models in the Social Sciences. Cambridge University Press, 2023.
[37] Yuan Qiu, Jinyan Liu, and Di Wang. Truthful Generalized Linear Models, September 2022. arXiv:2209.07815 [cs].
[38] Yuan Qiu, Jinyan Liu, and Di Wang. Truthful and privacy-preserving generalized linear models. Information and Computation, 301:105225, 2024.
[39] Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Minimax rates of estimation for high-dimensional linear regression over ℓq-balls. IEEE Transactions on Information Theory, 57(10):6976-6994, 2011.
[40] Adam Smith, Abhradeep Thakurta, and Jalaj Upadhyay. Is Interaction Necessary for Distributed Private Learning? In 2017 IEEE Symposium on Security and Privacy (SP), pages 58-77, San Jose, CA, USA, May 2017. IEEE.
[41] Shuang Song, Thomas Steinke, Om Thakkar, and Abhradeep Thakurta. Evading the curse of dimensionality in unconstrained private glms.
In International Conference on Artificial Intelligence and Statistics, pages 2638-2646. PMLR, 2021.
[42] Gilbert W Stewart and Ji-guang Sun. Matrix perturbation theory. Academic Press, 1990.
[43] Jinyan Su, Lijie Hu, and Di Wang. Faster rates of differentially private stochastic convex optimization. Journal of Machine Learning Research, 25(114):1-41, 2024.
[44] Jinyan Su, Changhong Zhao, and Di Wang. Differentially private stochastic convex optimization in (non)-euclidean space revisited. In Uncertainty in Artificial Intelligence, pages 2026-2035. PMLR, 2023.
[45] Youming Tao, Yulian Wu, Xiuzhen Cheng, and Di Wang. Private stochastic convex optimization and sparse learning with heavy-tailed data revisited. In IJCAI, pages 3947-3953, 2022.
[46] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
[47] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices, November 2011. arXiv:1011.3027 [cs, math].
[48] Di Wang, Lijie Hu, Huanyu Zhang, Marco Gaboardi, and Jinhui Xu. Generalized linear models in non-interactive local differential privacy with public data. Journal of Machine Learning Research, 24(132):1-57, 2023.
[49] Di Wang and Jinhui Xu. On sparse linear regression in the local differential privacy model. In International Conference on Machine Learning, pages 6628-6637. PMLR, 2019.
[50] Di Wang and Jinhui Xu. Tight lower bound of sparse covariance matrix estimation in the local differential privacy model. Theoretical Computer Science, 815:47-59, 2020.
[51] Di Wang and Jinhui Xu. Differentially private high dimensional sparse covariance matrix estimation. Theoretical Computer Science, 865:119-130, 2021.
[52] Di Wang and Jinhui Xu. On Sparse Linear Regression in the Local Differential Privacy Model. IEEE Transactions on Information Theory, 67(2):1182-1200, February 2021.
[53] Di Wang and Jinhui Xu. Differentially private ℓ1-norm linear regression with heavy-tailed data. In 2022 IEEE International Symposium on Information Theory (ISIT), pages 1856-1861. IEEE, 2022.
[54] Di Wang and Jinhui Xu. Gradient complexity and non-stationary views of differentially private empirical risk minimization. Theoretical Computer Science, 982:114259, 2024.
[55] Di Wang, Minwei Ye, and Jinhui Xu. Differentially private empirical risk minimization revisited: Faster and more general. Advances in Neural Information Processing Systems, 30, 2017.
[56] Peter Whittle. Bounds for the moments of linear and quadratic forms in independent variables. Theory of Probability & Its Applications, 5(3):302-305, 1960.
[57] David Xiao. Is privacy compatible with truthfulness? In Proceedings of the 4th Conference on Innovations in Theoretical Computer Science, pages 67-86, 2013.
[58] Hanshen Xiao, Zihang Xiang, Di Wang, and Srinivas Devadas. A theory to instruct differentially-private learning via clipping bias reduction. In 2023 IEEE Symposium on Security and Privacy (SP), pages 2170-2189. IEEE, 2023.
[59] Eunho Yang, Aurelie Lozano, and Pradeep Ravikumar. Elementary estimators for high-dimensional linear regression. In International Conference on Machine Learning, pages 388-396. PMLR, 2014.
[60] Eunho Yang, Aurélie C Lozano, and Pradeep K Ravikumar. Closed-form estimators for high-dimensional generalized linear models. Advances in Neural Information Processing Systems, 28, 2015.
[61] Liyang Zhu, Meng Ding, Vaneet Aggarwal, Jinhui Xu, and Di Wang. Improved analysis of sparse linear regression in local differential privacy model.
arXiv preprint arXiv:2310.07367, 2023.

A Notations and Omitted Relevant Definitions

Notation: Description
n: The number of agents
d: The dimensionality
x_i ∈ X ⊆ R^d: The feature vector of agent i
y_i ∈ Y ⊆ R: The response variable of agent i
Σ_{XX} or Σ: The covariance matrix of x_i
‖X‖_p: p-norm of matrix X
I[A]: Indicator of event A
sgn(x): Sign function of a real number x
[S_λ(u)]_i: Element-wise soft-thresholding operator
N_1 ∈ R^{d×d}: Gaussian noise added to Σ̂_{X̄X̄}
N_2 ∈ R^d: Gaussian noise added to Σ̂_{X̃Ỹ}
γ: Some constant associated with the variance of N_2
ỹ_i: Shrunken version of y_i via parameter τ_y
Σ̂_{X̄X̄} (Σ̂_{X̃Ỹ}): The sample covariance of the clipped X (the sample covariance between the clipped X and the clipped Y)
Σ̃_{X̄X̄} (Σ̃_{X̃Ỹ}): A noisy version of the clipped Σ̂_{X̄X̄} (Σ̂_{X̃Ỹ})
c_i ∼ C ⊆ R_+: The privacy cost coefficient of agent i
D = X × Y: The feature-response data universe
D_i = (x_i, y_i): The feature-response data pair of agent i
σ_i(·, ·) : D × C → Y: The reporting strategy of agent i
ŷ_i = σ_i(D_i, c_i): The reported response of agent i
D̂_i = (x_i, ŷ_i): The reported feature-response data pair of agent i
D = {D_i}_{i=1}^n: All feature-response data pairs
D̂ = {D̂_i}_{i=1}^n: All reported feature-response data pairs
σ = {σ_i}_{i=1}^n: The collection of all agents' reporting strategies
D_{-i} = D \ D_i: The collection of data pairs except D_i
D̂_{-i} = D̂ \ D̂_i: The collection of reported data pairs except D̂_i
σ_{-i} = σ \ σ_i: The collection of strategies except σ_i
c_{-i} = {c_j}_{j=1}^n \ c_i: The collection of privacy cost coefficients except c_i
π_i: The payment to agent i
u_i: The utility of agent i
ε, δ ∈ R_+: The privacy parameters of the mechanism
θ* ∈ R^d: The model parameter to be estimated
σ_τ: The threshold strategy with privacy cost threshold τ ∈ R_+
τ_{α,β}: The privacy cost threshold characterized by α, β ∈ (0, 1)
B: The expected total budget
θ̂(D): The non-private estimator on data D
θ̂^P(D̂): The private estimator on the reported data D̂
θ̃^P(D̂): The private estimator on the reported data D̂ after projection
κ_2, κ_∞: The lower bounds for the operator norm and infinity norm of matrix Σ, respectively
η: The level of truthfulness
ξ ∈ (1/3, 1/2): The parameter to control the trade-off between accuracy and truthfulness, payment amount, and individual rationality

A.1 Definitions of Differential Privacy

Definition 1 (Differential Privacy [17]). Given a data universe X, we say that two datasets D, D' ⊆ X are neighbors if they differ by only one entry, which is denoted as D ∼ D'. A randomized algorithm A is (ε, δ)-differentially private (DP) if for all neighboring datasets D, D' and for all events S in the output space of A, we have P(A(D) ∈ S) ≤ e^ε P(A(D') ∈ S) + δ.

Definition 2 (Gaussian Mechanism). Given any function q : X^n → R^p, the Gaussian mechanism is defined as M_G(D, q, ε) = q(D) + Y, where Y is drawn from the Gaussian distribution N(0, σ²I_p) with σ ≥ √(2 ln(1.25/δ)) Δ_2(q)/ε. Here Δ_2(q) is the ℓ2-sensitivity of the function q, i.e., Δ_2(q) = sup_{D ∼ D'} ‖q(D) - q(D')‖_2. The Gaussian mechanism preserves (ε, δ)-differential privacy.
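A minimal sketch of the Gaussian mechanism of Definition 2 follows, using the noise scale σ = √(2 ln(1.25/δ)) Δ_2(q)/ε. The mean-estimation example at the end, including its sensitivity value, is an illustrative assumption and not taken from the analysis above.

```python
import numpy as np

def gaussian_mechanism(value, l2_sensitivity, eps, delta, rng=None):
    """Release value + N(0, sigma^2 I) with sigma = sqrt(2 ln(1.25/delta)) * Delta_2 / eps (Definition 2)."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) * l2_sensitivity / eps
    value = np.asarray(value, dtype=float)
    return value + rng.normal(0.0, sigma, size=value.shape)

# Example: privatizing the empirical mean of values clipped to [0, 1]; the per-record
# l2-sensitivity of the mean of n such values is 1/n.
data = np.clip(np.random.default_rng(0).normal(0.5, 0.2, size=1000), 0.0, 1.0)
print(gaussian_mechanism(data.mean(), l2_sensitivity=1.0 / data.size, eps=0.5, delta=1e-5))
```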
Definition 3 (Joint Differential Privacy [31]). Consider a randomized mechanism M : D^n → Θ × Π^n with arbitrary response sets Θ, Π^n. For each i ∈ [n], we denote by M(·)_{-i} = (θ, π_{-i}) ∈ Θ × Π^{n-1} the portion of the mechanism's output that is observable by external observers and agents j ≠ i. A mechanism M preserves (ε, δ)-joint differential privacy (JDP), at privacy level ε > 0 and confidence level δ ∈ (0, 1), if for every agent i, every dataset D ∈ D^n, and every D_i, D'_i ∈ D, we have, for all S ⊆ Θ × Π^{n-1},
P(M(D_i, D_{-i})_{-i} ∈ S | (D_i, D_{-i})) ≤ e^ε P(M(D'_i, D_{-i})_{-i} ∈ S | (D'_i, D_{-i})) + δ,
where D_{-i} ∈ D^{n-1} is the dataset D that excludes the i-th sample in D, and π_{-i} is the vector that contains all payments except the payment of agent i.

An (ε, δ)-JDP mechanism on any dataset (including likely ones) may leak sensitive information on low probability responses, forgiven by the additive δ relaxation, which we hope to have a magnitude of o(1/n). Definition 3 assumes that the private estimator θ̃ produced by the mechanism M is publicly observable, and that the payment π_i can only be seen by agent i. Thus, from the perspective of each agent i, the released public output (θ̃, π_{-i}) may compromise their privacy.

A.2 Truthful Mechanism Properties

Consider M, the regression mechanism that takes as input the reported data D̂ = {D̂_i = (x_i, ŷ_i)}_{i=1}^n from n agents, with ŷ_i = σ_i(D_i) the reporting strategy adopted by agent i. We denote the set of all agents' strategies by the strategy profile σ = (σ_1, ..., σ_n), and σ_{-i} = (σ_1, ..., σ_{i-1}, σ_{i+1}, ..., σ_n) denotes the set of strategies excluding the reporting strategy of agent i. Let f_i(c_i, ε, δ) be the privacy cost function of agent i, with c_i her cost parameter. Each agent i receives a real-valued payment π_i and finally receives utility u_i = π_i - f_i(c_i, ε, δ). Based on these settings, we quantify property (1) by introducing the Bayesian Nash equilibrium.

Definition 4 (η-Bayesian Nash Equilibrium). A strategy profile σ = (σ_1, ..., σ_n) forms an η-Bayesian Nash equilibrium if for every agent i, D_i and c_i, and for every reporting strategy σ'_i ≠ σ_i, one has:
E_{D_{-i}, c_{-i} ∼ p(D_{-i}, c_{-i}|D_i, c_i)}[u_i(σ_i(D_i, c_i), σ_{-i}(D_{-i}, c_{-i}))] ≥ E_{D_{-i}, c_{-i} ∼ p(D_{-i}, c_{-i}|D_i, c_i)}[u_i(σ'_i(D_i, c_i), σ_{-i}(D_{-i}, c_{-i}))] - η.

The positive value η represents the maximum additional expected payment an agent might receive by altering their reporting strategy. To encourage agents to report their data truthfully, the payment rule should aim to keep η as small as possible. This paper investigates a specific threshold strategy. We will establish that if all agents consistently employ the threshold strategy with a common positive value τ, this unified strategy profile will effectively realize an η-Bayesian Nash equilibrium.

Definition 5 (Threshold Strategy). The threshold strategy σ_τ is defined as follows: ŷ_i = σ_τ(x_i, y_i, c_i) = y_i if c_i ≤ τ, and an arbitrary value in Y if c_i > τ.

Definition 6 (Threshold τ_{α,β}). Fix a probability density function p(c) of the privacy cost parameter, and let
τ¹_{α,β} = inf{τ > 0 : P_{(c_1, ..., c_n) ∼ p^n}(#{i : c_i ≤ τ} ≥ (1 - α)n) ≥ 1 - β},
τ²_α = inf{τ > 0 : inf_{D_i} P_{c_j ∼ p(c|D_i)}(c_j ≤ τ) ≥ 1 - α}.
Define τ_{α,β} as the larger of these two thresholds: τ_{α,β} = max{τ¹_{α,β}, τ²_α}.

τ¹_{α,β} is a threshold such that, with probability at least 1 - β, at least a 1 - α fraction of agents have cost coefficient c_i ≤ τ_{α,β}. τ²_α is a threshold such that, conditioned on their own dataset D_i, each agent i believes that with probability 1 - α any other agent j has cost coefficient c_j ≤ τ_{α,β}.

We next introduce the definition of η-accuracy, where we use the ℓ2-norm distance between the private estimator and the true parameter.

Definition 7 (η-accuracy). A mechanism is η-accurate if its output θ̃^P satisfies E‖θ̃^P - θ*‖_2² ≤ η.
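The threshold strategy of Definition 5 and the threshold τ¹_{α,β} of Definition 6 can be illustrated with the following sketch. The Monte Carlo estimate of τ¹_{α,β} and the uniform fallback report for high-cost agents are illustrative choices rather than constructions from the paper.

```python
import numpy as np

def threshold_report(x_i, y_i, c_i, tau, rng, response_range=(-1.0, 1.0)):
    """Threshold strategy sigma_tau (Definition 5): report truthfully iff c_i <= tau.

    x_i is unused but kept to mirror the signature sigma_tau(x_i, y_i, c_i). Agents with
    c_i > tau may report anything in Y; here a uniform draw from a bounded range stands in.
    """
    if c_i <= tau:
        return y_i
    return rng.uniform(*response_range)

def empirical_tau_1(cost_samples, alpha, beta):
    """Monte Carlo estimate of tau^1_{alpha,beta} over sampled cost profiles (rows of cost_samples):
    the smallest tau such that, with empirical frequency at least 1 - beta, at least a (1 - alpha)
    fraction of the n agents in a profile satisfy c_i <= tau."""
    taus = np.sort(cost_samples, axis=None)
    n = cost_samples.shape[1]
    for tau in taus:
        frac_ok = np.mean((cost_samples <= tau).sum(axis=1) >= (1 - alpha) * n)
        if frac_ok >= 1 - beta:
            return tau
    return taus[-1]
```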
Definition 8 (Individual Rationality). A mechanism is individually rational if the utility received by each agent i has a non-negative expectation: ∀i ∈ [n], E[u_i] ≥ 0.

To satisfy individual rationality, the mechanism should output payments high enough to compensate for privacy costs. Meanwhile, we also expect a total payment budget B = ∑_{i=1}^n E[π_i] = o(1) that tends to zero as the number of agents n increases.

Definition 9 (Asymptotically Small Budget). An asymptotically small budget is such that B = ∑_{i=1}^n E[π_i] = o(1) for all realizable datasets D = {(x_i, y_i)}_{i=1}^n.

To sum up, we have n agents reporting potentially manipulated data, denoted as D̂_i = (x_i, ŷ_i), to an analyst. Subsequently, the analyst computes an estimator of the model parameters using the reported data. To preserve DP, the analyst meticulously introduces controlled noise into the estimator. Furthermore, the analyst strategically compensates the agents for their privacy costs based on their reports, with the goal of incentivizing truthful reporting. This incentivization is implemented under the assumption that the agents adhere to a threshold strategy.

B Related Work

Following the pioneering work of [22], a substantial body of research has explored data acquisition problems involving agents with privacy concerns [20, 32, 57, 15, 21, 18, 19, 13]. The majority of this work operates in a model where agents cannot fabricate their private information; their only recourse is either to withhold it or, at most, misrepresent their costs related to privacy. A related thread [22, 34, 10] explores cost models based on the notion of differential privacy [17]. However, most of the aforementioned work only considers the mean estimation problem and does not consider advanced statistical models [18, 19, 13]. Thus, these methods cannot be applied to linear models. [14] and [37] represent the closest work to ours, as they investigate the challenge of estimating (generalized) linear models from self-interested agents that have privacy concerns. However, as we mentioned above, they only consider the low dimensional case and their methods cannot be applied to the high dimensional sparse setting; see Section C for details.

Classical approaches, including the work by [55, 27, 43, 44, 45, 58, 1, 30, 41, 8, 3, 52, 54], for estimating linear regression models with a differential privacy (DP) guarantee utilize central and local models. These approaches typically involve adding noise to the output of optimization methods, privatizing the objective function, or injecting noise into the gradients during optimization. However, the truthful data acquisition process follows a single-round interactive procedure between the analyst and the agents, in contrast to the multi-round interactions typically encountered in these approaches, indicating that these methods cannot be applied to our problem. The closed-form expression of our estimator ensures that this requirement is satisfied. We believe that our method is non-trivial and has the potential to extend to other related scenarios. Recently, [61] proposed a closed-form private estimator for sparse linear regression. However, as we discussed in Section 3.1, their estimator cannot be directly applied to our problem due to inherent limitations, such as challenges associated with the invertibility of the covariance matrix.

Recently, there has also been some noteworthy literature on estimating linear regression from strategic agents (without privacy considerations).
B Related Work

Following the pioneering work of [22], a substantial body of research has explored data acquisition problems involving agents with privacy concerns [20, 32, 57, 15, 21, 18, 19, 13]. The majority of this work operates in a model where agents cannot fabricate their private information; their only recourse is either to withhold it or, at most, misrepresent their costs related to privacy. A related thread [22, 34, 10] explores cost models based on the notion of differential privacy [17]. However, most of the aforementioned work only considers the mean estimation problem and does not consider advanced statistical models [18, 19, 13]; thus, these methods cannot be applied to linear models. [14] and [37] represent the closest work to ours, as they investigate the challenge of estimating (generalized) linear models from self-interested agents with privacy concerns. However, as mentioned above, they only consider the low dimensional case, and their methods cannot be applied to the high dimensional sparse setting; see Section C for details.

Classical approaches to estimating linear regression models with differential privacy (DP) guarantees, including the work of [55, 27, 43, 44, 45, 58, 1, 30, 41, 8, 3, 52, 54], operate in the central and local models. These approaches typically involve adding noise to the output of optimization methods, privatizing the objective function, or injecting noise into the gradients during optimization. However, the truthful data acquisition process follows a single-round interaction between the analyst and the agents, in contrast to the multi-round interactions typically required by these approaches, so these methods cannot be applied to our problem. The closed-form expression of our estimator ensures that this single-round requirement is satisfied. We believe that our method is non-trivial and has the potential to extend to other related scenarios. Recently, [61] proposed a closed-form private estimator for sparse linear regression. However, as discussed in Section 3.1, their estimator cannot be directly applied to our problem due to inherent limitations, such as the challenges associated with the invertibility of the covariance matrix.

There is also some noteworthy literature on estimating linear regression from strategic agents (without privacy considerations). [25] and [6] investigated fitting a linear model using data from strategic agents who can manipulate their associated costs but not the data. [29] explored a setting without direct payments, in which agents receive a utility that serves as a measure of estimation accuracy. [11] studied the case where agents may intentionally introduce errors to maximize their own benefits and proposed several group-strategyproof linear regression mechanisms. Additionally, [16] and [35] considered an analyst who aims to derive a "consensus" model from data provided by multiple strategic agents; in this setting, agents seek a consensus value that minimizes their individual data loss, and the authors demonstrated that empirical risk minimization is group-strategyproof. However, all of these methods only consider the low dimensional setting and cannot be applied to high dimensional sparse linear regression.

C Challenges in Designing Truthful and Privacy-preserving Sparse Linear Regression

In this section, we revisit previous methods and then explain why existing methods cannot be directly applied to our setting. The challenges associated with the problem make it fundamentally difficult to design a desirable mechanism. In the truthful setting, [14] proposed to use the general ridge regression estimator, i.e., solving the regularized convex program $\arg\min_\theta \frac{1}{2n}\|Y - X\theta\|_2^2 + \gamma\|\theta\|_2^2$, where $\gamma$ is some parameter, $X = (x_1^T, \dots, x_n^T)^T \in \mathbb{R}^{n \times d}$, and $Y = (y_1, \dots, y_n)^T \in \mathbb{R}^n$. The unique closed-form solution can be written as $\hat{\theta}_R(D) = (\gamma I + X^T X)^{-1} X^T Y$. [37] considered the classical ordinary least squares (OLS) estimator $\hat{\theta}_{OLS} = (\hat{\Sigma}_{XX})^{-1} \hat{\Sigma}_{XY}$, where $\hat{\Sigma}_{XX} = \frac{1}{n}\sum_{i=1}^n x_i x_i^T$ and $\hat{\Sigma}_{XY} = \frac{1}{n}\sum_{i=1}^n x_i y_i$. Both of these private estimators are based on the closed-form solution $\hat{\theta}_L(D)$ of ridge regression. However, there are four problems with the above methods $\hat{\theta}_R$ and $\hat{\theta}_{OLS}$.

C.1 Sparsity-encouraging and Private Regressor

The $\ell_2$-norm estimation errors of $\hat{\theta}_R$ and $\hat{\theta}_{OLS}$ are both bounded by $O(\sqrt{d/n})$. This rate does not converge in the data-poor regime, indicating that neither of these two estimators is suitable in our setting. Since the $\ell_2$-regularization used in $\hat{\theta}_R$ is not a sparsity-encouraging technique, we must seek an alternative regularization. A qualified estimator must (1) exploit the sparsity assumption and (2) have a closed-form solution. Apart from the choice of the (non-private) estimator, we also need to reconsider the privacy mechanism. Previous work [14, 37] both adopt the output perturbation method to ensure $\varepsilon$-JDP, i.e., adding Gamma noise to $\hat{\theta}_R$ or $\hat{\theta}_{OLS}$. According to DP theory, the scale of the noise should be proportional to the $\ell_2$-norm sensitivity, which can be readily bounded by $\sqrt{d/n}$ via a clipping operation or under a boundedness assumption on the data. In the high dimensional setting, output perturbation can fail catastrophically as $d \gg n$: the magnitude of the noise grows polynomially with $\sqrt{d/n}$, making the estimation error much larger. This means we must find a privatization mechanism with a better accuracy guarantee.

C.2 The Invertibility of the Covariance Matrix

The invertibility of the covariance matrix is also an issue in high dimensional space. The classical OLS estimator $\hat{\theta}_{OLS}$ used in [37] is no longer well-defined, since $\hat{\Sigma}_{XX}$ is not full-rank (a minimal numerical illustration of these closed forms and the rank deficiency is sketched below).
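The following short sketch reproduces the two low dimensional closed forms and the rank deficiency that breaks the OLS form once $d > n$. The synthetic data and variable names are our own illustrative choices, not the paper's setup.

```python
import numpy as np

def ridge_closed_form(X, y, gamma):
    """theta_R = (gamma * I + X^T X)^{-1} X^T y, the low dimensional closed form."""
    d = X.shape[1]
    return np.linalg.solve(gamma * np.eye(d) + X.T @ X, X.T @ y)

def ols_closed_form(X, y):
    """theta_OLS = Sigma_XX^{-1} Sigma_XY with empirical second moments; the
    solve is ill-posed whenever Sigma_XX is rank deficient, i.e. whenever d > n."""
    n = X.shape[0]
    Sigma_xx, Sigma_xy = X.T @ X / n, X.T @ y / n
    return np.linalg.solve(Sigma_xx, Sigma_xy)

rng = np.random.default_rng(0)
n, d = 50, 200                                   # high dimensional regime: d >> n
theta_star = np.zeros(d); theta_star[:5] = 1.0   # k-sparse underlying parameter
X = rng.normal(size=(n, d))
y = X @ theta_star + 0.1 * rng.normal(size=n)

print(np.linalg.matrix_rank(X.T @ X / n))        # at most n = 50 < d: Sigma_XX is singular
theta_r = ridge_closed_form(X, y, gamma=1.0)     # computable, but neither sparse nor accurate here
# ols_closed_form(X, y) is meaningless in this regime: Sigma_XX has no inverse.
```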
In the case of $\hat{\theta}_R = (\varrho I + X^T X)^{-1} X^T Y$, although $(\varrho I + X^T X)^{-1}$ always exists for some $\varrho > 0$, the caveat with this approach is that its concentration guarantee only holds when $n = \Omega(d)$ [46]. To resolve the invertibility issue posed by high dimensionality, we need to impose a reasonable structural assumption on the covariance matrix $\Sigma$; this assumption will be formally introduced later and serves to yield an efficient estimator of the covariance matrix under that structure.

C.3 Truthfulness Analysis Framework

Due to the additional regularization parameter $\varrho$, $\hat{\theta}_L$ is a biased estimator of $\theta^*$, which complicates the truthfulness analysis. This complexity arises because the OLS estimator $\hat{\theta}_{OLS} = (X^T X)^{-1} X^T y$ maximizes the scoring rule when players truthfully report their responses. For $\hat{\theta}_L(D)$, however, the bias introduced by $\varrho$ causes the optimal report of each player to deviate from truthful reporting by a quantity proportional to the bias. To address this issue, [14] shows that as long as $\varrho$ grows more slowly than $n$, the bias term of $\hat{\theta}_L$ converges to zero with high probability. More specifically, it is shown that the estimation error can be decomposed into the sum of a variance term and a bias term:
$$\mathbb{E}\big[\|\hat{\theta}(D) - \theta^*\|_2^2\big] = \mathrm{trace}\big(\mathrm{Cov}(\hat{\theta}(D))\big) + \|\mathrm{bias}(\hat{\theta}(D))\|_2^2,$$
where $\mathrm{trace}(\mathrm{Cov}(\hat{\theta}(D))) = \sigma^2\,\mathrm{trace}\big((\varrho I + X^T X)^{-1} X^T X\, (\varrho I + X^T X)^{-1}\big)$, $\mathrm{bias}(\hat{\theta}(D)) = -\varrho\,(\varrho I + X^T X)^{-1}\theta^*$, and $\sigma^2$ is the variance of the noise variables $\zeta_i$. However, it remains unclear whether a closed-form bias-variance decomposition is available for our estimator proposed in Section 3.1.

D Limitation

Although some truthful mechanism studies [12] provide payment schemes for multiple-round data acquisition, existing truthful linear regression research restricts the mechanism to collecting data only once. In this paper we mainly follow the analysis framework of [14] to guarantee the truthful mechanism properties; therefore our work is still hampered by the one-round communication constraint. It remains uncertain whether this constraint can be removed in the future.

E Supporting lemmas

Definition 10 (Sub-Gaussian random variable). A zero-mean random variable $X \in \mathbb{R}$ is said to be sub-Gaussian with variance $\sigma^2$, written $X \sim \mathrm{subG}(\sigma^2)$, if its moment generating function satisfies $\mathbb{E}[\exp(tX)] \le \exp(\frac{\sigma^2 t^2}{2})$ for all $t$. For a sub-Gaussian random variable $X$, its sub-Gaussian norm $\|X\|_{\psi_2}$ is defined as $\|X\|_{\psi_2} = \inf\{c > 0 : \mathbb{E}[\exp(\frac{X^2}{c^2})] \le 2\}$. In particular, if $X \sim \mathrm{subG}(\sigma^2)$ we have $\|X\|_{\psi_2} \le O(\sigma)$.

Definition 11 (Sub-Gaussian random vector). A zero-mean random vector $X \in \mathbb{R}^d$ is said to be sub-Gaussian with variance $\sigma^2$ (for simplicity, $\sigma^2$-sub-Gaussian), denoted $X \sim \mathrm{subG}_d(\sigma^2)$, if $\langle X, u \rangle$ is sub-Gaussian with variance $\sigma^2$ for any unit vector $u \in \mathbb{R}^d$.

Lemma 9. For a sub-Gaussian vector $X \sim \mathrm{subG}_d(\sigma^2)$, with probability at least $1 - \delta$ we have

Lemma 10 ([47]). Let $X_1, X_2, \dots, X_n$ be $n$ (zero-mean) random variables such that each $X_i$ is sub-Gaussian with variance $\sigma^2$. Then the following holds:
$$\mathbb{P}\Big(\max_{i \le n} X_i \ge t\Big) \le n e^{-\frac{t^2}{2\sigma^2}}, \qquad \mathbb{P}\Big(\max_{i \le n} |X_i| \ge t\Big) \le 2n e^{-\frac{t^2}{2\sigma^2}}.$$

Below is a lemma related to the Gaussian random variable; we will employ it to bound the noise added by the Gaussian mechanism.

Lemma 11. Let $\{x_1, \dots, x_n\}$ be $n$ random variables sampled from the Gaussian distribution $N(0, \sigma^2)$. Then $\mathbb{E}\big[\max_{1 \le i \le n} |x_i|\big] \le \sigma\sqrt{2\log(2n)}$ and $\mathbb{P}\big(\max_{1 \le i \le n} |x_i| \ge t\big) \le 2n e^{-\frac{t^2}{2\sigma^2}}$. In particular, if $n = 1$, we have $\mathbb{P}(|x_i| \ge t) \le 2e^{-\frac{t^2}{2\sigma^2}}$.

Lemma 12 (Hoeffding's inequality). Let $X_1, \dots, X_n$ be independent random variables bounded by the interval $[a, b]$. Then, for any $t > 0$, $\mathbb{P}\big(\big|\sum_{i=1}^n (X_i - \mathbb{E}[X_i])\big| \ge t\big) \le 2\exp\big(-\frac{2t^2}{n(b-a)^2}\big)$.

Lemma 13 (Bernstein's inequality for bounded random variables).
Let X1, X2, , Xn be independent centered bounded random variables, i.e. |Xi| M and E [Xi] = 0, with variance E X2 i = σ2. Then, for any t > 0, 2nσ2t + 2Mt Lemma 14 (Bound on threshold τα,β). Under the Assumption 5, τα,β 1 λ log 1 αβ . Lemma 15. For any w, w Rd and closed convex set C Rd we have ΠC(w) ΠC(w ) 2 w w 2, where ΠC is the projection operation onto the set C, i.e., ΠC(v) = arg minu C u v 2. E.1 Lemmas for privacy guarantee Lemma 16 (Post-processing). Let M : X n Y be an ε-differential private mechanism. Consider F : Y Z as an arbitrary randomized mapping. Then the mechanism F M is ε-differential private. Lemma 17 (Billboard lemma [26]). Let M : Dn O be an (ε, δ)-differential private mechanism. Consider a set of n functions πi : D O R, for i [n]. Then the mechanism M : Dn O Rn that computes r = M(D) and outputs M (D) = (r, π1(D1, r), , πn(Dn, r)), where Di is the agent i s data, is (ε, δ)-joint differential private. Theorem 18 (Parallel Composition Theorem). Let O1, O2, ..., Ok be n independent operations that satisfy (εi, δ-DP for each i [k], then the parallel composition of these operations guarantee (ε1 + ε2 + ... + εk, δ)-DP. Theorem 19 (Sequential Composition Theorem(Theorem 4 in [33])). If O1, O2, ..., Ok are sequential operations that satisfy ε-individual differential privacy with a failure probability of δ, then the sequential composition of these operations guarantees (ε1 +ε2 +...+εk, k δ)-individual differential privacy, where εi represents the ε of operation Oi, and k is the total number of operations performed. E.2 Technical lemmas for upper bounding the accuracy Lemma 20. Let ˆθ(D) and ˆθ(D ) be the estimators on two fixed datasets D, D that differ on at most k entries and let ˆD denote the fixed dataset that differs from D on at most one entries. Suppose that with probability at least 1 γn, if ˆθ(D) ˆθ( ˆD) 2 λn then we have with probability at least 1 gγn, it holds that ˆθ(D) ˆθ(D ) 2 gλn. Lemma 21 (Theorem 2 in [59]). Suppose we solve the problem of the form minθ θ θ 2 2 +λn θ 1, such that constraint term λn is set as λn θ θ(D) . Then, the optimal solution ˆθ = Sλn( θ) satisfies: ˆθ(D) θ 2λn, ˆθ(D) θ 2 4 kλn, ˆθ(D) θ 1 8kλn. Lemma 22. Under Assumption 1 and 6, we set τx = Θ( σ log n d ), τy = Θ(σ log n), r = Θ(σ log n), λn = O r2 log d log 1 , when n is sufficient large such that n Ω( s2r4 log d log 1 with probability at least 1 O(d Ω(1)), one has ˆθP (D) θ 2 Lemma 23. (Weyl s Inequality[42]) Let X, Y Rd d be two symmetric matrices, and E = X Y . Then, for all i = 1, , d, we have |λi(X) λi(Y )| E 2, where we take some liberties with the notation and use λi(M) to denote the i-th eigenvalue of the matrix M. E.3 Technical lemmas for covariance matrix estimation Lemma 24 ([4]). If {x1, x2, , xn} are n realizations of a (zero mean) σ2-sub-Gaussian random vector X with covariance matrix ΣXX = E[XXT ], and ˆΣ X X = ˆσ x x T ,ij 1 i,j d = 1 n Pn i=1 xi x T i is the empirical covariance matrix, then there exist constants C1 and γ > 0 such that for any i, j [d], we have: P ˆΣ X X ΣXX , > t C1e nt2 8 for all |t| ϕ with some ϕ, where C1 and γ are constants and depend only on σ2. Specifically, ˆΣ X X ΣXX , γ Lemma 25. For every fixed 1 i, j d, there exists a constant C1 > 0 such that with probability at least 1 C1d 9 2 , the following holds: σ x x T ,ij σxx T ,ij ( σxx T ,ij , γ 2 ln 1.25/δ log d F Proofs of Supporting Lemmas Proof of Lemma 15. Denote b = ΠC(w) and b = ΠC(w ). 
Since b and b are in C, so the segment bb is contained in C, thus we have for all t [0, 1], (1 t)b + tb w 2 b w 2. Thus dt tb + (1 t)b w 2 2|t=0 = 2 b b, b w Similarly, we have b b , b w 0. Now consider the function D(t) = (1 t)b + tw (1 t)b tw 2 2 = b b +t(w w +b b) 2 2, which is a quadratic function in t. And by the previous two inequalities we have D (0) = 2 b b , w w + b b 0. Thus D( ) is a increasing function on [0, 2), thus D(1) D(0) which means w w 2 b b 2. Proof of Lemma 14. We first bound τ 1 α,β. Since n = #{i : ci τ} + #{i : ci > τ}, the event {#{i : ci τ} (1 α)n} is equivalent to the event {#{i : ci > τ} αn}. Thus, by the definition of τ 1 α,β, τ 1 α,β = inf{τ > 0 : P(c1, ,cn) pn(#{i : ci > τ} αn) 1 β} = inf{τ > 0 : P(c1, ,cn) pn(#{i : ci > τ} > αn) β}, By Markov s inequality, we have P(c1, ,cn) pn(#{i : ci > τ} > αn) E(c1, ,cn) pn[Pn i=1 I{ci>τ}] αn = Pn i=1 Eci p[1{ci>τ}] αn = n P[ci > τ] αn = P[ci > τ] Thus, {τ > 0 : P(ci > τ) αβ} {τ > 0 : P(c1, ,cn) pn(#{i : ci > τ} > αn) β}, which implies τ 1 α,β inf{τ > 0 : P(ci > τ) αβ}. The Assumption 5 implies that P(ci > τ) e λτ. Hence, τ 1 α,β 1 λ log 1 αβ . By the definition of τ 2 α and Assumption 5, we have τ 2 α 1 α. Since β (0, 1), 1 λ log 1 αβ > 1 α, then τα,β = max{τ 1 α,β, τ 2 α} 1 λ log 1 αβ . Proof of Lemma 4. For the ease of presentation, we denote the initial parameter ( Σ 1 X X) Σ e X e Y on which the soft-thresholding Sλn is performed by θ. Let be the error vector = ˆθP (D) ˆθP ( ˆD). It follows that = ˆθP ( ˆD) θ( ˆD) + θ( ˆD) θ(D) + θ(D) ˆθP (D) ˆθP ( ˆD) θ( ˆD) + θ( ˆD) θ(D) + θ(D) ˆθP (D) (6) where we utilize the fact that ˆθP ( ˆD) is feasible. Using the property of generalized thresholding operator, we can bound the first term of equation 6 as ˆθP ( ˆD) θ( ˆD) λn. For the third term of equation 6, from Lemma 22, we know that θ(D) ˆθP (D) λn. Our next task it to show λn is indeed greater than the second term of equation 6 θ( ˆD) θ(D) . It can be expressed as θ( ˆD) θ(D) = Σ X X( ˆD) 1 Σ e X e Y ( ˆD) Σ X X(D) 1 Σ e X e Y (D) . Applying the inequality AB A B = AB AB + AB A B A B B + A A B , we have, Σ X X( ˆD) 1 Σ e X e Y ( ˆD) Σ X X(D) 1 Σ e X e Y (D) Σ X X( ˆD) 1 Σ e X e Y ( ˆD) Σ e X e Y (D) + Σ X X( ˆD) 1 Σ X X(D) 1 Σ e X e Y (D) For the first term of equation 7, we see from Lemma 1 that when n Ω s2r4 log d log 1 , Σ X X 1 2 κ . And by equation 15, we have with probability 1 2d 8 that Σ e X e Y ( ˆD) Σ e X e Y (D) = ˆΣ e X e Y ( ˆD) N2( ˆD) + N2(D) ˆΣ e X e Y (D) ˆΣ e X e Y ( ˆD) ˆΣ e X e Y (D) + N2( ˆD) + N2(D) n max i ( xi 2(| yi|) + xi 2(| yi|)) + 8rτy q For the second term of equation 7, note that for any two nonsingular square matrices A, B with the same size, it holds that A 1 B 1 = B 1(A B)A 1. Thus, by Lemma 1, we have Σ X X( ˆD) 1 Σ X X(D) 1 Σ X X( ˆD) 1 Σ X X(D) 1 Σ X X( ˆD) Σ X X(D) Σ X X( ˆD) Σ + Σ X X(D) Σ log d log( 1 Combining above, we have ˆθP ( ˆD) ˆθP (D) log d log( 1 To further our analysis, we introduce the notion of decomposibility of norm and subspace compatibility constant. We call M, M a subspace pairs where M is the model subspace in which the estimated model parameter θ and similarly structured parameters lie, and which is typically low-dimensional, while M is the perturbation subspace of parameters that represents perturbations away from the model subspace. Definition 12. A regularization function R is said to be decomposable with respect to a subspace pair M, M , if R(u + v) = R(u) + R(v), for all u M, v M . 
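As a quick numerical illustration of Definition 12, the minimal snippet below (our own sketch; the dimension, support set, and random vectors are illustrative) checks that the $\ell_1$ norm is decomposable with respect to a support set $S$ and its complement, and that $\|u\|_1/\|u\|_2 \le \sqrt{k}$ for a $k$-sparse $u$, the ratio that later plays the role of the subspace compatibility constant for $(\ell_1, \ell_2)$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, support = 10, [1, 4, 7]                      # S: coordinates in `support`; S^c: the rest
k = len(support)

u = np.zeros(d)
u[support] = rng.normal(size=k)                 # u lies in the model subspace S
v = np.zeros(d)
mask = np.ones(d, dtype=bool); mask[support] = False
v[mask] = rng.normal(size=mask.sum())           # v lies in the perturbation subspace S^c

# Decomposability of the l1 norm: ||u + v||_1 = ||u||_1 + ||v||_1 for u in S, v in S^c.
assert np.isclose(np.linalg.norm(u + v, 1),
                  np.linalg.norm(u, 1) + np.linalg.norm(v, 1))

# For a k-sparse u, ||u||_1 / ||u||_2 <= sqrt(k) (Cauchy-Schwarz).
print(np.linalg.norm(u, 1) / np.linalg.norm(u, 2) <= np.sqrt(k) + 1e-12)   # True
```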
Note that when R( ) is a norm, by the triangle inequality, the LHS is always less than or equal to the RHS, so that the equality indicates the largest possible value for the LHS. For notational simplicity, we use (S, Sc) instead of an arbitrary subspace pair M, M . Definition 13. The subspace compatibility constant is defined as M : Ψ(M, ) := supu M\{0} R(u) It is noted that subspace compatibility constant that captures the relationship between the regularization function R( ) and the error norm , over vectors in subspace. In our proof, it is clear that the regularization is the ℓ1-norm, which is decomposable with respect to the subspace pair. We also denote the subspace of vectors in Rd by S. Since we assume k is the sparsity level of θ , the cardinality of the support set of the model space where the true parameter θ lies. It can be seen that any parameter θ S would be at-most k-sparse. Therefore, we have that Ψ (S, 2) Additionally, we use the notion ΠS( ) to represent the ℓ2 projection onto the model space S. Then, by the assumption of the statement that θ Sc = 0, and the decomposability of 1 with respect to (S, Sc), ˆθP ( ˆD) 1 = ˆθP ( ˆD) 1 + ΠSc( ) 1 ΠSc( ) 1 = ˆθP ( ˆD) + ΠSc( ) 1 ΠSc( ) 1 (i) ˆθP ( ˆD) + ΠSc( ) + ΠS( ) 1 + ΠS( ) 1 ΠSc( ) 1 = ˆθP ( ˆD) + 1 + ΠS( ) 1 ΠSc( ) 1 (9) where the equality (i) holds by the triangle inequality, which is the basic property of norms. Since we are minimizing the objective function ˆθP (D) 1, we obtain the inequality of ˆθP ( ˆD) + 1 = ˆθP (D) 1 ˆθP ( ˆD) 1. Combining this inequality with equation 9, we have 0 ΠS( ) 1 ΠSc( ) 1 (10) Armed with inequalities equation 8 and equation 10, we utilize the Hölder s inequality and the decomposability of our regularizer 1 in order to derive the error bounds in terms of ℓ2 norm: 2 2 = , 1 ( ΠS( ) 1 + ΠSc( )) 1. Since the error vector satisfies the inequality equation 10, 2 2 2 ΠS( ) 1 (11) Combining all the pieces together yields 2 2 4Ψ(S)λn ΠS( ) 2 where Ψ(M) is the abbreviation for Ψ (S, 2). Notice that the projection operator is non-expansive, ΠS( ) 2 2 2 2. Hence, we obtain ΠS( ) 2 4Ψ(S)λn, and plugging it back into equation 11 yields the ℓ2 error bounds. Proof of Lemma 20. Define a sequence of datasets D0, D1, , Dk, such that D0 = D, Dk = D , and for each i [k], Di, Di 1 differ on at most one agent s dataset. Then, by the triangular inequality, we obtain ˆθ(D) ˆθ(D ) 2 = ˆθ(D0) ˆθ(Dk) 2 = i=1 ˆθ(Di 1) ˆθ(Di) 2 i=1 ˆθ(Di 1) ˆθ(Di) 2 g n. with probability at least 1 kγn by taking a union bound over g failure probabilities γn. Proof of Lemma 22. Before giving theoretical analysis, we first prove that Σ X X is invertible with high probability. With the help of Weyl s Inequality (Lemma 23), we can see that to show Σ X X is invertible it is sufficient to show that Σ X X Σ 2 λmin(Σ) 2 . This is due to that by Lemma 23, we have λmin(Σ) Σ X X Σ 2 λmin( Σ X X). Thus, if Σ X X Σ 2 λmin(Σ) 2 , we have λmin( Σ X X) λmin(Σ) Thus, by Lemma 1, it is sufficient to show that λmin(Σ) O sr2 log d log 1 , which is true under the assumption of n Ω s2r4log d log 1 δ ε2λmin(Σ)2 . Thus, with probability at least 1 exp( Ω(d)) ξ, it is invertible. In the following we will always assume that this event holds. To prove the theorem, we first introduce the following lemma on the estimation error of ˆθ in equation 2. Note that this is a non-probabilistic result, and it holds deterministically for any selection of λn or any distributional setting of the covariates xi. 
Our goal is to show that λn θ Σ X X 1 Σ e X e Y under the assumptions specified in Lemma 21. θ ˆθP (D) = θ Σ X X 1 Σ e X e Y Σ X X θ bΣ e X e Y + N2 where the vector N2 Rd is sampled from N(0, 32τ 2 xτ 2 y log 1.25 δ n2ε2 Id). We first develop upper bound of Σ X X. For any nonzero vector w Rd, Note that Σ X Xw = Σ X Xw Σw + Σw Σw Σ X X Σ w Our objective is to find a sufficiently large n such that Σ X X Σ is less than κ By Lemma 1 we can see the following: Σ X X Σ 2 = Σ X X Σ 2 1 s2 log d log 1 1 q + log d log 1 Thus, when n Ω s2r4 log d log 1 , we have Σ X Xw κ 2 w , which implies Σ X X 1 2 κ . Given sufficiently large n, from equation 12, we have: θ ˆθP (D) Σ X X θ bΣ e X e Y + N2 bΣ e X e Y Σ e X e Y | {z } T1 + Σ e X e Y ΣY X | {z } T2 + Σ X X Σ θ | {z } T3 + N2 | {z } N2 We will bound the above four terms one by one. For 1 j d, we have Var ( yiexij) E ( yiexij)2 E (yixij)2 q Ey4 i Ex4 ij =: v1 < . In addition, E | yiexij|p (τxτy)p 2 v1 holds. Therefore, according to Lemma 13, we have: bσe Y exj σe Y exj where bσe Y exj = 1 n Pn i=1 yiexij, σe Y exj = E yiexij and c is a certain constant. Then by the union bound, the following can be derived: Next, we give an estimation of T2. Note that for 1 j d, by lemma 9 we have: E yiexij Eyixij = E yiexij E yixij + E yixij Eyixij = E yi (exij xij) + E ( yi yi) xij E y2 i (exij xij)2 P (|xij| τx) + E ( yi yi)2 x2 ij P (|yi| τy) 2e τ2 x 2σ2 + 2e τ2 y 2σ2 , which shows that T2 v1 2e τ2 x 2σ2 + 2e τ2 y 2σ2 . To upper bound term T3, we need to evaluate Σ X X Σ and we can reuse obtained results from Lemma 1. We can see that T3 is bounded by O( log d log 1 δ nε ). Here we used the fact that Σ X X Σ is a symmetric matrix. ( Σ X X Σ)θ Σ X X Σ 2 θ 2 Σ X X Σ 1 θ 2 (log d log 1 δ nε2 )) θ 2 , given the selection of r. The last term of equation 14 can be bounded by Gaussian tail bound by lemma 11. With probability 1 O(d 8), we have: Finally combining all pieces, we can find that T3 is the dominating term. Since λn θ Σ X X 1 Σ e X e Y , Lemma 21 implies that with probability at least 1 O(d 8) e Ω(d), θ h Σ X X i 1 bΣ e X e Y + N2 2 O k log d log 1 which completes our proof of Theorem. Proof of Lemma 1. Our goal is to prove E Σ X X Σ 2 2 ln 1.25/δ log d 2 ln 1.25/δ log d (16) for some constant C. Since Σ X X Σ is symmetric, we know that Σ X X Σ 1 = Σ X X Σ . Thus, it suffices to prove that the bound in equation 16 holds for Σ X X Σ 1. Define the event Aij by Aij = σ x x T ,ij σxx T ,ij 4 min σxx T ,ij , Thres . (17) Then by lemma 25, we have P (Aij) 1 2C1d 9/2. Let D = (dij)1 i,j d with dij = σ x x T ,ij σxx T ,ij I Ac ij , then the following holds. E Σ X X Σ 2 1 2E σ x x T ,ij σxx T ,ij I (Aij) 2 ln 1.25/δ log d ( σxx T ,ij , γ 2 ln 1.25/δ log d 2 ln 1.25/δ log d Note the first term in equation 18 is bounded by Cs2 q 2 ln 1.25/δ log d the first term is dominating, while the second term E D 2 1 is comparably negligible. By setting 2 ln 1.25/δ log d ( σxx T ,ij , γ 2 ln 1.25/δ log d 2 ln 1.25/δ log d 2 ln 1.25/δ log d 2 ln 1.25/δ log d + s1/q (k )1 1/q # 2 ln 1.25/δ log d which implies equation 16 if E D 2 1 = O 1 n . We shall now show that E D 2 1 = O 1 n . Note that: E D 2 1 d X ij E d2 ij I Ac ij σ x x T ,ij = σ x x T ,ij + d2 ij I Ac ij σ x x T ,ij = 0 ij E n σ x x T ,ij σxx T ,ij 2 I Ac ij o + d X ij Eσ2 xx T ,ij I Ac ij σ x x T ,ij = 0 Lemma 25 yields that P Ac ij 2C1d 9/2, and the Whittle inequality (Theorem 2 in [56]) implies σ x x T ,ij σxx T ,ij has all finite moments under the sub-Gaussianity condition. 
Hence, we have: ij E n σ x x T ,ij σxx T ,ij 2 I Ac ij o h E σ x x T ,ij σxx T ,ij 6i1/3 P2/3 Ac ij n d 3 = C8/n On the other hand, ij Eσ2 xx T ,ij I Ac ij σ x x T ,ij = 0 ij Eσ2 xx T ,ij I σxx T ,ij 4 2 ln 1.25/δ log d σ x x T ,ij γ 2 ln 1.25/δ log d ij σ2 xx T ,ij EI σxx T ,ij 4γ 2 ln 1.25/δ log d σxx T ,ij σ x x T ,ij σxx T ,ij γ 2 ln 1.25/δ log d ij σ2 xx T ,ij EI σ x x T ,ij σxx T ,ij > 3 σxx T ,ij 4 2 ln 1.25/δ log d ij σ2 xx T ,ij EI σxx T ,ij > 4γ 2 ln 1.25/δ log p I ˆσ x x T ,ij σxx T ,ij + |n1,ij| 3 ij σ2 xx T ,ij P ˆσ x x T ,ij σxx T ,ij 3 σxx T ,ij |n1,ij| \ ( σxx T ,ij > 4γ 2 ln 1.25/δ log d ij σ2 xx T ,ij P ˆσ x x T ,ij σxx T ,ij 3 σxx T ,ij |n1,ij| \ |n1,ij| 1 \ ( σxx T ,ij > 4γ 2 ln 1.25/δ log d ij σ2 xx T ,ij P ˆσ x x T ,ij σxx T ,ij 3 σxx T ,ij |n1,ij| \ |n1,ij| 1 σxx T ,ij \ ( σxx T ,ij > 4γ 2 ln 1.25/δ log d This gives us: ij σ2 xx T ,ij P ˆσ x x T ,ij σxx T ,ij 1 \ ( σxx T ,ij > 4γ 2 ln 1.25/δ log d ij σ2 xx T ,ij P |n1,ij| 1 \ ( σxx T ,ij > 4γ 2 ln 1.25/δ log d For the first term of equation 19, by Lemma 24 we have: ij σ2 xx T ,ij P ˆσ x x T ,ij σxx T ,ij 1 σxx T ,ij \ ( σxx T ,ij 4γ ij nσ2 xx T ,ij exp n 2σ2 xx T ,ij σxx T ,ij 4γ nσ2 xx T ,ij exp n σ2 xx T ,ij n σ2 xx T ,ij σxx T ,ij 4γ n d 16 = O 1 For the second term of equation 19, by Lemmas 11 and 24 we have ij σ2 xx T ,ij P σxx T ,ij \ ( σxx T ,ij > 4γ 2 ln 1.25/δ log d ij σ2 xx T ,ij P 2 ln 1.25/δ log d P |n1,ij| > 1 ij σ2 xx T ,ij exp n + 4σ1 log d 2 σ2 xx T ,ij 32σ2 1 Cσ2 1d d2 exp γ2 log d Cσ2 1d 5 2nσ2 1 γ2 log d = O log 1/δ Putting R1 and R2 together yields that for some constant C > 0, Proof of Lemma. 25. Firstly, let us note that Σ X X = ˆΣ X X + N1 = Pn i xi x T i + N1. Therefore by Lemma 24 and Lemma 11 with probability at least 1 Cd 8, for all 1 i, j d, and for some constants γ and C that depends on σN1, σ x x T ,ij σxx T ,ij γ log d log 1 Define the event Aij = σ x x T ,ij > γ q 2 ln 1.25/δ log d . Let ˆΣ X X = ˆσ x x T ,ij 1 i,j d and N1 = (n1,ij)1 i,j d.We have: σ x x T ,ij σxx T ,ij = σxx T ,ij I Ac ij + σ x x T ,ij σxx T ,ij I (Aij) . (21) By the triangle inequality, it is easy to see that ( σ x x T ,ij σxx T ,ij + σxx T ,ij > γ 2 ln 1.25/δ log d ( σ x x T ,ij σxx T ,ij > γ 2 ln 1.25/δ log d εn σxx T ,ij ) ( σ x x T ,ij σxx T ,ij + σxx T ,ij γ 2 ln 1.25/δ log d ( σ x x T ,ij σxx T ,ij > σxx T ,ij 2 ln 1.25/δ log d Depending on the value of σxx T ,ij, we need to consider the following three cases. Case 1. σxx T ,ij γ 2 log 1.25/δ log d For this case, we have: σ x x T ,ij σxx T ,ij > 3γ 2 ln 1.25/δ log d This is due to the fact: σ x x T ,ij σxx T ,ij > 3γ 2 ln 1.25/δ log d ˆσ x x T ,ij σxx T ,ij > 3γ 2 ln 1.25/δ log d Bij \ ( 3 p 2 ln 1.25/δ log d |n1,ij| > 0 Bij \ ( 3 p 2 ln 1.25/δ log d ˆσ x x T ,ij σxx T ,ij > 3γ 3 ln 1.25/δ log d where event Bij denotes Bij = ˆσ x x T ,ij σxx T ,ij > 3γ 2 ln 1.25/δ log d and the last inequality comes from lemma 11 and 24. Thus by equation 21, with probability at least 1 C1d 9 2 , we have: σ x x T ,ij σxx T ,ij = σxx T ,ij , which satisfies equation 5. Case 2. σxx T ,ij 2γ q 2 ln 1.25/δ log d For this case, we have σ x x T ,ij σxx T ,ij γ 2 ln 1.25/δ log d C1d 8 + 2d 8, where the proof is identical to the proof for case 1. Thus, with probability at least 1 C1d 9 2 2d 8, we have: σ x x T ,ij σxx T ,ij = σ x x T ,ij σxx T ,ij . Also, by equation 20, equation 5 also holds. Case 3. Otherwise, 2 log 1.25/δ log d nε σxx T ,ij 2γ 2 ln 1.25/δ log d For this case, we have σ x x T ,ij σxx T ,ij = σxx T ,ij or σ x x T ,ij σxx T ,ij . 
When σxx T ,ij γ q 2 ln 1.25/δ log d εn , we can derive from equation 20 that with probability at least 1 2d 6 C1d 8 σ x x T ,ij σxx T ,ij γ 2 ln 1.25/δ log d εn 4 σxx T ,ij . Thus, equation 5 holds whether σxx T ,ij γ q 2 ln 1.25/δ log d εn or not, which completes the proof of Lemma 25. G Omitted Proofs Proof of Theorem 2. By Gaussian mechanism and the post-processing processing property, it is easily to see that releasing Σ X X satisfies ( ε 2)-DP, releasing Σ e X e Y satisfies ( ε 2)-DP. Thus, the output of Algorithm 1 is (ε, δ)-DP. Next, we show that the output of the mechanism satisfies joint differential privacy using Billboard Lemma (Lemma 17). The estimators ˆθP ( ˆD0) and ˆθP ( ˆD1) are computed in the same way as ˆθP ( ˆD), so ˆθP ( ˆD0) and ˆθP ( ˆD1) each satisfy (ε, δ)-JDP. Since ˆθP ( ˆD0) and ˆθP ( ˆD1) are computed on disjoint subsets of the data, then by the Parallel Composition Theorem, together they satisfy (ε, 2δ)-JDP. By the Sequential Composition Theorem (Lemma 19), the estimators (ˆθP ( ˆD),ˆθP ( ˆD0),ˆθP ( ˆD1)) together satisfy (2ε, 3δ)-JDP. Finally, using the post-processing property and Billboard Lemma 17, the output ( θP ( ˆD), θP ( ˆD0), θP ( ˆD1), {πi(Di, θP ( ˆDb))}n i=1) of Algorithm 2 satisfies (2ε, 3δ)-JDP. Proof of Theorem 3. For any realization D held by agents, let ˆD = στα,β(D). Then by Lemma 15 we have E[ θP ( ˆD) θ 2 2] E ˆθP ( ˆD) θ 2 2 =E ˆθP ( ˆD) ˆθP (D) + ˆθP (D) θ 2 2 =E ˆθP ( ˆD) ˆθP (D) 2 2 + 2 ˆθP ( ˆD) ˆθP (D), ˆθP (D) θ + ˆθP (D) θ 2 2 2E ˆθP ( ˆD) ˆθP (D) 2 2 + 2E ˆθP (D) θ 2 2. (22) For the first term of equation 22, if we set the constraint bound λn = O r2 log d log 1 from Lemma 4 and Lemma 20 that when n is sufficient large such that n Ω( s2r4 log d log 1 δ ε2κ ), with probability at least 1 β O(d Ω(1)) we have E ˆθP ( ˆD) ˆθP (D) 2 16 For the last term of equation 22, by Lemma 22, when n is sufficient large such that n Ω( s2r4 log d log 1 δ ε2κ ), with probability at least 1 O(d Ω(1)), E ˆθ(D) θ 2 Combining equation 23 and equation 24 yields that with probability at least 1 β O(d Ω(1)), E[ θP ( ˆD) θ 2 2] E[ ˆθP ( ˆD) θ 2 2] O kα2 r4log d log 1 δ ε2 + r4log d log 1 Proof of Theorem 5. Suppose all agents other than i are following strategy στα,β. Let agent i be in group 1 b, b {0, 1}. We will show that στα,β achieves η-Bayesian Nash equilibrium by bounding agent i s incentive to deviate. Assume that ci τα,β, otherwise there is nothing to show because agent i would be allowed to submit an arbitrary report under στα,β. For ease of notation, we write σ for στα,β for the remainder of the proof. We first compute the maximum expected amount (based on their belief) that agent i can increase their payment by misreporting to the analyst, i.e. E h πi( ˆDi, σ(Db, cb))|Di, ci] E[πi(Di, σ(Db, cb))|Di, ci i = E h Ba1,a2 xi, θP ( ˆDb) ), xi, Eθ p(θ| ˆ Di)[θ] Di, ci i E h Ba1,a2 xi, θP ( ˆDb) , xi, Eθ p(θ|Di)[θ] Di, ci i . (25) Note that Ba1,a2(p, q) = a1 a2(p 2pq + q2) is linear with respect to p, and is a strictly concave function of q maximized at q = p. Thus, equation 25 is upper bounded by the following with probability 1 C1n Ω(1) Ba1,a2 h E h xi, θP ( ˆDb) |Di, ci i , E h xi, ˆθP ( ˆDb) |Di, ci ii Ba1,a2 h E[ xi, θP ( ˆDb) |Di, ci], xi, Eθ p(θ|Di)[θ] i = a2 E[ xi, θP ( ˆDb) |Di, ci] xi, Eθ p(θ|Di)[θ] 2 = a2 E[ xi, θP ( ˆDb) xi, Eθ p(θ|Di)[θ] |Di, ci] 2 a2 E[ x T i ( θP ( ˆDb) Eθ p(θ|Di)[θ])|Di, ci] 2 a2 xi 2 2 E[ θP ( ˆDb) Eθ p(θ|Di)[θ]|Di, ci] 2 2 r2a2 E[ θP ( ˆDb) Eθ p(θ|Di)[θ]|Di, ci] 2 2. 
We continue by bounding the term E[ θP ( ˆDb) Eθ p(θ|Di)[θ]|Di, ci] 2. By Lemma 15 E[ θP ( ˆDb) Eθ p(θ|Di)[θ]|Di, ci] 2 E[ θP ( ˆDb) θP (Db)|Di, ci] 2 + E[ θP (Db)|Di, ci] Eθ p(θ|Di)[θ]|Di, ci] 2 E[ˆθP ( ˆDb) ˆθP (Db)|Di, ci] 2 + E[ θP (Db)|Di, ci] Eθ p(θ|Di)[θ]|Di, ci] 2 ˆθP ( ˆDb) ˆθP (Db) 2 + E[ θP (Db)|Di] Eθ p(θ|Di)[θ] 2 (26) For the first term of equation 26, since agent i believes that with at least probability 1 β, at most αn agents will misreport their datasets under threshold strategy στα,β, datasets Db and ˆDb differ only on at most αn agents datasets. By Lemma 20 and Lemma 4, we set the constraint bound λn = O r2 log d log 1 , when n is sufficient large such that n Ω( s2r4 log d log 1 δ ε2κ ), with probability at least 1 β O(αnd Ω(1)) we have that E ˆθP ( ˆD) ˆθP (D) 2 16 For the second term of equation 26 : E[ θP (Db)|Di] Eθ p(θ|Di)[θ] = EDb p(Db|Di)[ θP (Db)] Eθ p(θ|Di)[θ] = Eθ p(θ|Di)[EDb p(Db|θ)[ θP (Db)]|θ] Eθ p(θ|Di)[θ] = Eθ p(θ|Di)[EDb p(Db|θ)[ θP (Db) θ]|θ]. p(Db|θ) = p(Xb, yb|θ) = p(yb|Xb, θ)p(Xb|θ) = p(yb|Xb, θ)p(Xb), EDb p(Db|θ)[ˆθ(Db) θ] = EXb[Eyb[ θP (Xb, yb) θ]|Xb, θ]. Since we have the prior knowledge that θ 2 τθ. Thus, for the posterior distribution θ p(θ| ˆDi) it will also have θ 2 τθ. By Jensen s inequality, Theorem 3 and Lemma 15, we have E[ θP (Db)|Di] Eθ p(θ|Di)[θ] 2 Eθ p(θ|Di),Xb[Eyb[ θP (Xb, yb) θ 2|Xb, θ]] Eθ p(θ|Di),Xb[Eyb[ ˆθP (Xb, yb) θ 2|Xb, θ]] k log d log 1 In addition to an increased payment, agent i may also experience decreased privacy costs from misreporting. By Assumption 3, this decrease in privacy costs is bounded above by ci8(1 + 3δ)ε3. Since we have assumed ci τα,β, the decrease in privacy costs for agent i is bounded above by τα,β8(1 + 3δ)ε3. Hence, agent i s total incentive to deviate is bounded above by α2r6k log d log 1 δ ε2 + r4log d log 1 + τα,βδε2 . Proof of Theorem 6. Let agent i have privacy cost ci τα,β and consider agent i s utility from participating in the mechanism. Suppose agent i is in group 1 b, then their expected utility is E[ui] = E h Ba1,a2 xi, θP ( ˆDb) , xi, Eθ p(θ| ˆ Di)[θ] |Di, ci i fi(ci, ε) Ba1,a2 E( xi, θP ( ˆDb) )|Di, ci, xi, Eθ p(θ| ˆ Di)[θ] τα,β8(1 + 3δ)ε3. (28) Ba1,a2(p, q) = a1 a2(p 2pq + q2) a1 a2(|p| + 2|p||q| + |q|2), (29) Since both | xi, θP ( ˆDb) | and | xi, Eθ p(θ| ˆ Di)[θ] | are bounded by xi 2 ˆθ( ˆDb) 2 rτθ, thus by equation 28 and equation 29 agent i s expected utility is non-negative as long as a1 a2(rτθ + 3r2τ 2 θ ) + τα,β8(1 + 3δ)ε3. Proof of Theorem 7. Note that Ba1,a2(p, q) Ba1,a2(p, p) = a1 a2(p p2) a1 + a2(|p| + |p|2), i=1 E[πi] = i=1 E[Ba1,a2 xi, θP ( ˆDb) , xi, Eθ p(θ| ˆ Di)[θ] |Di, ci] n(a1 + a2(rτθ + r2τ 2 θ )). Proof of Corollary 8. For any ξ ( 1 2) and c > 0, we set ε = n ξ, δ = n Ω(1), α = Θ(n 3ξ), β = Θ(n c), a1 = a2(rτθ + 3r2τ 2 θ ) + τα,β8(1 + 3δ)ε3, a2 = O(n 3ξ). Note that by choosing ξ ( 1 2), we ensure that αn = o(1). Recall that by Theorem 3, the private estimator is O knα2 r4log d log 1 ξ ε2 + r4log d log 1 Note that O( nα2 ε2 ) = O(n1 4ξ), O( 1 nε2 ) = O(n2ξ 1). Since for any ξ ( 1 2), we always have 1 4ξ < 2ξ 1, we obtain E θP ( ˆD) θ 2 2 = O(n2ξ 1). To bound the expected budget and truthfulness, we first consider bounding the threshold value τα,β and the term 8(1 + 3δ)ε3. By Lemma 14, τα,β 1 λ log 1 αβ = Θ( 3ξ+c λ log n) = eΘ(1). Combining these, we get τα,β8(1 + 3δ)ε3 = O(n 3ξ). Now we bound the term η for the truthfulness. 
Recall that by Theorem 5, the first term of the truthfulness bound is $a_2\big(\frac{n\alpha^2 r^6 k \log d \log\frac{1}{\delta}}{\varepsilon^2} + \frac{r^4 \log d \log\frac{1}{\delta}}{n\varepsilon^2}\big) = O(n^{-1-\xi})$, and the second term of the bound is $8\tau_{\alpha,\beta}(1 + 3\delta)\varepsilon^3 = O(n^{-3\xi})$; thus $\eta = O(n^{-1-\xi} + n^{-3\xi}) = o(\frac{1}{n})$ since $\xi > \frac{1}{3}$. Then we consider individual rationality. By the choice of $a_1$ and Theorem 6, the mechanism is individually rational for at least a $(1 - O(n^{-3\xi}))$ fraction of agents. Lastly, we consider the total payment made to the agents. By Theorem 7, the total expected budget is $B = O\big(n(a_1 + a_2(r\tau_\theta + r^2\tau_\theta^2))\big) = O\big(n(a_2 + \varepsilon^3)\big) = O(n^{1-3\xi})$.

H Impact Statement

This paper presents work whose goal is to advance the field of machine learning theory. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: We have provided a detailed description of the claims in Section 1, which summarizes the major contributions.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: We do not solve the inherent one-round communication constraint, as mentioned in Section 1 and Section D.
3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: We have introduced all the assumptions in the main paper and included the complete proofs of the theorems in the Appendix.

4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [NA]
Justification: Our paper does not include experiments.
5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [NA]
Justification: Our paper does not include experiments.

6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [NA]
Justification: Our paper does not include experiments.

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [NA]
Justification: Our paper does not include experiments.
8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [NA]
Justification: Our paper does not include experiments.

9. Code of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
Answer: [Yes]
Justification: Our paper conforms with the NeurIPS Code of Ethics.

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [Yes]
Justification: This paper presents work whose goal is to advance the field of machine learning theory. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.
11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: Our paper poses no such risks.

12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [NA]
Justification: Our paper does not use existing assets.
13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [NA]
Justification: Our paper does not release new assets.

14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: Our paper does not involve crowdsourcing nor research with human subjects.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: Our paper does not involve crowdsourcing nor research with human subjects.