# private_regression_via_datadependent_sufficient_statistic_perturbation__14773b25.pdf

Published in Transactions on Machine Learning Research (08/2025)

Private Regression via Data-Dependent Sufficient Statistic Perturbation

Cecilia Ferrando cferrando@cs.umass.edu Manning College of Information and Computer Sciences University of Massachusetts Amherst

Daniel Sheldon sheldon@cs.umass.edu Manning College of Information and Computer Sciences University of Massachusetts Amherst

Reviewed on Open Review: https: // openreview. net/ forum? id= gt Cf DKm9ME

Sufficient statistic perturbation (SSP) is a widely used method for differentially private linear regression. SSP adopts a data-independent approach where privacy noise from a simple distribution is added to sufficient statistics. However, sufficient statistics can often be expressed as linear queries and better approximated by data-dependent mechanisms. In this paper we introduce data-dependent SSP for linear regression based on post-processing privately released marginals, and find that it outperforms state-of-the-art data-independent SSP. We extend this result to logistic regression by developing an approximate objective that can be expressed in terms of sufficient statistics, resulting in a novel and highly competitive SSP approach for logistic regression. We also make a connection to synthetic data for machine learning: for models with sufficient statistics, training on synthetic data corresponds to data-dependent SSP, with the overall utility determined by how well the mechanism answers these linear queries.

1 Introduction

Differential privacy (DP) (Dwork et al., 2006) is an established mathematical framework for protecting user privacy while analyzing sensitive data. A differentially private algorithm injects calibrated random noise into the data analytic process to mask the membership of single records in the data, limiting the information revealed about them in the output of the privatized algorithm. The literature encompasses numerous methods for achieving differential privacy across a wide range of machine learning algorithms, including objective perturbation (Chaudhuri and Monteleoni, 2008; Chaudhuri et al., 2011; Kifer et al., 2012; Jain and Thakurta, 2013), with applications to models trained via empirical risk minimization; gradient perturbation (Bassily et al., 2014; Abadi et al., 2016), which is commonly used in deep learning and models trained via gradient descent; one-posterior sampling (Wang et al., 2015; Dimitrakakis et al., 2017) with applications in private Bayesian inference; and finally, sufficient statistic perturbation (SSP) (Vu and Slavkovic, 2009; Mc Sherry and Mironov, 2009; Dwork and Smith, 2010; Zhang et al., 2016; Foulds et al., 2016; Wang, 2018; Bernstein and Sheldon, 2019; Ferrando et al., 2022), with natural applications in exponential family estimation and linear regression.

SSP adds calibrated random noise to the sufficient statistics of the problem of interest and uses the noisy sufficient statistics downstream to retrieve the target estimate. It is appealing for a number of reasons. Sufficient statistics are by definition an information bottleneck, in that they summarize all the information about the model parameters (Fisher, 1922). For many models, like linear regression and exponential family distributions, their sensitivity is easy to quantify or bound, simplifying the DP analysis. Finally, they can be privatized via simple additive mechanisms, like the Laplace or Gaussian mechanism (Dwork et al., 2014).

Published in Transactions on Machine Learning Research (08/2025)

Existing SSP methods are data-independent, meaning they add noise to the sufficient statistics in a way that does not depend on the underlying data distribution. In a different branch of DP research, recent work has shown that data-dependent mechanisms are the most effective for query answering and synthetic data (Hardt et al., 2012; Gaboardi et al., 2014; Zhang et al., 2017; Aydore et al., 2021; Liu et al., 2021; Mc Kenna et al., 2022).

In this paper, we introduce DD-SSP, a data-dependent SSP method that leverages private linear query answering to release differentially private (DP) sufficient statistics. Its most immediate application is to linear regression, where finite sufficient statistics exist and, as we demonstrate, can be estimated privately through simple transformations of DP pairwise marginals. Furthermore, we extend the application of DD-SSP to models without defined finite sufficient statistics by proposing a novel framework for logistic regression, where approximate sufficient statistics are derived and released in a data-dependent way to train the model by optimizing an approximate loss function.

The proposed framework can be used with virtually any DP query answering algorithm. In this paper, we use AIM (Mc Kenna et al., 2022) as our primary method, but show the overall method is robust to different choices. The main advantage of AIM is its demonstrated accuracy at preserving marginal queries, albeit at the cost of restriction to discrete data (a requirement in AIM). We show experimentally that DD-SSP outperforms the state-of-the-art data-independent SSP method Ada SSP (Wang, 2018) for linear regression, and for logistic regression tasks, DD-SSP achieves better results than the widely used objective perturbation baseline. We also compare DD-SSP with DP-SGD (Abadi et al., 2016), known to achieve excellent performance when hyperparameters are properly fine-tuned. Our results show that the proposed method is competitive with DP-SGD when the privacy cost of hyperparameter tuning is taken into account.

Finally, we elaborate on the significance of our results with respect to the increasingly popular practice of training machine learning models on DP synthetic data. Our results support the observation that for these models training on synthetic data generated by linear-query preserving mechanisms effectively corresponds to a form of data-dependent SSP.

2 Background

2.1 Differential privacy

Differential privacy (DP) (Dwork et al., 2006) has become the preferred standard for preserving user privacy in data analysis, and it has been widely adopted by private and governmental organizations. Differential privacy allows many data computations (including statistical summaries and aggregates, and the training of various predictive models) to be performed while provably meeting privacy constraints. The concept of neighboring datasets is integral to differential privacy, which aims to limit the influence of any one individual on the algorithmic output in order to safeguard personal privacy.

Definition 2.1 (Neighboring datasets). Two datasets D and D are considered neighbors (D D ) if D

can be created by adding or deleting a single record from D.

Based on the concept of neighboring datasets, we can define the sensitivity of a function:

Definition 2.2 (L2 sensitivity). Given a vector-valued function of the data f : D Rp, the L2 sensitivity of f is defined as (f) = max D D f(D) f (D ) 2.

Differential privacy can be achieved via different mechanisms of addition of calibrated random noise, with slightly different definitions. In this paper, we adopt (ϵ, δ)-DP, which can be achieved via the Gaussian Mechanism.

Definition 2.3 ((ϵ, δ)-Differential Privacy). A randomized mechanism M : D R satisfies (ϵ, δ)-differential privacy if for any neighbor datasets D D D, and any subset of possible outputs S R

Pr[M(D) S] exp(ϵ) Pr [M (D ) S] + δ

Published in Transactions on Machine Learning Research (08/2025)

Definition 2.4 (Gaussian mechanism). Let f : D Rp be a vector-valued function of the input data. The Gaussian mechanism is given by

M(D) = f(D) + ν

where ν is random noise drawn from N(0, σ2Ip) with variance σ2 = 2 ln(1.25/δ) (f)2/ϵ2 and (f) is the L2-sensitivity of f. That is, the Gaussian mechanism adds i.i.d. Gaussian noise to each entry of f(D) with scale σ dependent on the privacy parameters.

Proposition 2.5. (Dwork et al., 2014) The Gaussian mechanism in Definition 2.4 satisfies (ϵ, δ)-DP.

Proposition 2.6 (Post-processing property of DP). (Dwork et al., 2014) If M(D) is (ϵ, δ)-DP, then for any deterministic or randomized function g, g(M(D)) satisfies (ϵ, δ)-DP.

2.2 Data-independent sufficient statistic perturbation

Sufficient Statistic Perturbation (SSP) is a widely used differential privacy mechanism that introduces privacy noise at the level of summary statistics. For example, differentially private linear regression is commonly achieved by directly perturbing the sufficient statistics XT X and XT y via an additive noise mechanism, such as the Gaussian mechanism, where X Rn d is the feature matrix and y Rd is the response vector. The noise is added independently to each entry of these statistics based on the privacy parameters and the global sensitivity: the sensitivity of XT X is bounded by X 2, and that of XT y by X Y , where X and Y are bounds on the feature and target vectors respectively (for a detailed derivation of the sensitivity bounds, see Appendix D). Specifically, SSP for linear regression is usually performed by splitting the privacy budget into (ϵ1, δ1) and (ϵ2, δ2) so that ϵ1 + ϵ2 = ϵ and δ1 + δ2 = δ, and computing the following:

\ XT X = XT X + νZ1 where Z1 Rd d is a symmetric matrix with upper-triangular entries sampled from N(0, 1), and the noise scale is ν2 = 2 ln(1.25/δ1) X 4

[ XT y = XT y + υZ2 where Z2 N(0, Id), and the noise scale is υ2 = 2 ln(1.25/δ2) X 2 Y 2

The privatized estimates \ XT X and [ XT y are then used to compute the private estimator ˆθ = \ (XT X) 1[ XT y. This approach is simple and effective, but treats all entries in the sufficient statistics uniformly, adding entry-wise noise components calibrated to the same scale that only depends on the data through measures of global sensitivity. This implies that the noise mechanism does not exploit the structure of the underlying data to determine how the privacy noise is allocated. We call this mechanism data-independent SSP.

Later in this paper, we will instead leverage data-dependent privacy mechanisms for estimating sufficient statistics. Data-dependent mechanisms have gained traction in the field of differentially private data generation as the most effective mechanisms for query answering and synthetic data. The next section introduces the relevant background on differentially private synthetic data to introduce the context and motivation for our proposed data-dependent SSP methods.

2.3 Differentially private synthetic data and query-answering mechanisms

Differentially private synthetic data generation aims to produce surrogate data that preserves key statistical properties of the original data while ensuring privacy (Hardt et al., 2012; Zhang et al., 2017; Xie et al., 2018; Jordon et al., 2019; Mc Kenna et al., 2019; Rosenblatt et al., 2020; Vietri et al., 2020; Liu et al., 2021; Aydore et al., 2021; Mc Kenna et al., 2021b; Vietri et al., 2022; Mc Kenna et al., 2022). Rather than perturbing the data or a downstream model directly, these methods fit a generative model to privatized statistics of the data, then sample from the model to produce surrogate data. This synthetic data can then be used for a wide range of analytical tasks without further privacy cost.

The following definitions allow us to formally elaborate on differentially private synthetic data.

Definition 2.7 (Dataset). A dataset D is defined as a collection of n potentially sensitive records. Each record χ(i) D is a d-dimensional vector (χ(i) 1 , . . . , χ(i) d ).

Published in Transactions on Machine Learning Research (08/2025)

Data Sensitive

DP Marginals

DP Synthetic

DP Sufficient

DP Sufficient

DP noise DP noise

SSP Linear Query Answering/Synthetic Data

Figure 1: Diagram of the SSP data-independent workflow (left) vs the data-dependent linear query answering mechanism for marginal release/synthetic data workflow (right). Quantities indicated in blue follow privacy noise injection and are differentially private.

Definition 2.8 (Domain). The domain of possible values for attribute χ(i) j is Ωj = {1, . . . , mj}. The full domain of possible values for χ(i) is thus Ω= Ω1 Ωd which has size Q

We will later talk about numerical encodings of attributes (Section 3.1).

Definition 2.9 (Marginals). Let r [d] be a subset of features, Ωr = Q

j r Ωj, mr = |Ωr|, and χr = (χj)j r. The marginal on r is a vector µr Rmr indexed by domain elements t Ωr such that each entry is µr[t] = P

χ D 1 [χr = t] (i.e., counts).

With marginal queries, one record can only contribute a count of one to a single cell of the output vector. For this reason, the L2 sensitivity of a marginal query Mr is 1, regardless of the attributes in r. This facilitates the differential privacy analysis for marginal queries.

Definition 2.10 (Workload). A marginal workload W is defined as a set of marginal queries r1, . . . , r K where rk [d].

The goal of workload-based synthetic data generation models to minimize the approximation error on workload queries.

A widely used framework in this space is the select-measure-reconstruct paradigm (Hay et al., 2009; Li et al., 2010; Ding et al., 2011; Xiao et al., 2012; Li and Miklau, 2012; Xu et al., 2013; Yaroslavtsev et al., 2013; Li et al., 2014; Qardaji et al., 2014; Zhang et al., 2014; Li et al., 2015; Mc Kenna et al., 2021a), in which a workload of marginal queries is selected, privately measured, and then used to reconstruct either the full data distribution or a synthetic dataset. Many algorithms in this class are data-dependent: the choice of which queries to measure is based on the data itself, leading to better utility for a fixed privacy budget.

One example of select-measure-reconstruct method, and the one we use primarily in this paper, is AIM (Mc Kenna et al., 2022). AIM (Adaptive and Iterative Mechanism) is a state-of-the-art data-dependent method that selects marginal queries to measure via a scoring function that assesses the expected utility of the measurements based on the data. This data-awareness allows AIM to focus the privacy budget where it matters most, depending on the data distribution. Note that AIM uses zero-concentrated differential privacy (z CDP) (Bun and Steinke, 2016), an alternative privacy definition; the Gaussian mechanism as defined in Section 2.1 satisfies 1 2σ2 -z CDP (Bun and Steinke, 2016). In our experiments, we work with (ϵ, δ)-DP and the conversion to z CDP is handled internally in AIM.

In our experiments, we use AIM directly to release marginals without sampling synthetic data, treating it as a general-purpose, data-dependent linear query release mechanism.

Published in Transactions on Machine Learning Research (08/2025)

In this paper, we propose methods to privately estimate sufficient statistics in a data-dependent way. Specifically, our methods leverage privately released marginals computed by a data-dependent linear query answering algorithm to estimate sufficient statistics. We call this set of methods DD-SSP. The main advantage of DD-SSP is that it is data-dependent. Our approach stems from the insight that, while data-independent noise addition is a simple and established approach to SSP, many sufficient statistics can be expressed as linear queries, creating an opportunity to improve utility by using data-dependent query-answering DP mechanisms, which often achieve higher accuracy than simple additive noise mechanisms (Mc Kenna et al., 2022). In fact, using private synthetic data to train certain models is already a form of SSP. Figure 1 compares a standard SSP workflow and our proposed DD-SSP workflows. As seen in the figure, the pipeline of releasing synthetic data and then training a model via sufficient statistics can be viewed as a specific way of privatizing sufficient statistics for model training.

For linear regression, the application is straightforward: the problem has finite sufficient statistics and we demonstrate that two-way marginal queries are sufficient for their estimation. For other models, finite sufficient statistics are not available, but a polynomial approximation of the loss functions provides approximate sufficient statistics. This is the case for logistic regression, where finite sufficient statistics do not exist, but a Chebyshev polynomial approximation based on Huggins et al. (2017) allows us to propose an approximate version of the learning objective based on approximate sufficient statistics that can be expressed as linear queries, again retrievable via two-way marginal tables.

We use the synthetic data mechanism AIM as our private query answering algorithm and modify its implementation to output marginals directly, without the need to execute the synthetic data generation step. Depending on the input workload, AIM will privately release marginals that preserve certain linear queries more accurately. We find that a two-way marginal workload is sufficient for estimating or approximating the sufficient statistics for both linear and logistic regression. The proposed method is amenable to generalization beyond these classes of problems and can be potentially extended to others by i) identifying or approximating their sufficient statistics and ii) customizing the workload passed on as input in AIM accordingly. Since our proposed methods are based on post-processing DP workload query answers, the differential privacy analysis is straightforward (Definition 2.6).

3.1 Numerical encoding

We assume discrete (or discretized) input data, which is a common format for tabular data and is required by AIM and other marginal-based approaches. However, for machine learning, each record χ must be mapped to a numerical vector z = (x, y) where x Rp is a feature vector and y R is a target. While the details of this encoding are often overlooked, they are important here for two reasons. First, they are needed to tightly bound x , which is used in sensitivity calculations of a number of DP ML methods, with tighter bounds leading to higher utility. Second, the encoding is a key part of recovering sufficient statistics from marginals.

Let ψj(χj) Rmj be the one-hot encoding of χj, i.e., the vector with entries ψj,s(χj) = 1[χj = s] for each s Ωj. We consider any numerical encoding of the form χj 7 Ajψj(χj) where Aj Rpj mj is a fixed linear transformation applied to the one-hot vector. This covers two special cases of interest. The first is the scalar encoding with Aj = v T j for a vector vj Rmj that specifies the numerical value for each s Ωj. In this case the mapping simplifies to χj 7 vj(χj). The second special case of interest is when Aj = Ij is the mj mj identity matrix, so the mapping simplifies to χj 7 Aj(χj) to give the one-hot encoding itself. Another common variation is a reduced one-hot encoding where Aj = Ij R(mj 1) mj is equal to Ij with one row dropped to avoid redundant information in the one-hot encoding. We include a simple example of the encoding strategy in Appendix B.

The full encoded record is z = (z[j])d j=1 where z[j] = Ajψj(χj) Rpj is the encoding of the jth attribute and these column vectors are concatenated vertically. A single entry of z is selected as the target variable y leaving a feature vector x of dimension p := (Pd j=1 pj) 1. Later, we will also use indexing expressions like ( )[j] and ( )[j,k] to refer to blocks of a vector or matrix corresponding to the encoding of the jth and kth attributes.

Published in Transactions on Machine Learning Research (08/2025)

Let z(i) = (x(i), y(i)) denote the encoding of record χ(i) and let X Rn p be the matrix with ith row equal to (x(i))T and y Rn be the vector with ith entry equal to y(i). Many DP ML methods require bounds on the magnitude of the encoded data. Let X = supx X x and Y = supy Y |y| be bounds provided by the user where X Rp and Y R are guaranteed to contain all possible encoded feature vectors x and target values y, respectively. For example, a typical bound is X = x+ where x+ k sup |xk| bounds the magnitude of a single feature. If xk is the scalar encoding of χj we can take x+ k = maxs Ωj |vj(s)|. The following proposition describes how to tightly bound a feature vector that combines scalar features and one-hot encoded features.

Proposition 3.1. Suppose x = (u, w) where u Ra satisfies u U and w Rb contains the one-hot encodings (either reduced or not reduced) of c attributes. Then X := p

U 2 + c is an upper bound on x .

Proof. x 2 = u 2 + w 2 U 2 + c where w 2 c because w is the concatenation of c vectors each with at most a single entry of 1 and all other entries equal to 0.

Suppose u consists of scalar features and U is obtained by bounding each one separately as described above. This bound is tighter than the naive one of p

U 2 + b that would be obtained by bounding each entry of the one-hot vectors separately.

3.2 Linear regression

The goal of linear regression is to minimize the sum of squared differences between the observed values y and predicted values Xθ in a linear model with θ Rp. The ordinary least squares (OLS) estimator is obtained by minimizing the squared error loss function y Xθ 2. Mathematically, the OLS estimator is given by ˆθ = (XT X) 1XT y. In this context, the sufficient statistics are T(X, y) = {XT X, XT y}. In DD-SSP, we approximate T(X, y) using linear queries. Specifically, we show that each entry of XT X and XT y can be obtained from pairwise marginals. The sufficient statistics we will consider all have the form of empirical second moments of the encoded attributes.

3.2.1 Sufficient Statistics from Pairwise Marginals

Let Z = [X, y] Rn (p+1). The matrix ZT Z has blocks that contain our sufficient statistics of interest:

ZT Z = XT X XT y y T X y T y

However, we will see that we can also construct ZT Z directly from marginals.

Proposition 3.2. Let (ZT Z)[j,k] be the block of ZT Z with rows corresponding to the jth attribute encoding z[j] = Ajψj(χj) and columns corresponding to the kth attribute encoding z[k] = Akψk(χk). Then

(ZT Z)[j,k] = Aj µj,k AT k

where µj,k Rmj mk is the (j, k)-marginal shaped as a matrix with (s, t) entry µj,k[s, t] = Pn i=1 1[χ(i) j =

s, χ(i) k = t]. Note that according to this definition µj,j = diag(µj).

This shows that we can reconstruct the sufficient statistic matrix ZT Z directly from the set of all singleattribute and pairwise marginals. Note that single-attribute marginals µj can be constructed from any µj,k with k = j.

Proof. The sufficient statistic matrix can be written as ZT Z = Pn i=1 z(i)(z(i))T . Indexing by blocks gives

(ZT Z)[j,k] =

i=1 z(i) [j] (z(i) [k])T

Published in Transactions on Machine Learning Research (08/2025)

i=1 Ajψj(χ(i) j )ψk(χ(i) k )T AT k

i=1 ψj(χ(i) j )ψk(χ(i) k )T AT k

= Aj µj,k AT k ,

In the last line, we used that ψj(χ(i) j )ψk(χ(i) k )T is a matrix with (s, t) entry equal to 1[χ(i) j = s, χ(i) k = t], so summing over all i gives the matrix µj,k .

Algorithm 1 outlines how to retrieve approximate sufficient statistics XT X and ] XT y from marginals privately

estimated by AIM. This implies we can solve DP linear regression by i) retrieving XT X and ] XT y as outlined

in Algorithm 1, and ii) finding ˆθDP = XT X 1] XT y.

Algorithm 1 DD-SSP

1: M DPQuery(D, ϵ, δ) is the collection of privately computed pairwise marginal tables µj,k for all attribute pairs (j, k), computed by DP query release algorithm of choice DPQuery.

2: ( ] ZT Z)[j,k] Aj µj,k AT k for all attribute pairs (j, k) (see Proposition 3.2)

3: Extract XT X and ] XT y from ] ZT Z using the block structure of Equation (1)

Proposition 3.3. DD-SSP is (ϵ, δ)-DP.

The proof follows directly from the (ϵ, δ)-DP properties of the marginal-releasing algorithm (in our case, AIM), and the fact that all subsequent steps are post-processing of a DP result (Definition 2.6).

3.3 Logistic regression

Logistic regression predicts the probability a binary label y { 1, +1} takes value +1 as p = 1/(1+exp( x θ)), where θ Rp is a coefficient vector and x θ is the dot-product. The log-likelihood function is

i=1 ϕ(x(i) θy(i))

where ϕ(s) := log(1 + e s). Optimizing this log-likelihood is a convex optimization problem solvable numerically via standard optimizers. The log-likelihood does not have finite sufficient statistics. However, Huggins et al. (2017) offers a polynomial strategy to obtain approximate sufficient statistics for generalized linear models (GLMs), including logistic regression. Kulkarni et al. (2021) used a similar polynomial approximation for private Bayesian GLMs.

We propose a novel DP logistic regression method that combines two ideas: i) we use a Chebyshev approximation of the logistic regression log-likelihood based on Huggins et al. (2017), which allows us to write the objective in terms of approximate sufficient statistics, and then ii) use AIM privately released marginals to estimate the approximate sufficient statistics without accessing the sensitive data. This gives us the option to directly optimize an approximate log-likelihood based on privatized linear queries computed by AIM. The choice of the input workload for AIM depends on the characterization of the approximate log-likelihood. Based on our derivation below, we find that a suitable workload input for logistic regression is all pairwise marginals.

Huggins et al. (2017) propose to approximate ℓ(θ) by using an degree-M polynomial approximation of the function ϕ:

ϕ(s) ϕM(s) :=

m=0 bm (M)sm

Published in Transactions on Machine Learning Research (08/2025)

6 4 2 0 2 4 6 s

Figure 2: Degree 2 Chebyshev approximation of the logit function ϕ, where ϕ(s) := log(1 + e s). The inner products y(i)x(i), θ tend to be concentrated in the range [ 4, 4] across many datasets (Huggins et al., 2017). We conservatively choose range [ 6, 6] based on the inner DP products of the chosen datasets.

where b(M) j are constants. There are different choices for the orthogonal polynomial basis, and as in Huggins et al. (2017), we focus on Chebyshev polynomials, which provide uniform quality guarantees over a finite interval [ R, R] for positive R (Figure 2). We can then write

m=0 b(M) m (x(i) θy(i))m

Based on the distribution of noisy inner products obtained through objective perturbation (Figure 4, Appendix C), we choose to work with a degree-2 Chebyshev approximation over the range [ 6, 6], leading to a precise approximation over a range that encompasses most of the observed inner product values.

Proposition 3.4. The logistic regression log-likelihood is approximated by second order Chebyshev polynomial ℓ(θ) nb(2) 0 + b(2) 1 θ] XT y + b(2) 2 (θT XT Xθ), where b(2) 0 , b(2) 1 , and b(2) 2 are constants, and XT X and ] XT y can be retrieved from pairwise marginals as in Proposition 3.2. The maximizer of this approximate log-likelihood

is available in closed form as ˆθDP-cheb = b(2) 1 2 b(2) 2 ( XT X) 1] XT y

The proof is provided in Appendix C. This allows us to define a logistic regression objective where XT X and ] XT y are obtained via Algorithm 1, which is DP by post-processing (Proposition 3.3). Further, the optimal solution to this objective is a simple rescaling of the DD-SSP linear regression solution using these features and labels. It can be shown that the Chebyshev coefficients for ϕ(s) satisfy b(2) 1 > 0 and b(2) 2 < 0, so the

rescaling factor b(2) 1 2 b(2) 2 is positive.

3.4 Connections to synthetic data

Based on the insight that sufficient statistics (or their approximations) can be expressed as linear queries, the proposed framework highlights a novel connection between differentially private synthetic data generation and SSP (Figure 1). Query answering algorithms are often used in synthetic data generation procedures. Many such procedures follow the select-measure-reconstruct approach to synthetic data, where linear queries are privately estimated from noisy measurements, and integrated into a model from which synthetic data can be sampled. This process ensures that the output synthetic data supports the selected linear queries. The synthetic data can then be used downstream to compute any statistic of interest, or in the example of linear regression, we could fit the model by training on the synthetic data. This workflow differs from DD-SSP only in that it consolidates the noisy measurements into a model and samples synthetic data from it.

Published in Transactions on Machine Learning Research (08/2025)

By training select-measure-reconstruct synthetic generative models to preserve the appropriate workload of queries, synthetic data can therefore be tuned for specific machine learning tasks; for example, based on the findings in 3.2 and 3.3, we expect synthetic data that preserves pairwise marginals to perform well on linear and logistic regression, as it implicitly computes the relevant sufficient statistics (or approximate sufficient statistics). These findings provide a new perspective that enhances the understanding and utility of private synthetic data, especially as related to synthetic data for machine learning tasks.

3.5 Generality of the approach

The proposed approach is amenable to generalization on two fronts: i) the flexibility with respect to the choice of private linear query answering method, and ii) the applicability to a variety of models.

Query-answering methods are often used in synthetic data generation algorithms. Virtually any private query answering algorithm, such as Vietri et al. (2020); Mc Kenna et al. (2021b); Aydore et al. (2021), could be adopted in the context of DD-SSP. To give a concrete example, we run supplementary experiments using MST (Mc Kenna et al., 2021b) instead of AIM within our framework (Appendix H). Private-PGM can also be replaced with different methods for model estimation from the private measurements. In Appendix H, we show results for experiments where AIM is combined with the mixture inference step from RAP (Aydore et al., 2021).

The DD-SSP approach can be extended to any model where sufficient statistics can be linked to marginal queries either directly (as we demonstrate for linear regression in 3.2) or via a polynomial approximation (as in logistic regression, 3.3). More detailed extensions of the method are discussed in Section 5.

3.6 Baselines

We compare DD-SSP to synthetic data method AIM-Synth, where private measurements are used to estimate the underlying distribution, and surrogate data is sampled from it. We train AIM-Synth with an input workload of all pairwise marginals to match the workload utilized for DD-SSP. Since marginal-based synthetic data is designed to preserve the same linear queries that are sufficient to solve linear regression, or approximately sufficient for logistic regression, the expectation is that DD-SSP will closely match the performance of AIM-Synth. The difference between these approaches is whether linear queries for sufficient statistics are computed directly from marginals estimated by the mechanism, or computed from synthetic data (Figure 1).

We also compare both methods against established DP baselines. Since our methods are based on privately reconstructing sufficient statistics, for DP linear regression sufficient statistic perturbation (SSP) is the natural baseline choice. We choose Ada SSP (Wang, 2018) for its competitive performance. Ada SSP uses limited data-adaptivity to add a ridge penalty based on an estimated bound on the eigenvalues of XT X, but then adds independent noise to each entry of the sufficient statistics, unlike fully data-adaptive query-answering mechanisms. The Ada SSP algorithm is detailed in Appendix D. For logistic regression, objective perturbation (Obj Pert) is a widely adopted solution originally proposed by Chaudhuri et al. (2011), and further refined by Kifer et al. (2012) where it is extended to (ϵ, δ)-DP, with more general applicability and improved guarantees. Algorithm and details are provided in Appendix E. Both Ada SSP and Obj Pert determine how much noise to add based on X , which is the upper bound to the L2-norm of any row of X (Section 3.1). For example, in Ada SSP, XT X is noise-perturbed proportionally to X 2. From Proposition 3.1, we can set X 2 = U 2 + c where U is a bound on the numerically-encoded features and c is the number of one-hot-encoded attributes to obtain a tight sensitivity bound for these baselines.

In addition to our proposed methods, we compare against DP-SGD (Abadi et al., 2016), a widely adopted algorithm for differentially private training. DP-SGD is highly sensitive to the choice of hyperparameters and requires extensive tuning to achieve optimal performance. To accurately account for the privacy loss incurred during hyperparameter tuning, advanced privacy accounting techniques must be applied (Ponomareva et al., 2023). To illustrate the variability in DP-SGD s performance, our plots depict a shaded region representing the range between two versions of DP-SGD: an optimistic baseline that disregards the privacy cost of hyperparameter tuning, artificially inflating performance, and a more realistic version that incorporates this cost using advanced composition (Steinke, 2022). The effective performance of DP-SGD in practical scenarios,

Published in Transactions on Machine Learning Research (08/2025)

including with more advanced privacy accounting (Papernot and Steinke, 2021), is expected to lie within this range.

3.7 Limitations

When choosing AIM as the mechanism for DP marginals, we are limited to working with discrete data, which is a requirement in AIM itself. Thus, our comparisons to other regression methods are scoped to discrete numerical data. Future work may consider DD-SSP with other mechanisms that support continuous data without discretization. For the logistic regression approximation, we use Chebyshev second-order polynomials; other approximation functions and/or degrees of precision could be evaluated. As shown in section G, DD-SSP demonstrates overall gains in regression accuracy compared to baseline methods, however this improvement comes at the cost of increased computational time tied to the complexity involved in privately releasing private marginals for large domains with high utility. This cost can be significantly reduced by replacing AIM with faster marginal-releasing methods, which we demonstrate in Appendix H. More detail on the computational runtime is discussed in Appendix G, including a few accessible strategies to mitigate the computational costs in real-world applications.

4 Experiments

In our experiments, we evaluate the effectiveness of DD-SSP on both linear and logistic regression tasks.1

For linear regression, our main goal is to demonstrate the effectiveness of data-dependence for SSP. We therefore compare primarily against Ada SSP, a standard SSP baseline. For reference, we also compare to the public solution and to DP-SGD (Abadi et al., 2016), a widely used non-SSP method, as a representative of this class of methods; several other non-SSP approaches have been recently proposed for differentially private linear regression, including model selection via approximate Tukey depth (Amin et al., 2022), and iterative private gradient descent (Brown et al., 2024). For SSP, Tang et al. (2024) recently integrated gradient boosting and clipping with Ada SSP with the primary goal of mitigating the performance impacts due to data clipping when tight bounds on the range of the data are not known in advance; this is largely orthogonal to our goal of improving SSP via data-dependence in selecting and measuring queries. For logistic regression, we benchmark DD-SSP against Obj Pert and DP-SGD. Both are widely adopted methods for logistic regression under differential privacy, with the caveat that DP-SGD has strong dependence on hyperparameter tuning as we later further discuss. We also assess the performance of AIM-generated synthetic data (AIM-Synth), observing that it closely mirrors DD-SSP across tasks, reinforcing our hypothesis that data-dependent estimation of sufficient statistics underlies the strong performance of marginal-based synthetic data in machine learning settings.

We compare the Mean Squared Error (MSE) of DP query-based methods DD-SSP and AIM-Synth against the DP baseline Ada SSP and the public baseline for ϵ {0.05, 0.1, 0.5, 1.0, 2.0}, with a fixed δ = 10 5. Figure 3 shows that DD-SSP and AIM-Synth have nearly identical performance and both improve significantly upon Ada SSP on all datasets except ACSIncome, where performance is similar. For logistic regression, DD-SSP closely matches AIM-Synth and surpasses Obj Pert in low-ϵ regimes while being competitive at higher ϵ values. Assessing DD-SSP vs. DP-SGD is more challenging due to its reliance on hyperparameter fine-tuning. We represent this variability with a shaded region spanning the case where DP-SGD accounts for the privacy cost of hyperparameter tuning via advanced composition (Steinke, 2022), and the case where this cost is ignored. Our results show that DD-SSP is competitive with DP-SGD under advanced composition, with performance varying across datasets. Additional details on experimental setup, datasets, and implementation can be found in Appendix G.

Based on these results, we observe the following:

DD-SSP is a competitive option for DP linear and logistic regression, surpassing data-independent SSP and Obj Pert baselines in specific cases, and being competitive with DP-SGD overall.

1All experiment code is available at https://github.com/ceciliaferrando/DD-SSP.

Published in Transactions on Machine Learning Research (08/2025)

10 1 100 2 10 1

ACSincome-LIN

Public DD-SSP AIM-Synth DP-SGD Ada SSP

(a) Linear regression MSE results.

10 1 100 7.4 10 1

9 10 1 ACSincome

7 10 1 ACSmobility

ACSemployment

10 1 100 5 10 1

ACSPublic Coverage

Public DD-SSP AIM-Synth DP-SGD Obj Pert

(b) Logistic regression AUC results.

Figure 3: Performance comparison across linear and logistic regression. Standard error bars are computed over 5 trials. For DP-SGD, the shaded region spans the range between an optimistic baseline ignoring hyperparameter tuning cost, and a realistic baseline using advanced composition to account for this cost (this line is highlighted by star markers). The advanced composition curve starts further to the right as the lowest ϵ achievable by DP-SGD in these cases is 0.005.

Published in Transactions on Machine Learning Research (08/2025)

The performance of AIM-Synth suggests that estimating problem-specific data-dependent sufficient statistics explains the suitability of AIM synthetic data for machine learning tasks. This implies DD-SSP is effective whenever pairwise marginals are available. The approximate DD-SSP method for logistic regression constitutes a novel DP algorithm as an alternative to privatized ERM procedures.

5 Future work

We imagine extensions to two sets of models. Beyond linear regression, all exponential family distributions, including graphical models like Naive Bayes, have finite sufficient statistics, and for such models we can devise similar DD-SSP solutions, or tailor synthetic data, by identifying a workload that supports the estimation of their sufficient statistics. Additionally, future work can focus on developing approximate loss functions with finite sufficient statistics for a broader class of other models, including generalized linear models (GLMs), where our results for logistic regression can be extended to obtain approximate sufficient statistics from k-way marginals by making a k-degree polynomial approximation to the GLM mapping function following the reasoning of Huggins et al. (2017). This will open the door to novel DD-SSP methods that directly minimize the approximate loss functions, and improve utility by adding privacy noise in a data-dependent way. Examples of such extensions are outlined in Appendix F. Additionally, methods targeting encoded workload Aj µj,k AT k instead of µj,k can be explored and combined with advanced data-independent mechanisms like the matrix mechanism, potentially leading to a new class of encoding-aware SSP methods.

6 Conclusions

We introduce methods for data-dependent sufficient statistic perturbation (DD-SSP). Our methods use privately released marginal tables to solve linear and logistic regression via sufficient statistics. We find that DD-SSP performs better than data-independent SSP on linear regression and objective perturbation for logistic regression, and is competitive with DP-SGD, known to achieve excellent results under fine-tuned hyperparameter setting.

Notably, the approximate DD-SSP logistic regression algorithm is the first DP logistic regression method that allows analysts to solve logistic regression via a SSP algorithm, directly minimizing the approximate loss function. Additionally, we find that the performance of DD-SSP is almost indistinguishable from that of AIM synthetic data: this suggests that with the appropriate workload, training these machine learning models on query-based DP synthetic data corresponds to data-dependent SSP.

M. Abadi, A. Chu, I. Goodfellow, H. B. Mc Mahan, I. Mironov, K. Talwar, and L. Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pages 308 318, 2016.

K. Amin, M. Joseph, M. Ribero, and S. Vassilvitskii. Easy differentially private linear regression. ar Xiv preprint ar Xiv:2208.07353, 2022.

S. Aydore, W. Brown, M. Kearns, K. Kenthapadi, L. Melis, A. Roth, and A. A. Siva. Differentially private query release through adaptive projection. In International Conference on Machine Learning, pages 457 467. PMLR, 2021.

R. Bassily, A. Smith, and A. Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In 2014 IEEE 55th annual symposium on foundations of computer science, pages 464 473. IEEE, 2014.

B. Becker and R. Kohavi. Adult. UCI Machine Learning Repository, 1996. DOI: https://doi.org/10.24432/C5XW20.

Published in Transactions on Machine Learning Research (08/2025)

G. Bernstein and D. R. Sheldon. Differentially private bayesian linear regression. Advances in Neural Information Processing Systems, 32, 2019.

G. Brown, K. Dvijotham, G. Evans, D. Liu, A. Smith, and A. Thakurta. Private gradient descent for linear regression: Tighter error bounds and instance-specific uncertainty estimation. ar Xiv preprint ar Xiv:2402.13531, 2024.

M. Bun and T. Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of Cryptography Conference, pages 635 658. Springer, 2016.

K. Chaudhuri and C. Monteleoni. Privacy-preserving logistic regression. Advances in neural information processing systems, 21, 2008.

K. Chaudhuri, C. Monteleoni, and A. D. Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(3), 2011.

C. Dimitrakakis, B. Nelson, Z. Zhang, A. Mitrokotsa, and B. I. Rubinstein. Differential privacy for bayesian inference through posterior sampling. Journal of machine learning research, 18(11):1 39, 2017.

B. Ding, M. Winslett, J. Han, and Z. Li. Differentially private data cubes: optimizing noise sources and consistency. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, pages 217 228, 2011.

F. Ding, M. Hardt, J. Miller, and L. Schmidt. Retiring adult: New datasets for fair machine learning. Advances in neural information processing systems, 34:6478 6490, 2021.

C. Dwork and A. Smith. Differential privacy for statistics: What we know and what we want to learn. Journal of Privacy and Confidentiality, 1(2), 2010.

C. Dwork, F. Mc Sherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7, 2006. Proceedings 3, pages 265 284. Springer, 2006.

C. Dwork, A. Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3 4):211 407, 2014.

C. Ferrando, S. Wang, and D. Sheldon. Parametric bootstrap for differentially private confidence intervals. In International Conference on Artificial Intelligence and Statistics, pages 1598 1618. PMLR, 2022.

R. A. Fisher. On the mathematical foundations of theoretical statistics. Philosophical transactions of the Royal Society of London. Series A, containing papers of a mathematical or physical character, 222(594-604): 309 368, 1922.

J. Foulds, J. Geumlek, M. Welling, and K. Chaudhuri. On the theory and practice of privacy-preserving bayesian data analysis. ar Xiv preprint ar Xiv:1603.07294, 2016.

M. Gaboardi, E. J. G. Arias, J. Hsu, A. Roth, and Z. S. Wu. Dual query: Practical private query release for high dimensional data. In International Conference on Machine Learning, pages 1170 1178. PMLR, 2014.

R. M. Gower and F. Bach. Properties and examples of convexity and smoothness, February 2019. URL https://gowerrobert.github.io/pdf/M2_statistique_optimisation/exe_convexity_ smoothness_solution.pdf.

L. Grégoire, C. Task, I. Slavitt, G. Nicolas, K. Bhagat, D. Streat, and G. Howarth. Sdnist: Benchmark data and evaluation tools for data sythesizers, Dec. 2021.

M. Hardt, K. Ligett, and F. Mc Sherry. A simple and practical algorithm for differentially private data release. Advances in neural information processing systems, 25, 2012.

M. Hay, V. Rastogi, G. Miklau, and D. Suciu. Boosting the accuracy of differentially-private histograms through consistency. ar Xiv preprint ar Xiv:0904.0942, 2009.

Published in Transactions on Machine Learning Research (08/2025)

J. Huggins, R. P. Adams, and T. Broderick. Pass-glm: polynomial approximate sufficient statistics for scalable bayesian glm inference. Advances in Neural Information Processing Systems, 30, 2017.

P. Jain and A. Thakurta. Differentially private learning with kernels. In International conference on machine learning, pages 118 126. PMLR, 2013.

J. Jordon, J. Yoon, and M. Van Der Schaar. Pate-gan: Generating synthetic data with differential privacy guarantees. In International conference on learning representations, 2019.

D. Kifer, A. Smith, and A. Thakurta. Private convex empirical risk minimization and high-dimensional regression. In Conference on Learning Theory, pages 25 1. JMLR Workshop and Conference Proceedings, 2012.

T. Kulkarni, J. Jälkö, A. Koskela, S. Kaski, and A. Honkela. Differentially private bayesian inference for generalized linear models. In International Conference on Machine Learning, pages 5838 5849. PMLR, 2021.

C. Li and G. Miklau. An adaptive mechanism for accurate query answering under differential privacy. ar Xiv preprint ar Xiv:1202.3807, 2012.

C. Li, M. Hay, V. Rastogi, G. Miklau, and A. Mc Gregor. Optimizing linear counting queries under differential privacy. In Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 123 134, 2010.

C. Li, M. Hay, G. Miklau, and Y. Wang. A data-and workload-aware algorithm for range queries under differential privacy. ar Xiv preprint ar Xiv:1410.0265, 2014.

C. Li, G. Miklau, M. Hay, A. Mc Gregor, and V. Rastogi. The matrix mechanism: optimizing linear counting queries under differential privacy. The VLDB journal, 24:757 781, 2015.

T. Liu, G. Vietri, and S. Z. Wu. Iterative methods for private synthetic data: Unifying framework and new methods. Advances in Neural Information Processing Systems, 34:690 702, 2021.

R. Mc Kenna, D. Sheldon, and G. Miklau. Graphical-model based estimation and inference for differential privacy. In International Conference on Machine Learning, pages 4435 4444. PMLR, 2019.

R. Mc Kenna, G. Miklau, M. Hay, and A. Machanavajjhala. Hdmm: Optimizing error of high-dimensional statistical queries under differential privacy. ar Xiv preprint ar Xiv:2106.12118, 2021a.

R. Mc Kenna, G. Miklau, and D. Sheldon. Winning the nist contest: A scalable and general approach to differentially private synthetic data. ar Xiv preprint ar Xiv:2108.04978, 2021b.

R. Mc Kenna, B. Mullins, D. Sheldon, and G. Miklau. Aim: An adaptive and iterative mechanism for differentially private synthetic data. ar Xiv preprint ar Xiv:2201.12677, 2022.

F. Mc Sherry and I. Mironov. Differentially private recommender systems: Building privacy into the netflix prize contenders. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 627 636, 2009.

N. Papernot and T. Steinke. Hyperparameter tuning with renyi differential privacy. ar Xiv preprint ar Xiv:2110.03620, 2021.

N. Ponomareva, H. Hazimeh, A. Kurakin, Z. Xu, C. Denison, H. B. Mc Mahan, S. Vassilvitskii, S. Chien, and A. G. Thakurta. How to dp-fy ml: A practical guide to machine learning with differential privacy. Journal of Artificial Intelligence Research, 77:1113 1201, 2023.

W. Qardaji, W. Yang, and N. Li. Priview: practical differentially private release of marginal contingency tables. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data, pages 1435 1446, 2014.

Published in Transactions on Machine Learning Research (08/2025)

D. Ridgeway, M. F. Theofanos, T. W. Manley, and C. Task. Challenge design and lessons learned from the 2018 differential privacy challenges. NIST technical note, 2151, 2021.

L. Rosenblatt, X. Liu, S. Pouyanfar, E. de Leon, A. Desai, and J. Allen. Differentially private synthetic data: Applied evaluations and enhancements. ar Xiv preprint ar Xiv:2011.05537, 2020.

O. Sheffet. Differentially private ordinary least squares. In International Conference on Machine Learning, pages 3105 3114. PMLR, 2017.

T. Steinke. Composition of differential privacy & privacy amplification by subsampling. ar Xiv preprint ar Xiv:2210.00597, 2022.

S. Tang, S. Aydore, M. Kearns, S. Rho, A. Roth, Y. Wang, Y.-X. Wang, and Z. S. Wu. Improved differentially private regression via gradient boosting. In 2024 IEEE Conference on Secure and Trustworthy Machine Learning (Sa TML), pages 33 56. IEEE, 2024.

G. Vietri, G. Tian, M. Bun, T. Steinke, and S. Wu. New oracle-efficient algorithms for private synthetic data release. In International Conference on Machine Learning, pages 9765 9774. PMLR, 2020.

G. Vietri, C. Archambeau, S. Aydore, W. Brown, M. Kearns, A. Roth, A. Siva, S. Tang, and Z. S. Wu. Private synthetic data for multitask learning and marginal queries. ar Xiv preprint ar Xiv:2209.07400, 2022.

D. Vu and A. Slavkovic. Differential privacy for clinical trial data: Preliminary evaluations. In 2009 IEEE International Conference on Data Mining Workshops, pages 138 143. IEEE, 2009.

Y.-X. Wang. Revisiting differentially private linear regression: optimal and adaptive prediction & estimation in unbounded domain. ar Xiv preprint ar Xiv:1803.02596, 2018.

Y.-X. Wang, S. Fienberg, and A. Smola. Privacy for free: Posterior sampling and stochastic gradient monte carlo. In International Conference on Machine Learning, pages 2493 2502. PMLR, 2015.

Y. Xiao, L. Xiong, L. Fan, and S. Goryczka. Dpcube: Differentially private histogram release through multidimensional partitioning. ar Xiv preprint ar Xiv:1202.5358, 2012.

L. Xie, K. Lin, S. Wang, F. Wang, and J. Zhou. Differentially private generative adversarial network. ar Xiv preprint ar Xiv:1802.06739, 2018.

J. Xu, Z. Zhang, X. Xiao, Y. Yang, G. Yu, and M. Winslett. Differentially private histogram publication. The VLDB journal, 22:797 822, 2013.

G. Yaroslavtsev, G. Cormode, C. M. Procopiuc, and D. Srivastava. Accurate and efficient private release of datacubes and contingency tables. In 2013 IEEE 29th International Conference on Data Engineering (ICDE), pages 745 756. IEEE, 2013.

J. Zhang, G. Cormode, C. M. Procopiuc, D. Srivastava, and X. Xiao. Privbayes: Private data release via bayesian networks. ACM Transactions on Database Systems (TODS), 42(4):1 41, 2017.

X. Zhang, R. Chen, J. Xu, X. Meng, and Y. Xie. Towards accurate histogram publication under differential privacy. In Proceedings of the 2014 SIAM international conference on data mining, pages 587 595. SIAM, 2014.

Z. Zhang, B. Rubinstein, and C. Dimitrakakis. On the differential privacy of bayesian inference. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30,1, 2016.

Published in Transactions on Machine Learning Research (08/2025)

For AIM, we follow the original algorithm in Mc Kenna et al. (2022). AIM uses an intelligent initialization step to estimate one-way marginals. This results in a model where all one-way marginals are preserved well, and higher-order marginals can be estimated under an independence assumption. Additionally, AIM uses a carefully chosen subset of the marginal queries and leverages the observation that lower-dimensional marginals exhibit a better signal-to-noise ratio than marginals with many attributes and low counts, and at the same time they can be used to estimate higher-dimensional marginal queries in the workload. Finally, the quality score function for selecting marginals to measure ensures that the selection is budget-adaptive , i.e. it measures larger dimensional marginals only when the available privacy budget is large enough.

Algorithm 2 AIM (Mc Kenna et al., 2022)

1: Input: Dataset D, workload W, privacy parameter ρ 2: Output: Synthetic Dataset D 3: Hyper-Parameters: MAX-SIZE=80MB, T = 16d, α = 0.9 4: σ0 = p

T/(2 α ρ) 5: ρused = 0

6: t = 0 7: Initialize ˆpt (using Algorithm 3) 8: wr = P

s W cs | r s |

9: σt+1 σ0 ϵt+1 p

8(1 α)ρ/T 10: while ρused < ρ do 11: t = t + 1 12: ρused ρused + 1

8ϵ2 t + 1 2σ2 t 13: Ct = rt W+ | Junction Tree-SIZE(r1, . . . , rt)) ρused

ρ MAX-SIZE 14: select rt Ct using the exponential mechanism with:

qr(D) = wr Mr(D) Mr(ˆpt 1) 1 p

15: measure marginal on rt: yt = Mrt(D) + N(0, σ2 t I)

16: estimate data distribution using Private-PGM:

ˆpt = arg min p S

1 σi Mri(p) yi 2 2

17: anneal ϵt+1 and σt+1 using Algorithm 4 18: end while 19: generate synthetic data D from ˆpt using Private-PGM 20: return D

Published in Transactions on Machine Learning Research (08/2025)

Algorithm 3 Initialize pt (Subroutine of Algorithm 2) (Mc Kenna et al., 2022)

1: for r {r W+ | |r| = 1} do 2: t t + 1 3: σt σ0 4: rt r

5: yt Mr(D) + N(0, σ2 t I) 6: ρused ρused + 1 2σ2 t 7: end for 8: ˆpt argminp S Pt i=1 1 σi Mri(p) yi 2 2

Algorithm 4 Budget Annealing (Subroutine of Algorithm 2) (Mc Kenna et al., 2022)

1: if Mrt(ˆpt) Mrt(ˆpt 1) 1 q

2 π σt nrt then

2: ϵt+1 2 ϵt 3: σt+1 σt/2

4: else 5: ϵt+1 ϵt 6: σt+1 σt 7: end if 8: if (ρ ρused) 2 1 2σ2 t+1 + 1

8ϵ2 t+1 then

8 (1 α) (ρ ρused)

1 2 α (ρ ρused) 11: end if

B Encoding example

The following example demonstrates the application of the encoding strategy in 3.1. The mapping is:

χj 7 Ajψj(χj)

For a concrete example, suppose there are 5 levels (i.e., mj = 5) and the feature value is χj = 3, then the one-hot encoding vector is

For the numerical encoding we use Aj = v T j . For example, suppose v T j = [1, 2, 4, 8, 16], then in our example we have

χj 7 v T j ψj(χj) = [1, 2, 4, 8, 16]

It is easy to see that the numerical value is always equal to vj[χj], i.e., value in vector vj at index χj. In other words, the vector vj enumerates the numerical values for each level. For the one-hot encoding case, we use Aj = Ij, the identity matrix. In our example, this gives

χj 7 Imjψj(χj) = ψj(χj) =

Published in Transactions on Machine Learning Research (08/2025)

It is clear that this gives the one-hot encoding. The reduced one-hot encoding is similar.

C Logistic Regression log-likelihood approximation

Proof. The log-likelihood for logistic regression can be expressed as

i=1 ϕ(x(i) θy(i))

where ϕ(s) := log(1 + e s). Based on Huggins et al. (2017), we can approximate the logistic regression log-likelihood with a Chebyshev polynomial approximation of degree M:

ϕ(s) ϕM(s) :=

m=0 bm (M)sm

where b(M) j are constants. Then,

m=0 b(M) m (x(i) θy(i))m

If we choose M = 2,

i=1 b(2) 0 + b(2) 1 (x(i) θy(i)) + b(2) 2 (x(i) θy(i))2

The quadratic term is

i=1 (x(i) θy(i))2 =

i=1 y(i)2 d X

j,k=1 x(i) j x(i) k θjθk =

i=1 y(i)2x(i) j x(i) k

Therefore we can rewrite the approximate log-likelihood as

ℓ(θ) nb(2) 0 + b(2) 1

i=1 x(i) θy(i) + b(2) 2

i=1 y(i)2x(i) j x(i) k

nb(2) 0 + b(2) 1 y T Xθ + b(2) 2

i=1 x(i) j x(i) k

nb(2) 0 + b(2) 1 y T Xθ + b(2) 2 θT XT Xθ

where the simplification in the second line follows from the fact that we work with y(i) {1, 1} and y(i)2 = 1 for any i.

After moving constants, an equivalent objective for maximization w.r.t. θ is:

ℓ(θ) b(2) 2 b(2) 1 θT XT X θ + θT XT y + const.

For comparison, the linear-regression log-likelihood under a Gaussian noise model is:

ℓlinear(θ) = 1

2 θT XT X θ + θT XT y + const.

Published in Transactions on Machine Learning Research (08/2025)

It can be shown that the Chebyshev coefficients for ϕ(s) satisfy b(2) 1 > 0 and b(2) 2 < 0, so both functions are concave and quadratic, with terms having the same signs, and the only difference being the relative weighting of the quadratic and linear terms. A simple derivation (set the gradient of ℓto zero and solve for θ assuming XT X is invertible) shows that ℓ(θ) has the following minimizer, which is a positive rescaling of the OLS solution:

ˆθcheb = b(2) 1 2 b(2) 2 (XT X) 1XT y.

The sufficient statistics XT X and XT y can be derived following the same proposed strategies as in linear

regression (Section 3.2), obtaining marginal query-based estimates XT X and ] XT y. We can then express the closed-form solution of the approximate log-likelihood with approximate sufficient statistics as:

ˆθDP-cheb = b(2) 1 2 b(2) 2 ( XT X) 1] XT y.

Figure 4 shows the distribution of the inner DP products for the datasets used in the experiments, supporting the choice of [ 6, 6] for the range of the Chebyshev approximation.

Figure 4: Distribution of the inner DP products y(i)x(i), ˆθDP for the datasets of interest. ˆθDP is computed using the objective perturbation method with ϵ = 1 and δ = 10 5. The dashed vertical lines represenent the [-6, 6] bounds chosen for the Chebyshev approximation.

Published in Transactions on Machine Learning Research (08/2025)

D Linear regression baseline

Algorithm 5 outlines the Ada SSP method for linear regression (Wang, 2018).

Algorithm 5 Ada SSP (Wang, 2018)

1: Input: Data X, y. Privacy budget: ϵ, δ. Bounds: X , Y , ρ (0, 1) (0.05 in the paper) 2: 1. Calculate the minimum eigenvalue λmin XT X .

3: 2. Privately release λmin = max λmin +

ϵ/3 X 2Z log(6/δ)

ϵ/3 X 2, 0 , where Z N(0, 1).

4: 3. Set λ = max 0,

d log(6/δ) log(2d2/ρ) X 2

5: 4. Privately release \ XT X = XT X +

log(6/δ) X 2

ϵ/3 Z for Z Rd d is a symmetric matrix and every element from the upper triangular matrix is sampled from N(0, 1).

6: 5. Privately release [ XT y = XT y +

log(6/δ) X Y

ϵ/3 Z for Z N (0, Id).

7: Output: θ = \ XT X + λI 1 [ XT y.

To reason about the sensitivity of XT X, consider two neighboring datasets X Rn d and X R(n+1) d

differing by one data entry v X, where v is a d 1 vector. Then,

XT X = sup X X f(X ) f(X) F

Since X and X only differ by one row (v), then f(X ) f(X) = vv T (Sheffet, 2017).

So the sensitivity is maximum over v of vec(vv T ) = vv T F . We have

2 XT X = sup v X vv T 2 F

j=1 (vivj)2

= sup v X v 4

where X is the greatest possible norm of a vector in the domain X. Therefore,

XT X = X 2.

The sensitivity of XT y can be similarly derived. Given neighboring datasets X Rn d, y Rn, and X R(n+1) d, y Rn+1, where v X Rd is the new row, and w Y R is the new value in y . Then,

f (X , y ) f(X, y) = X T y XT y = wv

Since X = supx X x and Y = supy Y |y|, we have

XT y = sup (X,y) (X ,y ) f (X , y ) f(X, y)

= sup w Y,v X |w| v = Y X

Published in Transactions on Machine Learning Research (08/2025)

E Logistic regression baseline

Our DP logistic regression baseline is based on the generalized objective perturbation algorithm in Kifer et al. (2012) (Algorithm 6). In this section, to match the notation in Kifer et al. (2012), ℓ(θ; z) = log(1+exp( x θy)) is the loss for a single datum and ˆL(θ; D) = 1

n Pn i=1 ℓ(θ; z(i)) is the average loss over the dataset.

Algorithm 6 Generalized Objective Perturbation Mechanism (Obj Pert) (Kifer et al., 2012)

Require: dataset D = z(1), . . . , z(n) , where z(i) = (x(i), y(i)), privacy parameters ϵ and δ (δ = 0 for ϵ-differential privacy), bound X on the L2 norm of any x entry, convex regularizer r, a convex domain F Rd, convex loss function ˆL(θ; D) = 1

n Pn i=1 ℓ θ; z(i) , with continuous Hessian, ℓ(θ; z) X

(for all z D and θ F), and the eigenvalues of 2ℓ(θ; z) bounded by X 2

4 (for all z and for all θ F).

2: Sample b Rd from ν2(b; ϵ, δ, X ) = N 0, X 2(8 log 2

δ +4ϵ) ϵ2 Id

3: ˆθDP arg minθ F ˆL(θ; D) + 1

2n θ 2 + b T θ

The algorithm requires the following bounds for the gradient and Hessian of ℓ:

λmax( 2ℓ(θ; z)) X 2

To reason about the sensitivity bounds, let ϕ(s) = log(1 + es). Then we can write ℓ(θ; z) = ϕ( x θy). Following Gower and Bach (2019), it is straightforward to derive that

(1 + es)2 1

and clear that both quantities are non-negative. Then the gradient of ℓis:

ℓ(θ; z) = θϕ( x θy) = ϕ ( x θy) yx

The norm is bounded as

ℓ(θ; z) = |ϕ ( x θy)| |y| x X

where the inequality holds since |ϕ (s)| 1 for all s and |y| = 1.

By differentiating the gradient again and using the fact that y2 = 1, we can derive the Hessian as:

2ℓ(θ; z) = ϕ ( x θy)xx T

The maximum eigenvalue is

λmax( 2ℓ(θ; z)) = ϕ ( x θy)λmax(xx T )

= ϕ ( x θy) x 2

In the second line, we used the fact that λmax(xx T ) = x 2. To see this, note that xx T is rank one and (xx T )x = x 2x, therefore x is an eigenvector with eigenvalue x 2 and this is the largest eigenvalue. In the last line we used that ϕ (s) 1

4 for all s and that x X .

Published in Transactions on Machine Learning Research (08/2025)

F Examples of Extensions to GLMs and Higher-Order Approximations

F.1 Degree-k Polynomials from k-way Marginals

In this Section, we show that in general a loss function approximated by a degree-k polynomial can be computed from the set of all k-way marginals.

Consider a loss function ℓ(θ) = Pn i=1 ϕ(x(i), θ) where ϕ(x, θ) is a degree-three polynomial in x. By breaking the polynomial into monomials, the loss function can be broken into a sum of terms, a generic one of which is Pn i=1 C x(i) a x(i) b x(i) c for some constant C and indices a, b, c into the encoded feature vector. Recall that encoded features arise from linear transformations of one-hot encoded attributes via the mappings χj 7 Ajψj(χj). Suppose features xa, xb, xc come from attributes χj, χk, χℓ, respectively. Then there are constant vectors u Rmj, v Rmk, w Rmℓ(each a row of Aj, Ak, Aℓ, respectively) such that

xa = u ψj(χj)

xb = v ψk(χk)

xc = w ψℓ(χℓ)

We can now rewrite the term from the loss function (ignoring the constant) as

i=1 x(i) a x(i) b x(i) c =

u ψj(χ(i) j ) v ψk(χ(i) k ) w ψℓ(χ(i) ℓ)

r=1 ur I[χ(i) j = r] mk X

s=1 vs I[χ(i) k = s] mℓ X

t=1 wt I[χ(i) ℓ = t]

t=1 urvswt I[χ(i) j = r]I[χ(i) k = s]I[χ(i) ℓ = t]

i=1 I[χ(i) j = r, χ(i) k = s, χ(i) ℓ = t]

t=1 urvswt µj,k,ℓ rst

= u v w , µj,k,ℓ

In the second-to-last line, we introduced the notation µj,k,ℓ Rmj mk mℓfor the array containing the three-dimensional data marginal, with entries

µj,k,ℓ rst =

i=1 I[χ(i) j = r, χ(i) k = s, χ(i) ℓ = t].

In the last line, we rewrote the sum as an inner product between the constant tensor u v w, derived from the feature encoding, and the marginal µj,k,ℓ , to emphasize the general form of the operation. More generally, a loss function term that depends on at most k features, like a degree-k monomial, can be rewritten as a tensor inner product between a constant tensor and a k-way marginal. Thus, if the overall loss function is a degree-k polynomial, it can be computed from the set of all k-way marginals.

F.2 Poisson Regression (Degree-2 Chebyshev Approximation, Log-Link)

Consider a Poisson GLM with canonical log-link function:

y(i) Poisson(µ(i)), η(i) = log(µ(i)) = x(i) θ.

Published in Transactions on Machine Learning Research (08/2025)

The standard Poisson regression log-likelihood is:

h y(i)(x(i) θ) ex(i) θ log(y(i)!) i .

Ignoring the constant term log(y(i)!) (which does not depend on θ), we have:

h y(i)(x(i) θ) ex(i) θi .

If we approximate the exponential function es using a second-degree Chebyshev polynomial approximation around a chosen interval (e.g. [ 6, 6]) as

es c(2) 0 + c(2) 1 s + c(2) 2 s2,

where c(2) 0 , c(2) 1 , c(2) 2 are Chebyshev polynomial coefficients determined numerically, then, substituting this approximation into the Poisson log-likelihood gives:

h y(i)(x(i) θ) c(2) 0 + c(2) 1 (x(i) θ) + c(2) 2 (x(i) θ)2 i .

Expanding explicitly and simplifying, we obtain:

ℓ(θ) = (XT y)T θ c(2) 0 n c(2) 1 1T Xθ c(2) 2 θT (XT X)θ,

where XT X, XT y, and 1T X can be computed from all pairwise marginals (see Section F.1).

F.3 Logistic Regression (Degree-4 Chebyshev Approximation)

The logistic regression log-likelihood (with labels y(i) { 1, 1}) is:

i=1 log 1 1 + e x(i) θy(i)

i=1 log 1 + e x(i) θy(i) .

We approximate the logistic function ϕ(s) = log(1 + e s) using a degree-4 Chebyshev polynomial:

m=0 b(4) m sm,

where the constants b(4) m are determined numerically by fitting Chebyshev polynomials over a chosen interval (e.g. [ 6, 6]). Substituting this approximation into the log-likelihood, we obtain:

m=0 bm x(i) θy(i) m .

Expanding explicitly, this yields:

ℓ(θ) b0n + b1θT (XT y) + b2θT (XT X)θ + b3

i=1 y(i)(θT x(i))3 + b4

i=1 (θT x(i))4.

The required sufficient statistics are XT y, XT X, Pn i=1 y(i)(x(i)) 3, and Pn i=1(x(i)) 4. These can be computed from 4-way marginals (see Section F.1).

Published in Transactions on Machine Learning Research (08/2025)

Note that measuring higher-order marginals under differential privacy poses both computational and statistical challenges. From a computational standpoint, the domain size of a marginal grows exponentially with its order, making both private measurement and post-processing increasingly expensive. Statistically, higher-order marginals tend to suffer from lower signal-to-noise ratios, as the added noise (due to higher sensitivity and larger output space) can overwhelm the utility of the measurement. As noted in the AIM paper (Mc Kenna et al., 2022), lower-order marginals often provide a better trade-off, and can still be leveraged to estimate higher-order interactions via structured inference methods, such as graphical models.

G Experiment details

G.1 Datasets and Preprocessing

We use the following datasets:2

Adult (Becker and Kohavi, 1996): The target variable is num-education (number of education years) for linear regression and income>50K for logistic regression.

Fire (Ridgeway et al., 2021): The target variable is Priority (of the call).

Taxi (Grégoire et al., 2021): The target variable is totalamount (total fare amount).

ACS Datasets (Ding et al., 2021): Data is queried for California (2018). Includes binary classification tasks for PINCP (income above $50k), MIG (mobility), ESR (employment), and PUBCOV (public coverage). ACSincome is also used for linear regression with the target variable PINCP (income) discretized into 20 bins.

More detail on the datasets is included in Table 1. Data is shuffled and split into 1,000 test points and up to 50,000 training points. Non-numerical features are one-hot encoded, dropping the first level to avoid multi-collinearity, and numerical features are rescaled to [ 1, 1] for noise allocation (see Section 3.1 for more detail on data encoding).

G.2 Methodology

AIM training: AIM is trained with a model size of 200MB, a maximum of 1,000 iterations, and a workload of all pairwise marginals. For Ada SSP, sensitivity is calibrated as described in Sections 3.1 and 3.6.

DP-SGD fine tuning and training: DP-SGD s hyperparameters are fine-tuned by running a gridsearch for the best parameter. The search space spans the following values:

Batch size: [n, 1024, 256]

Gradient clipping norm: [0.01, 0.1, 0.2]

Number or epochs: [1, 10, 20]

Learning rate: [0.001, 0.01, 0.1, 1.0]

Advanced composition is used to account for hyperparameter tuning costs as per Theorem 22 in Steinke (2022). The optimistic baseline ignores this cost entirely.

2The ACS data is sourced from https://github.com/socialfoundations/folktables. All other datasets are sourced from https://github.com/ryan112358/hd-datasets.

Published in Transactions on Machine Learning Research (08/2025)

Table 1: Dataset information

Dataset Size # Attributes

Attributes Target

Adult 48,842 15 [ age , workclass , fnlwgt , education , marital-status , occupation , relationship , race , sex , capitalgain , capital-loss , hours-per-week , native-country , income>50K , education-num ]

income>50K (logistic), educationnum (linear)

ACSIncome 195,665 9 [ AGEP , COW , SCHL , MAR , RELP , WKHP , SEX , RAC1P , PINCP ]

Fire 305,119 15 [ ALS Unit , Battalion , Call Final Disposition , Call Type , Call Type Group , City , Final Priority , Fire Prevention District , Neighborhooods - Analysis Boundaries , Original Priority , Station Area , Supervisor District , Unit Type , Zipcode of Incident , Priority ]

Taxi 1,048,575 11 [ Vendor ID , passengercount , tripdistance , Ratecode ID , PULocation ID , DOLocation ID , paymenttype , fareamount , tipamount , tollsamount , totalamount ]

totalamount

ACSmobility 29,358 20 [ AGEP , SCHL , MAR , SEX , DIS , CIT , MIL , ANC , WKHP , NATIVITY , RELP , DEAR , DEYE , DREM , RAC1P , GCL , COW , ESR , JWMNP , PINCP ]

ACSEmployment 378,817 17 [ AGEP , SCHL , MAR , RELP , DIS , ESP , CIT , MIG , MIL , ANC , NATIVITY , DEAR , DEYE , DREM , SEX , RAC1P , ESR ]

ACSPublic Coverage 138,550 19 [ AGEP , SCHL , MAR , SEX , DIS , ESP , CIT , MIG , MIL , ANC , NATIVITY , DEAR , DEYE , DREM , PINCP , ESR , FER , RAC1P , PUBCOV ]

Published in Transactions on Machine Learning Research (08/2025)

G.3 Computational runtime

All experiments were conducted on an internal cluster equipped with Xeon Gold 6240 CPUs @ 2.60GHz, 192GB RAM, and 240GB local SSD storage. The runtime of our method is influenced by several factors, including dataset size, domain size, and the privacy parameter ϵ. Since our method relies on private marginals released by AIM, its runtime is inherently tied to AIM s computational demands, which are significantly higher than those of the DP baselines.

For clarity, we focus the detailed runtime analysis on the Adult dataset as it is representative of general trends observed across all datasets. On this dataset, AIM runtime increases significantly with ϵ. For linear regression, AIM requires approximately 8 minutes at ϵ = 0.05, scaling up to 18 hours at ϵ = 2.0. In comparison, the DP baseline Ada SSP completes the same experiment in approximately 5 seconds regardless of ϵ. Similar trends are observed for logistic regression, with AIM runtime increasing from 12 minutes to 21 hours as ϵ grows, while the corresponding DP baseline Obj Pert completes these experiments in approximately 2 seconds.

For DP-SGD, the runtime of the hyperparameter search phase varies across datasets, tasks and ϵ between around 1 hour and 17 hours; once the best hyperparameters have been determined, the runtime of DP-SGD, taking the case of linear regression on the Adult dataset as example, varies between 2 minutes for ϵ = 0.05 and 10 minutes for ϵ = 2.0.

G.4 Runtime mitigation strategies

While AIM s runtime is substantial, it is a direct consequence of the complexity involved in accurately computing private marginals for large domains and high utility. A few factors mitigate the runtime of the proposed methods: i) AIM is much faster for low ϵ values; ii) AIM model size can be reduced to 100MB for a significant runtime cut, without significantly sacrificing the accuracy of the method. In the context of this paper, we choose to prioritize accuracy, which is consistent with the motivation of synthetic data, where computation is spent up front to release a data set that can be used downstreams in many ways; iii) other faster query answering mechanisms can be used in place of AIM, such as MST (see 3.5), which runs in approximately 18 minutes for a full experiment across all ϵ values on Adult. Various ways to improve the accuracy vs running-time trade-offs for use in specific settings can be explored, which is beyond the scope of this paper. In terms of experimentation (see Section 4), for computational viability we study the variability of our results across 5 trials.

Published in Transactions on Machine Learning Research (08/2025)

H Additional experiments

DD-SSP can accommodate different DP marginal query releasing methods. For our main results we use AIM (see Appendix A), which uses Private-PGM (Mc Kenna et al., 2019) for the generate step. In this section, we briefly demonstrate the use of the DD-SSP framework with alternative methods. In particular, we consider two alternative settings: i) replacing AIM with MST (Mc Kenna et al., 2021b); ii) replacing the Private-PGM step in AIM with Mixture Inference3, a close approximation of the relaxed projection methods in (Aydore et al., 2021) and (Liu et al., 2021). Results are shown in Figure 5 and Figure 6 respectively.

From the ablation studies in (Mc Kenna et al., 2022), we expect MST to perform well at low ϵ values (high privacy) and AIM to outperform it at higher ϵ values. MST performs a domain compression operation, and we hypothesize this benefits some cases (e.g. low ϵ), but hurts in others. Based on the same ablation studies, we expect Private-PGM to yield lower workload errors with respect to Mixture Inference, resulting in better overall metric scores for the tasks of interest. Consistently with the findings in (Mc Kenna et al., 2022), we find that AIM using Private-PGM has the most reliable performance overall.

ACSincome-LIN

1 ACSincome

1 ACSmobility

ACSemployment

ACSPublic Coverage

Public DD-SSP AIM-Synth DD-SSP (MST) MST-Synth Ada SSP Obj Pert

Figure 5: MST (Mc Kenna et al., 2021b) vs. AIM (Mc Kenna et al., 2022) as a query answering algorithm. Top: linear regression. Bottom: logistic regression. Standard error bars are computed over 5 trials.

3https://github.com/ryan112358/private-pgm/blob/master/src/mbi/mixture_inference.py

Published in Transactions on Machine Learning Research (08/2025)

ACSincome-LIN

1 ACSincome

1 ACSmobility

ACSemployment

ACSPublic Coverage

Public DD-SSP AIM-Synth DD-SSP (Mixt) Mixt-Synth Ada SSP Obj Pert

Figure 6: AIM using Mixture Inference (Aydore et al., 2021; Liu et al., 2021) vs. Private-PGM (Mc Kenna et al., 2019) for model estimation and synthetic data generation. Top: linear regression. Bottom: logistic regression. Standard error bars computer over 5 trials.