# Learning Parametric Distributions from Samples and Preferences

Marc Jourdan¹, Gizem Yüce¹, Nicolas Flammarion¹

**Abstract.** Recent advances in language modeling have underscored the role of preference feedback in enhancing model performance. This paper investigates the conditions under which preference feedback improves parameter estimation in classes of continuous parametric distributions. In our framework, the learner observes pairs of samples from an unknown distribution along with their relative preferences depending on the same unknown parameter. We show that preference-based M-estimators achieve a better asymptotic variance than sample-only M-estimators, further improved by deterministic preferences. Leveraging the hard constraints revealed by deterministic preferences, we propose an estimator achieving an estimation error scaling of $O(1/n)$, a significant improvement over the $\Theta(1/\sqrt{n})$ rate attainable with samples alone. Next, we establish a lower bound that matches this accelerated rate, up to dimension and problem-dependent constants. While the assumptions underpinning our analysis are restrictive, they are satisfied by notable cases such as Gaussian or Laplace distributions for preferences based on the log-probability reward.

¹School of Computer and Communication Sciences, EPFL, Lausanne, Vaud, Switzerland. Correspondence to: Marc Jourdan.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

## 1. Introduction

Recent progress in language modeling has showcased the effectiveness of preference feedback for fine-tuning (Ziegler et al., 2019; Ouyang et al., 2022; Bai et al., 2022; Touvron et al., 2023; Dubey et al., 2024). Preference data indicating relative quality between outcomes consistently outperforms approaches using positive examples only, like supervised fine-tuning (Ivison et al., 2024). This empirical success suggests that preference feedback introduces new, complementary information beyond the observed data. Understanding how and why preferences provide this advantage requires connecting the preference model to the data-generating process (Ge et al., 2024).

To understand the role of preference feedback, we focus on a simpler yet illustrative problem: parameter estimation for parametric distributions and preferences. Specifically, the learner observes pairs of samples from an unknown distribution, along with preferences informed by the same parameter. For instance, preferences based on log-probabilities naturally link the preference and probability models, though other formulations are possible (Huang et al., 2024). For continuous distributions, we uncover a significant statistical learning gap between preference-based and sample-only estimators.

This paper primarily investigates this gap, taking the sample-only maximum likelihood estimator (MLE), optimal among unbiased estimators, as a baseline. The well-established theory of M-estimators (Van der Vaart, 2000) suggests that preference-based M-estimators improve asymptotic variance under certain conditions. Yet this improvement is modest: when samples are of similar quality, preference feedback approaches a fair coin toss, providing minimal additional information. While reducing asymptotic variance is encouraging, it does not fully explain the substantial performance gains observed empirically in large-scale language models.
For deterministic preferences, we prove a more striking result: preference-based estimators achieve a statistically significant acceleration in parameter estimation. Specifically, we show that the estimation error scales as $O(1/n)$ instead of the $\Theta(1/\sqrt{n})$ rate achieved by sample-only estimators. This acceleration is supported by a matching lower bound, up to dimension and problem-dependent constants.

While this acceleration might sound surprising, the $\Theta(1/n)$ rate can already be observed in a special case of sample-only parameter estimation. For instance, consider estimating the location parameter $\theta$ of a uniform distribution on $[\theta, \theta+1]$ based solely on samples (Wainwright, 2019). The minimax rate for the estimation error is $\Theta(1/n)$, and the optimal estimator achieving this accelerated rate is the minimum of the observations, whose density is positive at $\theta$. This improved rate arises from the accumulation of random variables having a positive density at a specific point through a minimum (or maximum) operator, in contrast to the slower aggregation inherent to averaging. Similarly, for deterministic preferences with log-likelihood rewards, we observe the true ordering between likelihoods. As it enforces hard constraints through a minimum operator on the admissible parameters, our preference-based estimator achieves accelerated convergence.

To illustrate this acceleration, consider the standard normal distribution with preferences based on log-probabilities. Let $n \in \mathbb{N}$ and $[n] := \{1, \dots, n\}$. For each $i \in [n]$, observe samples $(X_i, Y_i) \sim \mathcal{N}(0_2, I_2)$ along with their log-likelihood deterministic preference $Z_i := \mathrm{sign}((Y_i - X_i)S_i)$, where $S_i := (X_i + Y_i)/2$ is their average. The triplet $(X_i, Y_i, Z_i)$ imposes a hard constraint based on $S_i$ on the location of candidate estimators $\theta$ that are consistent with this log-likelihood deterministic preference. Specifically, they satisfy $\theta \le S_i$ if $S_i > 0$, and $\theta \ge S_i$ otherwise. The set of feasible parameters satisfying all constraints is thus $[\max_{i \in [n],\, S_i < 0} S_i,\ \min_{i \in [n],\, S_i > 0} S_i]$. Since the density of $\mathcal{N}(0, 1/2)$ is positive near zero, the length of this interval decreases as $O(1/n)$ with high probability.
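A minimal simulation of this illustration (our own sketch, not code from the paper) confirms the $O(1/n)$ shrinkage of the feasible interval:

```python
import numpy as np

rng = np.random.default_rng(0)

def feasible_interval_length(n: int) -> float:
    """Length of the feasible set for theta* = 0 under n deterministic
    log-likelihood preferences on pairs drawn from N(0_2, I_2)."""
    X, Y = rng.standard_normal(n), rng.standard_normal(n)
    S = (X + Y) / 2.0  # S_i ~ N(0, 1/2); constraint: theta <= S_i if S_i > 0, else theta >= S_i
    return S[S > 0].min() - S[S < 0].max()

for n in [100, 1000, 10000]:
    lengths = [feasible_interval_length(n) for _ in range(200)]
    print(f"n={n:>5}  mean length {np.mean(lengths):.5f}  n * mean length {np.mean(lengths) * n:.2f}")
```

The product $n \times$ (interval length) stabilizes around a constant, matching the claimed $O(1/n)$ rate.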
### 1.1. Contributions

For continuous parametric probability distributions, we study the statistical learning gap between preference-based estimators and sample-only estimators. First, we show that preference-based M-estimators achieve a better asymptotic variance than sample-only M-estimators. The variance is further improved for deterministic preferences. Second, we introduce an estimator satisfying the constraints revealed by the deterministic preferences, and prove an accelerated estimation error rate of $O(1/n)$. This constitutes a significant improvement over the $\Theta(1/\sqrt{n})$ rate achieved by M-estimators. Third, we provide a lower bound of $\Omega(1/n)$, matching our upper bound up to problem-specific constants. Our results are derived under general assumptions on the distributions and the preferences. While restrictive, they are satisfied by notable cases such as Gaussian or Laplace distributions for preferences based on log-probabilities.

### 1.2. Related Work

**Learning parametric distributions.** Parametric estimation is a central approach in statistics, reducing inference about a distribution to the estimation of a finite-dimensional parameter (Lehmann & Casella, 2006; Wasserman, 2013). The maximum likelihood estimator (MLE) is the most fundamental method in this setting. Its asymptotic properties are well studied (Cramér, 1946; Ibragimov & Has'minskii, 2013; Van der Vaart, 2000), while non-asymptotic guarantees have been established in Birgé & Massart (1993) and Spokoiny (2012). Lower bounds in parametric estimation rely on techniques such as Le Cam's two-point method (Le Cam, 1973), Fano's method (Fano, 1952), and Assouad's method (Assouad, 1983), and provide fundamental limits on estimation accuracy (Tsybakov, 2009).

**Learning parametric value/preference functions.** In the tabular setting, learning from pairwise comparisons aligns with the ranking problem. The performance of the MLE under the Bradley-Terry model (Bradley & Terry, 1952) and its extensions has been extensively studied (Hunter, 2004; Negahban et al., 2012; Hajek et al., 2014; Rajkumar & Agarwal, 2014; Shah et al., 2016; Shah & Wainwright, 2018; Mao et al., 2018). The continuous setting, where generalization beyond observed preferences is required, has received less attention, except for linear utility functions (Zhu et al., 2023; Ge et al., 2024; Yao et al., 2025). Beyond analyzing the sample complexity of reward learning with the MLE under the Bradley-Terry noise model, Zhu et al. (2023) study the performance of policies trained on the learned reward model. They show that while the MLE may fail, a pessimistic variant can yield a policy with improved performance. Relaxing the noise assumption, Ge et al. (2024) show that utility parameters remain unidentifiable without strong modeling assumptions, even with noise-free query responses. However, they demonstrate that, in the active learning setting, utility can still be learned, even in the absence of noise. Their results highlight that the sampling distribution of observations must be aligned with the utility function to achieve improved sample complexity. Yao et al. (2025) leverage sparsity in the preference model and establish sharp estimation rates depending on the sparsity level. Finally, related estimation problems have also been studied in the contexts of dueling bandits and reinforcement learning (Faury et al., 2020; Saha et al., 2023).

**Fine-tuning with preference data.** Large language models often go through a post-training phase focusing mainly on learning from preference feedback (Lambert, 2024), to improve capabilities such as summarization, instruction following, and reasoning. The standard approach, reinforcement learning from human feedback (RLHF) (Ziegler et al., 2019), trains a reward model to align with human preferences and then optimizes the policy using reinforcement learning, typically with PPO (Schulman et al., 2017). RLHF follows three main steps: supervised fine-tuning, reward model training, and policy optimization. Another line of work has explored alternatives to PPO to simplify training. One such method, direct preference optimization (DPO) (Rafailov et al., 2023), reformulates the reward function to learn a policy directly from preference data, avoiding an explicit reward model. Other preference optimization objectives have also been proposed (Meng et al., 2024). Finally, while preference data has traditionally been gathered through human annotators, the learning paradigm has recently expanded to include self-play, where the model critiques its own generations (Dubey et al., 2024; Huang et al., 2024).

## 2. Problem Statement

**Parameter estimation.** Let $\Theta \subseteq \mathbb{R}^k$ be a set of parameters for a class of continuous probability distributions $\mathcal{F}$ over $\mathcal{X} \subseteq \mathbb{R}^d$.
Let $B_\Theta := \max_{\theta \in \Theta} \|\theta\|$ be the bound on $\Theta$ for the norm $\|\cdot\|$ specific to $\mathcal{F}$. Let $\mathcal{S}^{k-1}$ be the unit sphere for this norm. Let $p^{\otimes 2}_\theta$ be the distribution of two independent observations of $p_\theta$. Let $\theta^\star$ be an unknown parameter to estimate. Our samples are drawn from $p^{\otimes 2}_{\theta^\star}$, i.e., $(X, Y) \sim p^{\otimes 2}_{\theta^\star}$. We use two archetypal examples satisfying our assumptions. First, the class $\mathcal{F}_{\mathcal{N},\Sigma}$ of multivariate Gaussian distributions with known covariance $\Sigma$, where $\Theta$ are the natural parameters with norm $\|\cdot\|_\Sigma$, where $\|x\|_\Sigma := \sqrt{x^{\mathsf{T}} \Sigma x}$. Second, the class $\mathcal{F}_{\mathrm{Lap},b}$ of Laplace distributions with known scale $b$, where $\Theta$ are the mean parameters with norm $|\cdot|$.

**Preference feedback.** Let $\ell_\theta : \mathcal{X}^2 \to \mathbb{R}$ be a parametric preference function. Given a parametric reward function $r_\theta$, a reward-based preference function is defined as $\ell_\theta(x, y) = r_\theta(x) - r_\theta(y)$. As a concrete example for our derivations, we consider preferences based on the log-probability reward $r_\theta = \log p_\theta$. Given observations $(x, y)$, the true preference $z$ of $x$ over $y$ is governed by $\mathrm{sign}(\ell_{\theta^\star}(x, y)) \in \{-1, 0, 1\}$. In many settings, however, the observed preference $Z$ can be stochastic due to noise or randomness in human feedback. Conditioned on $(X, Y) \sim p^{\otimes 2}_{\theta^\star}$, we denote the p.d.f. of the law of the preference $Z$ by $h(\ell_{\theta^\star}(X, Y), \cdot)$. On $\mathcal{X}^2 \times \{-1, 0, 1\}$, the p.d.f. of the law of $(X, Y, Z)$ is denoted by $q_{\theta,h}(x, y, z) := p^{\otimes 2}_\theta(x, y)\, h(\ell_\theta(x, y), z)$. Under deterministic feedback, the true preferences are observed:

$$h_{\mathrm{det}}(\ell, z) := \mathbb{1}\left(z = \mathrm{sign}(\ell)\right). \tag{1}$$

Under stochastic feedback, noisy preferences $z \in \{-1, 1\}$ are observed based on the sigmoid link:

$$h_{\mathrm{sto}}(\ell, z) := \sigma(z\ell) \quad \text{with} \quad \sigma(x) := (1 + e^{-x})^{-1}. \tag{2}$$

**Informative preferences.** A natural question is when the preference $Z \sim h(\ell_{\theta^\star}(X, Y), \cdot)$ helps to estimate $\theta^\star$ compared to using samples $(X, Y) \sim p^{\otimes 2}_{\theta^\star}$ only. Intuitively, given observations with null preference gradient, parameters close to $\theta^\star$ could have similar preferences; therefore, those samples are not sufficient to discriminate between them. For that, let $\mathcal{G}_0(\theta^\star) = \{(x, y) \in \mathcal{X}^2 \mid |\ell_{\theta^\star}(x, y)| > 0\}$ (resp. $\mathcal{G}_1(\theta^\star) = \{(x, y) \in \mathcal{G}_0(\theta^\star) \mid \|\nabla_\theta \ell_{\theta^\star}(x, y)\| > 0\}$) be the set of pairs with non-zero preference (resp. gradient) function. For observations in $\mathcal{G}_0(\theta^\star)^{\mathsf{c}}$, the preference is zero, hence uninformative. For observations in $\mathcal{G}_1(\theta^\star)^{\mathsf{c}}$, the preference is locally independent of the parameter; therefore, they do not provide gradient information to distinguish $\theta^\star$ from a neighboring alternative parameter. Only the preferences of samples in $\mathcal{G}_1(\theta^\star)$ can provide information on $\theta^\star$, hence preference learning is meaningful if these samples are observed, i.e., $\mathbb{P}_{p^{\otimes 2}_{\theta^\star}}(\mathcal{G}_1(\theta^\star)) > 0$ for all $\theta^\star \in \Theta$.

**Negative examples.** The above condition is restrictive both on $\ell_\theta$ and $p_\theta$, even when considering $r_\theta = \log p_\theta$. For example, taking $p_\theta$ as the uniform distribution over $[0, \theta]$, we have $\mathbb{P}_{p^{\otimes 2}_\theta}(\mathcal{G}_1(\theta)) = 0$: the density is constant on the support, so the log-probability preference is zero for every observed pair.

### 2.1. Sample-only MLE

In the absence of preference observations, a natural baseline is to estimate $\theta^\star$ directly from the observations. Given $(X_i, Y_i)_{i \in [n]} \sim p^{\otimes 2n}_{\theta^\star}$, the sample-only (SO) MLE is

$$\hat\theta^{\mathrm{SO}}_n \in \arg\min_{\theta} L^{\mathrm{SO}}_n(\theta) \quad \text{with} \quad L^{\mathrm{SO}}_n(\theta) := -\sum_{i \in [n]} \log p^{\otimes 2}_\theta(X_i, Y_i). \tag{SO MLE}$$

**Asymptotic normality.** Under enough regularity (Van der Vaart, 2000), SO MLE is asymptotically normal, i.e.,

$$\sqrt{n}\,(\hat\theta^{\mathrm{SO}}_n - \theta^\star) \underset{n \to +\infty}{\rightsquigarrow} \mathcal{N}\big(0_k,\ \mathcal{I}(p^{\otimes 2}_{\theta^\star})^{-1}\big),$$

where $\mathcal{I}(p_\theta) := \mathbb{E}_{p_\theta}[-\nabla^2_\theta \log p_\theta]$ is the Fisher information matrix of $p_\theta$ and $\rightsquigarrow$ denotes convergence in distribution.
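As a concrete instance, here is a minimal sketch (ours, assuming $\Sigma = I_d$ so that natural and mean parameters coincide) of the sample-only MLE, which in the Gaussian case reduces to the empirical mean of all $2n$ observations:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 5000
theta_star = rng.uniform(1.0, 2.0, size=d)    # unknown parameter (Sigma = I_d)

X = rng.normal(theta_star, 1.0, size=(n, d))  # first samples of each pair
Y = rng.normal(theta_star, 1.0, size=(n, d))  # second samples of each pair

theta_so = (X + Y).sum(axis=0) / (2 * n)      # SO MLE: mean of all 2n observations
print("SO MLE error:", np.linalg.norm(theta_so - theta_star))  # ~ sqrt(d / (2n))
```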
Let $\preceq$ denote the Loewner order on p.s.d. matrices. By the Cramér-Rao bound (Rao, 1992), SO MLE has optimal asymptotic covariance among the class of unbiased sample-only estimators, i.e., every unbiased sample-only estimator with asymptotic covariance $V$ satisfies $V \succeq \mathcal{I}(p^{\otimes 2}_{\theta^\star})^{-1}$. While asymptotic guarantees provide insight into estimator behavior as $n \to \infty$, they do not capture performance in the relevant regime of moderate sample sizes. Modern statistics gives meaningful non-asymptotic concentration results on empirical estimators, e.g., for high-dimensional statistics (Vershynin, 2018; Wainwright, 2019).

**Regularity conditions.** The asymptotic statistics literature has devised weak regularity conditions under which asymptotic normality holds. Classical conditions are stronger, e.g., $\theta \mapsto \log p_\theta(x)$ is three times continuously differentiable for every $x \in \mathcal{X}$ and the integral of its third derivative converges uniformly for all $\theta$ (Van der Vaart, 2000, Chapter 5.6). Those weak or classical conditions ensure that integrals and derivatives can be exchanged, and that Taylor approximations around $\theta^\star$ are well controlled. Throughout this paper, we use "under enough regularity" to refer to these regularity conditions on both $p_\theta$ and $\ell_\theta$. For preferences based on the reward $r_\theta = \log p_\theta$, the regularity of $p_\theta$ implies that of the preference $\ell_\theta$ due to the properties of the logarithm. Moreover, those regularity conditions are satisfied for numerous well-known distributions such as $\mathcal{F}_{\mathcal{N},\Sigma}$ and $\mathcal{F}_{\mathrm{Lap},b}$. When studying deterministic preferences, we introduce general geometric assumptions on $p_\theta$ and $\ell_\theta$. Since these conditions are inherently more restrictive, our goal is not to identify the weakest possible regularity assumptions under which our derivations hold.

## 3. Preference-based M-estimator

In this section, we investigate when preference-based estimators can improve upon sample-only estimators. Given preference-labeled observations $\{(X_i, Y_i, Z_i)\}_{i \in [n]}$, we define the stochastic preferences MLE (SP MLE) as

$$\hat\theta^{\mathrm{SP}}_n \in \arg\min_{\theta} L^{\mathrm{SP}}_n(\theta) \quad \text{with} \quad L^{\mathrm{SP}}_n(\theta) := L^{\mathrm{SO}}_n(\theta) - \sum_{i \in [n]} \log \sigma(Z_i \ell_\theta(X_i, Y_i)). \tag{SP MLE}$$

This objective extends SO MLE by adding a preference-based term: a binary classification loss using the logistic function $-\log \sigma(x)$. When preferences are stochastic, this estimator corresponds to the MLE under a probabilistic preference model, justifying its name. Under sufficient regularity, M-estimators achieve asymptotic normality, so our goal is to obtain a lower asymptotic covariance for SP MLE than for SO MLE. In addition, we want to show that SP MLE reaches a lower asymptotic covariance for deterministic preferences than for stochastic preferences.

### 3.1. Stochastic Preferences

Under stochastic feedback, we are given noisy preference observations $(X_i, Y_i, Z_i)_{i \in [n]} \sim q^{\otimes n}_{\theta^\star, h_{\mathrm{sto}}}$, where $h_{\mathrm{sto}}$ is defined in Equation (2). SP MLE is a specific instance of M-estimator. Under enough regularity (Van der Vaart, 2000, Chapter 5.5), SP MLE is asymptotically normal, i.e.,

$$\sqrt{n}\,(\hat\theta^{\mathrm{SP}}_n - \theta^\star) \underset{n \to +\infty}{\rightsquigarrow} \mathcal{N}\big(0_k,\ \mathcal{I}(q_{\theta^\star, h_{\mathrm{sto}}})^{-1}\big),$$

where $\mathcal{I}(q_{\theta, h_{\mathrm{sto}}}) := \mathbb{E}_{q_{\theta, h_{\mathrm{sto}}}}[-\nabla^2_\theta \log q_{\theta, h_{\mathrm{sto}}}]$ denotes the Fisher information matrix of $q_{\theta, h_{\mathrm{sto}}}$. By the Cramér-Rao bound (Rao, 1992), this variance is optimal among unbiased estimators that rely on stochastic preferences. Lemma 3.1 compares its efficiency to the sample-only MLE.

**Lemma 3.1.** *Let $\Sigma^{\mathrm{SP}}_{\theta^\star} := \mathbb{E}_{p^{\otimes 2}_{\theta^\star}}[\sigma(\ell_{\theta^\star})\sigma(-\ell_{\theta^\star})\nabla_\theta \ell_{\theta^\star} \nabla_\theta \ell_{\theta^\star}^{\mathsf{T}}]$. Then, $\mathcal{I}(q_{\theta^\star, h_{\mathrm{sto}}}) = \mathcal{I}(p^{\otimes 2}_{\theta^\star}) + \Sigma^{\mathrm{SP}}_{\theta^\star}$. The p.s.d. matrix $\Sigma^{\mathrm{SP}}_{\theta^\star}$ is definite if $\mathbb{P}_{p^{\otimes 2}_{\theta^\star}}(|\langle u, \nabla_\theta \ell_{\theta^\star}\rangle| > 0) > 0$ for all $u \in \mathcal{S}^{k-1}$.*
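A quick Monte Carlo check of the correction term in Lemma 3.1, in the scalar Gaussian case $\mathcal{N}(\theta^\star, 1)$ (our sketch; the value should be consistent with $2\Sigma^{\mathrm{SP}}_{\mathcal{N}(0,1)} \approx 0.34$ given the constants reported in Section 6):

```python
import numpy as np
from scipy.special import expit as sigmoid

rng = np.random.default_rng(2)
theta_star, m = 0.0, 10**6
X = rng.normal(theta_star, 1.0, m)
Y = rng.normal(theta_star, 1.0, m)

ell = (X - Y) * (theta_star - (X + Y) / 2.0)  # ell_theta*(X, Y) = log p(X) - log p(Y)
grad = X - Y                                   # gradient of ell_theta at theta*
sigma_sp = np.mean(sigmoid(ell) * sigmoid(-ell) * grad**2)

fisher_pair = 2.0  # I(p^{x2}) for N(theta, 1): two unit-information samples per pair
print(f"Sigma_SP ~ {sigma_sp:.3f}; relative Fisher-information gain {sigma_sp / fisher_pair:.1%}")
```

The relative gain of roughly 17% illustrates why the asymptotic improvement from stochastic preferences is modest.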
Lemma 3.1 shows that $\mathcal{I}(q_{\theta^\star, h_{\mathrm{sto}}}) \succeq \mathcal{I}(p^{\otimes 2}_{\theta^\star})$ and exhibits a condition under which $\hat\theta^{\mathrm{SP}}_n$ is asymptotically better than $\hat\theta^{\mathrm{SO}}_n$, meaning that incorporating preference data improves asymptotic efficiency. The condition in Lemma 3.1 ensures that $\nabla_\theta \ell_{\theta^\star}$ spans all directions with some probability, making the preference-based estimator asymptotically superior to the sample-only MLE. For preferences based on the reward $r_\theta = \log p_\theta$, this condition holds for both Laplace and Gaussian distributions: $\Sigma^{\mathrm{SP}}_{\theta^\star} = \frac{4}{b^2}\Sigma^{\mathrm{SP}}_{\mathrm{Lap}(0,1)}$ for $\mathcal{F}_{\mathrm{Lap},b}$ (Appendix G), and $\Sigma^{\mathrm{SP}}_{\theta^\star} = 2\Sigma^{1/2}\Sigma^{\mathrm{SP}}_{\mathcal{N}(0_d,I_d)}\Sigma^{1/2}$ for $\mathcal{F}_{\mathcal{N},\Sigma}$ (Appendix F). Thus, stochastic preferences can improve parameter estimation compared to sample-only estimators. However, non-asymptotic performance can differ, and in practice the reduction in asymptotic variance may be small, as we investigate empirically in Section 6. Next, we examine whether M-estimators based on deterministic preferences can further improve upon their stochastic counterparts.

### 3.2. Deterministic Preferences

We now consider the setting where true preferences are observed, meaning that the preference labels $Z_i$ are deterministic. We observe $(X_i, Y_i, Z_i)_{i \in [n]} \sim q^{\otimes n}_{\theta^\star, h_{\mathrm{det}}}$, where $h_{\mathrm{det}}$ is defined in Equation (1). We use the same M-estimator as in the stochastic setting, $\hat\theta^{\mathrm{SP}}_n \in \arg\min_\theta L^{\mathrm{SP}}_n(\theta)$, but now with deterministic preferences. To distinguish this setting, we introduce the notation SPdet for the preference-based estimator under deterministic feedback.

**Consistency of SPdet.** Define the population-level objective $M(\theta) := \mathbb{E}_{p^{\otimes 2}_{\theta^\star}}[\log q_{\theta, h_{\mathrm{sto}}}(X, Y, \mathrm{sign}(\ell_{\theta^\star}(X, Y)))]$. Under enough regularity, $\hat\theta^{\mathrm{SPdet}}_n$ converges to a maximizer of $M(\theta)$ (Van der Vaart, 2000, Chapter 5.2). However, unlike in the stochastic setting, $\theta^\star$ may not be a maximizer of $M$, since standard regularity conditions on $p_\theta$ and $\ell_\theta$ are insufficient. A sufficient condition for consistency is

$$\mathbb{E}_{p^{\otimes 2}_{\theta^\star}}\big[\mathrm{sign}(\ell_{\theta^\star})\,\sigma(-|\ell_{\theta^\star}|)\,\nabla_\theta \ell_{\theta^\star}\big] = 0_k, \tag{3}$$

which holds for $\mathcal{F}_{\mathcal{N},\Sigma}$ (Appendix F) and $\mathcal{F}_{\mathrm{Lap},b}$ (Appendix G) when using the reward $r_\theta = \log p_\theta$.

**Asymptotic variance of SPdet.** If Equation (3) holds, then under additional regularity conditions (Van der Vaart, 2000, Chapter 5.3) SPdet is asymptotically normal with covariance $V^{\mathrm{SPdet}}_{\theta^\star}$ given by the following lemma.

**Lemma 3.2.** *Let $H^{\mathrm{SPdet}}_{\theta^\star} := \mathbb{E}_{p^{\otimes 2}_{\theta^\star}}[u_{\theta^\star} \nabla^2_\theta \ell_{\theta^\star}]$, $\Sigma^{\mathrm{SPdet}}_{\theta^\star} := \mathbb{E}_{p^{\otimes 2}_{\theta^\star}}[(2\sigma(|\ell_{\theta^\star}|) - 1)\sigma(-|\ell_{\theta^\star}|)\nabla_\theta \ell_{\theta^\star} \nabla_\theta \ell_{\theta^\star}^{\mathsf{T}}]$ and $R^{\mathrm{SPdet}}_{\theta^\star} := -\mathbb{E}_{p^{\otimes 2}_{\theta^\star}}[u_{\theta^\star}(M_{\theta^\star} + M_{\theta^\star}^{\mathsf{T}})]$, where $u_{\theta^\star} := \mathrm{sign}(\ell_{\theta^\star})\sigma(-|\ell_{\theta^\star}|)$ and $M_{\theta^\star} := \nabla_\theta \log p^{\otimes 2}_{\theta^\star} (\nabla_\theta \ell_{\theta^\star})^{\mathsf{T}}$. Then, we have $V^{\mathrm{SPdet}}_{\theta^\star} := V^{-1}_{1,\theta^\star} V_{2,\theta^\star} V^{-1}_{1,\theta^\star}$, where $V_{1,\theta^\star} = \mathcal{I}(q_{\theta^\star, h_{\mathrm{sto}}}) - H^{\mathrm{SPdet}}_{\theta^\star}$ and $V_{2,\theta^\star} = \mathcal{I}(q_{\theta^\star, h_{\mathrm{sto}}}) - \Sigma^{\mathrm{SPdet}}_{\theta^\star} - R^{\mathrm{SPdet}}_{\theta^\star}$. If $\mathbb{P}_{p^{\otimes 2}_{\theta^\star}}(|\ell_{\theta^\star}\langle u, \nabla_\theta \ell_{\theta^\star}\rangle| > 0) > 0$ for all $u \in \mathcal{S}^{k-1}$, the p.s.d. matrix $\Sigma^{\mathrm{SPdet}}_{\theta^\star}$ is definite.*

Equation (3) and $V^{\mathrm{SPdet}}_{\theta^\star} \preceq \mathcal{I}(q_{\theta^\star, h_{\mathrm{sto}}})^{-1}$ depend on the geometry of $p^{\otimes 2}_{\theta^\star}$ and $\ell_{\theta^\star}$. We verify that these conditions hold for $\mathcal{F}_{\mathcal{N},\Sigma}$ (Appendix F) and $\mathcal{F}_{\mathrm{Lap},b}$ (Appendix G) when using the reward $r_\theta = \log p_\theta$. For Laplace distributions, we have $H^{\mathrm{SPdet}}_{\theta^\star} = R^{\mathrm{SPdet}}_{\theta^\star} = 0$ and $\Sigma^{\mathrm{SPdet}}_{\theta^\star} = \frac{4}{b^2}\Sigma^{\mathrm{SPdet}}_{\mathrm{Lap}(0,1)}$. For Gaussian distributions, we have $H^{\mathrm{SPdet}}_{\theta^\star} = 0_{d \times d}$, $\Sigma^{\mathrm{SPdet}}_{\theta^\star} = 2\Sigma^{1/2}\Sigma^{\mathrm{SPdet}}_{\mathcal{N}(0_d,I_d)}\Sigma^{1/2}$ and $R^{\mathrm{SPdet}}_{\theta^\star} = 2\Sigma^{1/2}R^{\mathrm{SPdet}}_{\mathcal{N}(0_d,I_d)}\Sigma^{1/2}$ with $R^{\mathrm{SPdet}}_{\mathcal{N}(0_d,I_d)} \succeq 0_{d \times d}$. Thus, deterministic preferences improve parameter estimation compared to stochastic preferences.

In conclusion, preference-based M-estimators provide asymptotic improvements in estimation efficiency. Next, we explore whether estimators beyond the M-estimation framework can achieve further gains, potentially exceeding the asymptotic normality limitations.
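To make the M-estimation pipeline concrete before moving beyond it, here is a minimal sketch (ours) fitting the preference-based M-estimator by plain gradient descent in the scalar Gaussian case, with deterministic preferences (SPdet):

```python
import numpy as np
from scipy.special import expit as sigmoid

rng = np.random.default_rng(3)
theta_star, n = 1.5, 2000
X, Y = rng.normal(theta_star, 1.0, n), rng.normal(theta_star, 1.0, n)
S = (X + Y) / 2.0
Z = np.sign((X - Y) * (theta_star - S))  # deterministic preferences (h_det)

def grad(theta: float) -> float:
    """Average gradient of -log p(X_i) - log p(Y_i) - log sigmoid(Z_i * ell_i)."""
    ell = (X - Y) * (theta - S)
    return np.mean((theta - X) + (theta - Y) - Z * sigmoid(-Z * ell) * (X - Y))

theta = 0.0
for _ in range(500):       # plain gradient descent on the SP MLE objective
    theta -= 0.1 * grad(theta)

theta_so = S.mean()        # sample-only MLE for comparison
print(f"SO error {abs(theta_so - theta_star):.4f}, SPdet error {abs(theta - theta_star):.4f}")
```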
## 4. Beyond M-estimators

While computationally efficient, the SPdet estimator does not fully leverage the constraints imposed by deterministic preferences. Unlike in the stochastic setting, deterministic preferences provide separability: there exist parameters that classify the training examples perfectly, including $\theta^\star$ itself. A key limitation of SPdet is that, like standard logistic regression, it minimizes a convex surrogate loss (the negative log-likelihood). This approach can lead to misclassification of training examples.¹ This limitation suggests an opportunity to directly minimize the 0-1 loss², potentially achieving faster rates of convergence.

¹ For binary classification with separable data, logistic regression converges in direction toward a separating hyperplane.
² For non-separable data, the minimization of the 0-1 classification loss can be NP-hard even for the simple class of linear classifiers, e.g., Feldman et al. (2012).

**0-1 loss minimization.** Given $(X_i, Y_i, Z_i)_{i \in [n]} \sim q^{\otimes n}_{\theta^\star, h_{\mathrm{det}}}$, we consider the set $\mathcal{C}_n$ of parameters that minimize the empirical 0-1 loss, i.e.,

$$\mathcal{C}_n := \arg\min_{\theta \in \Theta} \sum_{i \in [n]} \mathbb{1}\left(Z_i \ell_\theta(X_i, Y_i) < 0\right) = \{\theta \in \Theta \mid \forall i \in [n],\ Z_i \ell_\theta(X_i, Y_i) \ge 0\}, \tag{4}$$

which is non-empty as $\theta^\star \in \mathcal{C}_n$. Parameters $\theta \in \mathcal{C}_n$ perfectly classify all training examples. Any estimator $\hat\theta^{\mathrm{AE}}_n \in \mathcal{C}_n$ is referred to as an arbitrary estimator (AE). Alternatively, we constrain the MLE to this feasible set, defining the deterministic preferences MLE (DP MLE), i.e.,

$$\hat\theta^{\mathrm{DP}}_n \in \arg\min\{L^{\mathrm{SO}}_n(\theta) \mid \theta \in \mathcal{C}_n\} \tag{DP MLE}$$

if $\hat\theta^{\mathrm{SO}}_n \notin \mathcal{C}_n$, and $\hat\theta^{\mathrm{DP}}_n := \hat\theta^{\mathrm{SO}}_n$ otherwise. This estimator minimizes the negative log-likelihood of the samples while ensuring perfect preference classification. For Gaussians with $r_\theta = \log p_\theta$, $\hat\theta^{\mathrm{DP}}_n$ estimates $\theta^\star$ better than $\hat\theta^{\mathrm{SO}}_n$ for all $n$, i.e., DP MLE dominates SO MLE statistically.

**Lemma 4.1.** *For all $n \in \mathbb{N}$ and almost surely, we have, for $\mathcal{F}_{\mathcal{N},\Sigma}$, $\|\hat\theta^{\mathrm{DP}}_n - \theta^\star\|_\Sigma \le \|\hat\theta^{\mathrm{SO}}_n - \theta^\star\|_\Sigma$.*

For stochastic preferences, minimizing the 0-1 loss is generally NP-hard, requiring a convex surrogate like the logistic function. However, for deterministic preferences, computing $\mathcal{C}_n$ is more tractable. If $\theta \mapsto \ell_\theta$ is affine, then $\mathcal{C}_n$ is a convex polytope, defined by at most $n$ half-space constraints. For Gaussian-based preferences, i.e., $r_\theta = \log p_\theta$ and $\mathcal{F}_{\mathcal{N},\Sigma}$, we have $Z_i \ell_\theta(X_i, Y_i) \ge 0$ if and only if $Z_i \langle X_i - Y_i,\ \theta - \Sigma^{-1}(X_i + Y_i)/2\rangle \ge 0$.

**Consistency of 0-1 loss minimization.** Define the disagreement probability between $\theta^\star$ and $\theta$ as $m(\theta) := \mathbb{P}_{p^{\otimes 2}_{\theta^\star}}(\mathcal{D}(\theta^\star, \theta))$, where $\mathcal{D}(\theta^\star, \theta) := \{(x, y) \in \mathcal{X}^2 \mid \ell_{\theta^\star}(x, y)\ell_\theta(x, y) < 0\}$ is the set of observations where $\theta^\star$ and $\theta$ assign informative yet opposite preferences. Under enough regularity (Van der Vaart, 2000, Chapter 5.2), $\hat\theta^{\mathrm{AE}}_n$ and $\hat\theta^{\mathrm{DP}}_n$ converge in $\mathcal{C}(\theta^\star) := \{\theta \in \Theta \mid m(\theta) = 0\}$, which is the non-empty set of minimizers of $m(\theta)$ as $m(\theta) \ge 0 = m(\theta^\star)$. We note that the set $\mathcal{C}(\theta^\star)$ contains $\theta^\star$, but possibly other parameters as well. To ensure consistency ($\hat\theta^{\mathrm{AE}}_n, \hat\theta^{\mathrm{DP}}_n \to \theta^\star$), we impose the following identifiability assumption, which guarantees $\mathcal{C}(\theta^\star) = \{\theta^\star\}$.

**Assumption 4.2 (Identifiability).** *For all $\theta \neq \theta^\star$, $m(\theta) > 0$.*

When $r_\theta = \log p_\theta$, it holds for both the Gaussian ($\mathcal{F}_{\mathcal{N},\Sigma}$, Appendix F) and Laplace ($\mathcal{F}_{\mathrm{Lap},b}$, Appendix G) cases.

**Fast estimation rate.** Once consistency is established, the next goal is to analyze the convergence rate of the estimation errors $\|\hat\theta^{\mathrm{AE}}_n - \theta^\star\|$ and $\|\hat\theta^{\mathrm{DP}}_n - \theta^\star\|$. Since these are not M-estimators, they are not necessarily limited to the typical parametric rate $\Omega(1/\sqrt{n})$. Theorem 4.3 below states our main result for Laplace and Gaussian distributions when using log-probability rewards, i.e., a high-probability accelerated rate in $O(1/n)$.
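Before stating it, here is a minimal sketch (ours) of DP MLE in the univariate Gaussian case, where $\mathcal{C}_n$ is an interval and the constrained MLE reduces to clipping the sample-only MLE:

```python
import numpy as np

rng = np.random.default_rng(4)
theta_star, n = 1.5, 1000
X, Y = rng.normal(theta_star, 1.0, n), rng.normal(theta_star, 1.0, n)
S = (X + Y) / 2.0
Z = np.sign((X - Y) * (theta_star - S))  # deterministic log-likelihood preferences

# C_n: Z_i * (X_i - Y_i) * (theta - S_i) >= 0 reduces, for d = 1, to
# theta >= S_i when Z_i * (X_i - Y_i) > 0 and theta <= S_i otherwise.
sgn = Z * (X - Y)
lo = S[sgn > 0].max() if np.any(sgn > 0) else -np.inf
hi = S[sgn < 0].min() if np.any(sgn < 0) else np.inf

theta_so = S.mean()                  # SO MLE
theta_dp = np.clip(theta_so, lo, hi) # DP MLE: projection of SO MLE onto C_n
print(f"C_n = [{lo:.4f}, {hi:.4f}]  SO error {abs(theta_so - theta_star):.4f}  "
      f"DP error {abs(theta_dp - theta_star):.4f}")
```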
**Theorem 4.3.** *Let $\delta \in (0, 1)$. For $\mathcal{F}_{\mathrm{Lap},1}$ and $\mathcal{F}_{\mathcal{N},1}$, we have, for all $n = \Omega(\log(1/\delta))$, with probability $1 - \delta$,*

$$\forall \hat\theta_n \in \mathcal{C}_n, \quad n\,|\hat\theta_n - \theta^\star| = O\left(\log(1/\delta)\right).$$

*For $\mathcal{F}_{\mathcal{N},\Sigma}$ with $d > 1$, there exists a positive $A_d = d + O(\sqrt{d})$ (as $d \to +\infty$) such that, for all $n = \widetilde\Omega(\log(1/\delta))$, with probability $1 - \delta$,*

$$\forall \hat\theta_n \in \mathcal{C}_n, \quad n\,\|\hat\theta_n - \theta^\star\|_\Sigma \le O\left(A_d \log(1/\delta)\log n\right).$$

Theorem 4.3 is a direct corollary of our main result, showing that $\max_{\theta \in \mathcal{C}_n}\|\theta - \theta^\star\| = O(1/n)$ (see Theorem 4.8 below). It directly guarantees faster convergence rates for both $\hat\theta^{\mathrm{AE}}_n$ and $\hat\theta^{\mathrm{DP}}_n$. Theorem 4.8 holds under general geometric conditions on $p_\theta$ and $\ell_\theta$ that we introduce with intuitions, while sketching the proof in Section 4.1.

**Negative examples.** Assumption 4.2 is restrictive both on $\ell_\theta$ and $p_\theta$, even when considering $r_\theta = \log p_\theta$. For example, when all the distributions in $\mathcal{F}$ agree on their preferences, $\mathrm{sign}(\ell_\theta(x, y))$ is independent of $\theta$. Therefore, we have $m(\theta) = 0$ for all $\theta \neq \theta^\star$, since $\ell_{\theta^\star}(x, y)\ell_\theta(x, y) \ge 0$. Such cases include scenarios where $p_\theta(x)$ is a monotonic function of $x$, e.g., the exponential distribution and the Pareto distribution with a known location, as well as the Laplace distribution with a known location. This motivates our later assumptions on the directionality of $\nabla_\theta \ell_{\theta^\star}$ for observed samples.

**Link to iterative human preference alignment.** Many human preference alignment methods build on the Bradley-Terry model for preferences, based on rewards. Direct alignment algorithms use variants of the log-likelihood to define the implicit reward of a policy (Rafailov et al., 2023). Choosing $\ell_\theta(x, y) = \log p_\theta(x) - \log p_\theta(y)$ coincides with the optimal policy for maximum-entropy RL (Swamy et al., 2025). When leveraging offline preference data, the assumption $(X, Y) \sim p^{\otimes 2}_{\theta^\star}$ is unrealistic, as $\ell_\theta$ is collected from a fixed dataset of pairs of observations. However, online preference data has become a popular paradigm in the training of recent LLMs. These iterative alignment procedures rely on preference data generated by an earlier model (Dubey et al., 2024). At stage $N$, the model $p_{\theta_N}$ is trained based on the preference data for generations by the previous model, i.e., $(X, Y) \sim p^{\otimes 2}_{\theta_{N-1}}$. Under the realizability assumption and without mode collapse, this self-refinement paradigm should converge towards the true model $p_{\theta^\star}$. Our setting characterizes the limiting behavior of this iterative process, i.e., preferences based on $\ell_{\theta^\star}$ for observations from $p_{\theta^\star}$. Nonetheless, we do not claim the direct applicability of DP MLE for realistic LLM training.

### 4.1. Upper Bound on the Estimation Error

We establish a high-probability upper bound on the estimation error $\max_{\theta \in \mathcal{C}_n}\|\theta - \theta^\star\|$ in the general case. This requires grasping the geometry of $\mathcal{C}_n$ relative to $\theta^\star$.

**Linearized feasibility set.** Since $\mathcal{C}_n$ is defined by non-linear preference constraints, analyzing its geometry is challenging, and we thus consider a linearized approximation of it. We define the linearized constraint set as

$$\widetilde{\mathcal{C}}_n := \{\theta \in \Theta \mid \forall i \in [n],\ (X_i, Y_i) \notin \widetilde{\mathcal{D}}(\theta^\star, \theta)\},$$

where $\widetilde{\mathcal{D}}(\theta^\star, \theta) := \{(x, y) \in \mathcal{X}^2 \mid \ell_{\theta^\star}(x, y)^2 + \ell_{\theta^\star}(x, y)\langle\theta - \theta^\star, \nabla_\theta \ell_{\theta^\star}(x, y)\rangle < 0\}$. This set replaces $\ell_\theta$ with its first-order Taylor expansion around $\theta^\star$, neglecting higher-order terms. A key assumption is that the true constraints are at least as strong as the linearized ones. This ensures $\mathcal{C}_n \subseteq \widetilde{\mathcal{C}}_n$, allowing us to control $\mathcal{C}_n$ via $\widetilde{\mathcal{C}}_n$.

**Assumption 4.4 (Linearization validity).** *For all $\theta \neq \theta^\star$, $\widetilde{\mathcal{D}}(\theta^\star, \theta) \subseteq \mathcal{D}(\theta^\star, \theta)$.*

**Directional analysis and informative constraints.** To quantify the geometry of $\widetilde{\mathcal{C}}_n$ relative to $\theta^\star$, we analyze deviations along directions $u \in \mathcal{S}^{k-1}$.
Define the set of informative samples along direction $u$:

$$\mathcal{G}_1(\theta^\star, u) := \{(x, y) \mid \ell_{\theta^\star}(x, y)\langle u, \nabla_\theta \ell_{\theta^\star}(x, y)\rangle < 0\}.$$

This set contains observations whose preferences give information along the direction $u$. Assuming that preferences are informative along all directions, we prevent degenerate cases where some directions lack preference information.

**Assumption 4.5 (Informative preferences).** *For all $u \in \mathcal{S}^{k-1}$, $\mathbb{P}_{p^{\otimes 2}_{\theta^\star}}(\mathcal{G}_1(\theta^\star, u)) > 0$.*

**Deviation bound via minimum informative sample.** Define $R_{n,u}$ as the maximal deviation from $\theta^\star$ within $\widetilde{\mathcal{C}}_n$ along the direction $u$, i.e., $R_{n,u} := \max\{\varepsilon \ge 0 \mid \theta^\star + \varepsilon u \in \widetilde{\mathcal{C}}_n\}$. We define the scaling factor, for $(x, y) \in \mathcal{G}_1(\theta^\star, u)$,

$$V_{\theta^\star,u}(x, y) := -\frac{\ell_{\theta^\star}(x, y)}{\langle u, \nabla_\theta \ell_{\theta^\star}(x, y)\rangle}.$$

The value $V_{\theta^\star,u}(X_i, Y_i)$ quantifies the amount of information in the preference between $X_i$ and $Y_i$ to discriminate $\theta^\star$ from other parameters on the half-line directed by $u$: the lower $V_{\theta^\star,u}(X_i, Y_i)$, the more discriminative the preference between $X_i$ and $Y_i$. Since $(x, y) \in \mathcal{G}_1(\theta^\star, u) \setminus \widetilde{\mathcal{D}}(\theta^\star, \theta^\star + \varepsilon u)$ if and only if $V_{\theta^\star,u}(x, y) \ge \varepsilon$, we obtain

$$R_{n,u} \le \min_{i \in [n]}\{V_{\theta^\star,u}(X_i, Y_i) \mid (X_i, Y_i) \in \mathcal{G}_1(\theta^\star, u)\}.$$

Therefore, the maximal deviation $R_{n,u}$ is upper bounded by a minimum of positive random variables. It remains to upper bound the value of this minimum with high probability and conclude, provided some regularity holds, e.g., a positive density at zero. By analyzing the distribution of $V_{\theta^\star,u}$, we derive the following probabilistic bound.

**Lemma 4.6.** *Suppose Assumption 4.5 holds. For all $u \in \mathcal{S}^{k-1}$, with probability $1 - \delta$,*

$$R_{n,u} \le F^{-1}_{\theta^\star,u}\left(\min\{1, \log(1/\delta)/n\}\right),$$

*with $F_{\theta^\star,u}(\varepsilon) := \mathbb{P}_{p^{\otimes 2}_{\theta^\star}}(V_{\theta^\star,u} \in (0, \varepsilon])$ the c.d.f. of $V_{\theta^\star,u}$.*

Since $\max_{\theta \in \mathcal{C}_n}\|\theta - \theta^\star\| \le \max_{u \in \mathcal{S}^{k-1}} R_{n,u}$, Lemma 4.6 shows that the estimation error can be controlled by the behavior of $F^{-1}_{\theta^\star,u}$ around zero, where $F^{-1}_{\theta^\star,u}(0) = 0$.

**Regularity assumption.** To control $F^{-1}_{\theta^\star,u}$ near zero, the density $F'_{\theta^\star,u}$ should be positive near zero, and we assume control on $(F^{-1}_{\theta^\star,u})''$.

**Assumption 4.7 (Positive density at zero and regularity of inverse c.d.f.).** *For all $u \in \mathcal{S}^{k-1}$, $F'_{\theta^\star,u}(0) \in (0, +\infty)$ and there exists $(x_{\theta^\star,u}, M_{\theta^\star,u}) \in (0, 1) \times \mathbb{R}_+$ such that $\sup_{x \in [0, x_{\theta^\star,u}]} |(F^{-1}_{\theta^\star,u})''(x)| \le M_{\theta^\star,u}$.*

Using this assumption and $(F^{-1}_{\theta^\star,u})'(0) = 1/F'_{\theta^\star,u}(0)$, the first-order Taylor expansion with remainder yields

$$\forall x \in [0, x_{\theta^\star,u}], \quad \left|F^{-1}_{\theta^\star,u}(x) - x/F'_{\theta^\star,u}(0)\right| \le M_{\theta^\star,u}\, x^2/2.$$

This argument leads to our main theorem, directly for $k = 1$ and using a covering argument for $k > 1$.

**Theorem 4.8.** *Suppose Assumptions 4.2, 4.4, 4.5 and 4.7 hold. Let $\delta \in (0, 1)$. Let $\gamma > 0$ and $N(\gamma)$ be the $\gamma$-covering number of $\Theta$ for the norm $\|\cdot\|$. Let $A^{-1}_{\theta^\star} = \min_{u \in \mathcal{S}^{k-1}} F'_{\theta^\star,u}(0)$, $B^{-1}_{\theta^\star} = \min_{u \in \mathcal{S}^{k-1}} x_{\theta^\star,u}$ and $C_{\theta^\star} = \max_{u \in \mathcal{S}^{k-1}} M_{\theta^\star,u}/2$. When $k = 1$, for all $n \ge B_{\theta^\star}\log(2/\delta)$, with probability $1 - \delta$,*

$$\max_{\theta \in \mathcal{C}_n}\|\theta - \theta^\star\| \le \frac{A_{\theta^\star}}{n}\log(2/\delta) + \frac{C_{\theta^\star}}{n^2}\log(2/\delta)^2.$$

*When $k > 1$, for all $n \ge B_{\theta^\star}\log(N(\gamma)/\delta)$, with probability $1 - \delta$,*

$$\max_{\theta \in \mathcal{C}_n}\|\theta - \theta^\star\| \le \gamma + \frac{A_{\theta^\star}}{n}\log(N(\gamma)/\delta) + \frac{C_{\theta^\star}}{n^2}\log(N(\gamma)/\delta)^2.$$

When $k > 1$, the choice of the optimal parameter $\gamma$ depends on the norm $\|\cdot\|$. Since $\Theta$ is bounded by $B_\Theta$, $N(\gamma)$ is upper bounded by the covering number of the ball having diameter $B_\Theta$. As an example, let $N_2(\varepsilon)$ be the $\varepsilon$-covering number of the unit ball in $\mathbb{R}^k$ for the Euclidean norm. Then, it is known that $\log N_2(\varepsilon) \lesssim \log(\varepsilon^2 k)/\varepsilon^2$ if $\varepsilon \ge 1/\sqrt{k}$ and $\log N_2(\varepsilon) \lesssim k \log\frac{1}{\varepsilon^2 k}$ if $\varepsilon \le 1/\sqrt{k}$.³ Therefore, optimizing over $\gamma$ yields an upper bound on $\max_{\theta \in \mathcal{C}_n}\|\theta - \theta^\star\|_2$ scaling as $\widetilde{O}(A_{\theta^\star} k/n)$ when $n \gtrsim A_{\theta^\star} k^{3/2}$, and $\widetilde{O}((A_{\theta^\star}/n)^{1/3})$ otherwise, where $\widetilde{O}(\cdot)$ hides logarithmic terms. For large sample size compared to the dimension, i.e., $n \gtrsim A_{\theta^\star} k^{3/2}$, we recover a rate of $\widetilde{O}(1/n)$.

³ E.g., using Gilbert-Varshamov for the lower bound (Gilbert, 1952) and Maurey's empirical method for the upper bound.
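As a sanity check on Lemma 4.6, here is a Monte Carlo sketch (ours) in the scalar standard-Gaussian case, where $V_{\theta^\star,u}(X, Y) = S - \theta^\star$ restricted to $S > \theta^\star$ for $u = +1$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
theta_star, n, delta, runs = 0.0, 500, 0.05, 2000

# F_{theta*,u}(eps) = P(0 < S - theta* <= eps) with S ~ N(theta*, 1/2), so the
# inverse c.d.f. appearing in Lemma 4.6 is explicit (min(., 0.5) guards against
# arguments above the total mass P(S > theta*) = 1/2).
F_inv = lambda t: norm.ppf(0.5 + min(t, 0.5), scale=np.sqrt(0.5))
bound = F_inv(np.log(1.0 / delta) / n)

cover = 0
for _ in range(runs):
    S = theta_star + rng.normal(0.0, np.sqrt(0.5), n)
    R = (S[S > theta_star] - theta_star).min()  # maximal deviation along u = +1
    cover += (R <= bound)
print(f"bound {bound:.4f}, empirical coverage {cover / runs:.3f} (target >= {1 - delta})")
```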
In conclusion, we have derived generic assumptions under which the rate of decay of the estimation error of $\hat\theta^{\mathrm{AE}}_n$ and $\hat\theta^{\mathrm{DP}}_n$ is in $\widetilde{O}(1/n)$. This is a significant improvement compared to the asymptotic normality of the SPdet estimator, which implies a rate of $O(1/\sqrt{n})$.

**Positive examples.** While Assumptions 4.4, 4.5 and 4.7 are restrictive, they hold for $\mathcal{F}_{\mathcal{N},\Sigma}$ (Appendix F) and $\mathcal{F}_{\mathrm{Lap},b}$ (Appendix G) when using the reward $r_\theta = \log p_\theta$. This yields Theorem 4.3. We have $A_{\theta^\star} = 2b$, $B_{\theta^\star} = 8$ and $C_{\theta^\star} = 16b$ for $\mathcal{F}_{\mathrm{Lap},b}$, and $A_{\theta^\star} = \sqrt{\pi}\,(d-1)\,\Gamma(d/2)/(2\,\Gamma((d-1)/2))$ for $\mathcal{F}_{\mathcal{N},\Sigma}$, hence a rate in $O(d^{3/2}/n)$ when $n \gtrsim d^2$.

**Extended discussions.** In Appendix B, we discuss how to verify or weaken our assumptions (Appendix B.1), the sources of misspecification (Appendix B.2) and reward models other than the log-likelihood (Appendix B.3).

## 5. Lower Bound for Deterministic Feedback

In this section, we show that the rate $O(1/n)$ is minimax optimal (up to a logarithmic factor) by deriving a matching lower bound. The standard approach to minimax lower bounds in estimation relies on Fano-type inequalities and hypothesis-testing reductions. However, due to Assumption 4.2, the Kullback-Leibler divergence and $\chi^2$ distance between $q_{\tilde\theta}$ and $q_\theta$ are infinite for $\tilde\theta \neq \theta$, making these tools ineffective. Instead, we use Assouad's lemma (Tsybakov, 2009), which provides lower bounds via the total variation distance, defined as $\mathrm{TV}(P, Q) := \frac{1}{2}\|P - Q\|_1$ for distributions $P$ and $Q$. Since TV is not well-behaved for product distributions, we use the squared Hellinger distance, defined as $H^2(P, Q) := \frac{1}{2}\|\sqrt{P} - \sqrt{Q}\|_2^2$, which satisfies $\mathrm{TV}(P^{\otimes n}, Q^{\otimes n}) \le \sqrt{2 H^2(P^{\otimes n}, Q^{\otimes n})} \le \sqrt{2n\, H^2(P, Q)}$. For further analytical convenience, we also employ the Bhattacharyya coefficient, $\mathrm{BC}(P, Q) := \|\sqrt{PQ}\|_1$, which is related to the Hellinger distance by $H^2(P, Q) = 1 - \mathrm{BC}(P, Q)$. Since $\sqrt{q_{\tilde\theta}\, q_\theta}$ is zero for disagreeing preferences, we define the restricted Bhattacharyya coefficient as

$$\widetilde{\mathrm{BC}}(\tilde\theta, \theta) := \mathbb{E}_{p^{\otimes 2}_{\tilde\theta}}\left[\mathbb{1}\left(\mathcal{D}(\tilde\theta, \theta) \cup \mathcal{D}_0(\tilde\theta, \theta)\right)\sqrt{p^{\otimes 2}_\theta / p^{\otimes 2}_{\tilde\theta}}\right],$$

where $\mathcal{D}_0(\tilde\theta, \theta) := \mathcal{G}_0(\tilde\theta)\, \triangle\, \mathcal{G}_0(\theta)$ is the set where the preference is zero for exactly one of the two parameters. Lemma 5.1 decomposes the Hellinger distance between two distributions over the preference triplets into the Hellinger distance between sample-only distributions and the disagreement-restricted Bhattacharyya coefficient.

**Lemma 5.1.** *$H^2(q_{\tilde\theta}, q_\theta) = \widetilde{\mathrm{BC}}(\tilde\theta, \theta) + H^2(p^{\otimes 2}_{\tilde\theta}, p^{\otimes 2}_\theta)$ for all $\tilde\theta, \theta \in \Theta$.*

As $H^2(p^{\otimes 2}_{\tilde\theta}, p^{\otimes 2}_\theta) \le 2 H^2(p_{\tilde\theta}, p_\theta)$, deriving a lower bound requires controlling $H^2(p_{\tilde\theta}, p_\theta)$ and $\widetilde{\mathrm{BC}}(\tilde\theta, \theta)$; hence, we impose the following assumption.

**Assumption 5.2.** *There exist positive constants $c_1, c_2$ independent of $k$ and a dimension- and problem-dependent scaling function $\alpha_{\mathcal{F}}(k)$ such that, for all $\tilde\theta, \theta \in \Theta$,*

$$\widetilde{\mathrm{BC}}(\tilde\theta, \theta) + H^2(p^{\otimes 2}_{\tilde\theta}, p^{\otimes 2}_\theta) \le \frac{c_1}{\alpha_{\mathcal{F}}(k)}\|\tilde\theta - \theta\| + c_2\|\tilde\theta - \theta\|^2.$$

Theorem 5.3 bounds the minimax estimation error.

**Theorem 5.3.** *Let $R_{\max} := \inf_{\hat\theta}\sup_{\theta \in \Theta}\mathbb{E}_{q^{\otimes n}_\theta}[\|\hat\theta - \theta\|]$. Suppose Assumption 5.2 holds. Then,*

$$R_{\max} \ge \frac{\sqrt{k}}{8(c_1 + 2c_2)}\min\left\{\frac{\alpha_{\mathcal{F}}(k)}{n},\ \frac{1}{\sqrt{n}}\right\}.$$

This result confirms that the $O(1/n)$ rate is minimax optimal up to logarithmic factors. The scaling $\alpha_{\mathcal{F}}(k)$ comes from $\widetilde{\mathrm{BC}}(\theta^\star, \theta)$ (Assumption 5.2), yet it is challenging to link $\alpha_{\mathcal{F}}(k)$ with $A_{\theta^\star}$ without further assumptions.
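To see the local behavior driving both Assumption 5.2 and the $\Theta(1/n)$ rate, here is a Monte Carlo sketch (ours) of the disagreement probability $m(\theta)$ in the scalar Gaussian case, where disagreement occurs exactly when $S = (X+Y)/2$ falls strictly between $\theta$ and $\theta^\star$:

```python
import numpy as np

rng = np.random.default_rng(6)
theta_star, m = 0.0, 10**6
X, Y = rng.normal(theta_star, 1.0, m), rng.normal(theta_star, 1.0, m)
S = (X + Y) / 2.0

# m(theta) = P(ell_{theta*} * ell_theta < 0) = P(S strictly between theta and theta*);
# the ratio m(theta)/|theta - theta*| approaches the density of S at theta*, 1/sqrt(pi) ~ 0.56.
for gap in [0.4, 0.2, 0.1, 0.05]:
    m_theta = np.mean((S > theta_star) & (S < theta_star + gap))
    print(f"|theta - theta*| = {gap:<5} m(theta) ~ {m_theta:.4f}  ratio {m_theta / gap:.3f}")
```

The linear decay of $m(\theta)$ in $\|\theta - \theta^\star\|$ is exactly the first-order term that Assumption 5.2 postulates.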
**Positive examples.** While Assumption 5.2 is restrictive, even when using the reward $r_\theta = \log p_\theta$, it holds for $\mathcal{F}_{\mathrm{Lap},b}$ (Appendix G), i.e., $\widetilde{\mathrm{BC}}(\theta^\star, \theta) = |\theta^\star - \theta|/(2b) + O(|\theta^\star - \theta|^2)$, as well as for $\mathcal{F}_{\mathcal{N},\Sigma}$ (Appendix F), i.e., $\widetilde{\mathrm{BC}}(\theta^\star, \theta) = 2e^{-\|\theta^\star - \theta\|_\Sigma^2/4}\, F_{\theta^\star,u}(\|\theta^\star - \theta\|_\Sigma)$, with $F_{\theta^\star,u}(\varepsilon) \ge \varepsilon/A_{\theta^\star}$ and $\alpha_{\mathcal{F}}(d) = A_{\theta^\star}$.

**Dimensionality gap.** While the lower bound in Theorem 5.3 scales as $\Omega(\alpha_{\mathcal{F}}(k)\sqrt{k}/n)$ for $n \gtrsim \alpha_{\mathcal{F}}(k)^2$, the upper bound in Theorem 4.8 scales as $O(A_{\theta^\star} k/n)$ for $n \gtrsim A_{\theta^\star} k^{3/2}$. Even for the simple case of Gaussian distributions, where $A_{\theta^\star} = \alpha_{\mathcal{F}}(d)$, there is a dimensionality gap. Closing this gap is an important direction for future work. Improvements might come from a tighter analysis, e.g., both for the upper and lower bounds, or from the derivation of better estimators based on deterministic preferences.

## 6. Experiments

In this section, we compare the empirical performance of the different estimators introduced in this paper. For preferences based on $r_\theta = \log p_\theta$, we conduct a set of experiments for Gaussian distributions, and defer to Appendix H.1 for experiments on Laplace and Rayleigh distributions. In particular, we consider a uniformly drawn mean parameter $\theta^\star \sim \mathcal{U}([1, 2]^d)$ and the isotropic covariance $\Sigma = I_d$. For sample sizes $n \in [N_{\max}]$ with $N_{\max} = 10^4$, we compute the estimation errors $\|\hat\theta_n - \theta^\star\|_2$. We repeat this process for $N_{\mathrm{runs}}$ different instances and for various choices of $d$.

For $\mathcal{F}_{\mathcal{N},I_d}$ (Appendix F), the M-estimators can be implemented as $\hat\theta^{\mathrm{SO}}_n = \frac{1}{2n}\sum_{i \in [n]}(X_i + Y_i)$ and

$$\hat\theta^{\mathrm{SP}}_n = \arg\min_\theta \left\{\|\theta - \hat\theta^{\mathrm{SO}}_n\|_2^2 - \frac{1}{n}\sum_{i \in [n]}\log\sigma(Z_i \ell_\theta(X_i, Y_i))\right\},$$

where $\ell_\theta(X_i, Y_i) = \langle X_i - Y_i,\ \theta - (X_i + Y_i)/2\rangle$. Then, the estimators based on $\mathcal{C}_n = \{\theta \mid \forall i \in [n],\ Z_i \ell_\theta(X_i, Y_i) \ge 0\}$ are $\hat\theta^{\mathrm{DP}}_n = \arg\min_{\theta \in \mathcal{C}_n}\|\theta - \hat\theta^{\mathrm{SO}}_n\|_2^2$ and an arbitrary estimator $\hat\theta^{\mathrm{AE}}_n \in \mathcal{C}_n$. As $\mathcal{C}_n$ is an interval for $d = 1$, we use the randomized uniform (RU) estimator, i.e., $\hat\theta^{\mathrm{RU}}_n \sim \mathcal{U}(\mathcal{C}_n)$. We also consider the worst-case estimator (WE), defined as $\hat\theta^{\mathrm{WE}}_n := \arg\max_{\theta \in \mathcal{C}_n}\|\theta - \theta^\star\|$. While it is not a valid estimator due to its $\theta^\star$ dependency, it serves as a proxy for the worst estimation error in $\mathcal{C}_n$.

[Figure 1: Estimation errors for $\mathcal{N}(\theta^\star, I_d)$ where $\theta^\star \sim \mathcal{U}([1, 2]^d)$, for (a) $d = 1$ with $N_{\mathrm{runs}} = 10^3$, and (b) $d = 20$ with $N_{\mathrm{runs}} = 10^2$.]

**Dependency on sample size.** Figure 1(a) confirms empirically the difference in estimation rate between the M-estimators (SO MLE and SP MLE), which obtain $O(1/\sqrt{n})$, and our estimators based on $\mathcal{C}_n$, which achieve $O(1/n)$. However, Figure 1(b) also reveals that the performance of AE and WE deteriorates quickly at small sample sizes when the dimension increases. In contrast, DP MLE consistently outperforms all the other estimators, including SO MLE, as theoretically shown in Lemma 4.1. While SPdet outperforms SO MLE, Figure 1 also reveals that SP performs worse than SO MLE for finite sample sizes. Therefore, only an M-estimator based on deterministic preferences improves on sample-only M-estimators empirically. This further highlights the weakness of asymptotic results compared to non-asymptotic guarantees. While Figure 1(a) suggests that RU and WE perform on par with DP MLE, Figures 1(b) and 2 highlight that DP MLE outperforms WE and AE for larger dimensions, where the gap increases when $d$ is non-negligible compared to $n$. We conjecture that RU suffers from the same limitation as AE for larger $d$. As empirical evidence, we study other estimators that disentangle the effect of RU's randomness versus its mean behavior; see Appendix H.2.
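For $d > 1$, the projection defining DP MLE can be computed with any quadratic-programming routine. Here is a minimal sketch (ours, using SciPy's generic SLSQP solver as a stand-in for the dedicated solvers cited in the references):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
d, n = 5, 2000
theta_star = rng.uniform(1.0, 2.0, d)
X = rng.normal(theta_star, 1.0, (n, d))
Y = rng.normal(theta_star, 1.0, (n, d))
S = (X + Y) / 2.0
Z = np.sign(np.einsum('ij,ij->i', X - Y, theta_star - S))  # deterministic preferences

theta_so = S.mean(axis=0)                   # SO MLE
A = Z[:, None] * (X - Y)                    # C_n = {theta : A @ theta >= b}
b = np.einsum('ij,ij->i', A, S)

# DP MLE: Euclidean projection of SO MLE onto the polytope C_n
res = minimize(lambda t: np.sum((t - theta_so) ** 2), x0=theta_so,
               jac=lambda t: 2.0 * (t - theta_so),
               constraints={'type': 'ineq', 'fun': lambda t: A @ t - b,
                            'jac': lambda t: A},
               method='SLSQP')
theta_dp = res.x
print("SO error:", np.linalg.norm(theta_so - theta_star),
      " DP error:", np.linalg.norm(theta_dp - theta_star))
```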
**Dependency on dimension.**

[Figure 2: Estimation errors as a function of $d$ with $\mathcal{N}(\theta^\star, I_d)$ where $\theta^\star \sim \mathcal{U}([1, 2]^d)$, for $n = 10^4$ and $N_{\mathrm{runs}} = 10^3$.]

Figure 2 strengthens the aforementioned empirical observations. For fixed sample size and increasing dimension, DP MLE is the only estimator obtaining the best-of-both-worlds estimation error rate, i.e., $O(\min\{d^{3/2}/n, \sqrt{d/n}\})$.

**Covariance gap.** We show that the covariance gap between SP MLE and SO MLE is relatively mild: $(\Sigma^{\mathrm{SP}}_{\mathrm{Lap}(0,1)}, \Sigma^{\mathrm{SPdet}}_{\mathrm{Lap}(0,1)}) \approx (0.16, 0.08)$ and $(\Sigma^{\mathrm{SP}}_{\mathcal{N}(0,1)}, \Sigma^{\mathrm{SPdet}}_{\mathcal{N}(0,1)}, R^{\mathrm{SPdet}}_{\mathcal{N}(0,1)}) \approx (0.17, 0.08, 0.10)$. Moreover, $\Sigma^{\mathrm{SP}}_{\mathcal{N}(0_d,I_d)}$, $\Sigma^{\mathrm{SPdet}}_{\mathcal{N}(0_d,I_d)}$ and $R^{\mathrm{SPdet}}_{\mathcal{N}(0_d,I_d)}$ are close to $\alpha_d I_d$, where $\alpha_d > 0$ is decreasing in $d$ (see Figure 3 in Appendix F). In addition to the empirical gap being small for moderate values of $n$, the asymptotic gaps between SO MLE and SP MLE are mild.

**Supplementary experiments.** Following the approach of Tang et al. (2024a), we compare estimators using other convex surrogates of the 0-1 loss (Appendix H.3): they all perform similarly. For the logistic loss, we showcase the mild impact of normalization and regularization (Appendix H.4).

## 7. Perspectives

This work investigates the role of preference feedback in parameter estimation for continuous parametric distributions. We establish conditions under which preference-based estimators outperform sample-only methods. For stochastic preferences, the preference-based MLE achieves a lower asymptotic variance than its sample-only counterpart. For deterministic preferences, we demonstrate that preference-based estimators can significantly accelerate parameter estimation, achieving an improved $O(1/n)$ convergence rate compared to the $O(1/\sqrt{n})$ rate of M-estimators. Our lower bound analysis further confirms that this acceleration is minimax optimal up to dimension-dependent constants.

While our results provide a solid theoretical foundation, several open questions remain. A finer analysis of estimators beyond the M-estimation framework and of the geometry of their constraint sets would allow us to better quantify the properties of DP MLE, and provide insights for designing improved estimators that better leverage deterministic preferences. Additionally, exploring alternative preference functions beyond the log-probability gap could extend the applicability of our results. Finally, a key challenge for future work is to quantify the benefits of preference-based estimation for discrete distributions. For distributions with small support, preference feedback may only localize the unknown parameter within a subset of the simplex, leading to diminishing information gains as the sample size increases. However, understanding how preference-based estimators perform in finite-sample settings, particularly in high-dimensional problems, remains an interesting open problem. Addressing these questions could provide further insights into the role of preferences in machine learning and statistical estimation.

## Acknowledgements

We thank Jaouad Mourtada for insightful early discussions on the univariate Gaussian case. This work was supported by the Swiss National Science Foundation (grant number 212111) and by an unrestricted gift from Google.

## Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

## References

Assouad, P. Deux remarques sur l'estimation. C. R. Acad. Sci. Paris Sér. I Math., 296(23):1021-1024, 1983.

Azar, M. G., Guo, Z. D., Piot, B., Munos, R., Rowland, M., Valko, M., and Calandriello, D. A general theoretical paradigm to understand learning from human preferences.
In International Conference on Artificial Intelligence and Statistics, pp. 4447-4455. PMLR, 2024.

Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., et al. Constitutional AI: harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022.

Bezanson, J., Edelman, A., Karpinski, S., and Shah, V. B. Julia: a fresh approach to numerical computing. SIAM Review, 59(1):65-98, 2017. doi: 10.1137/141000671. URL https://epubs.siam.org/doi/10.1137/141000671.

Birgé, L. and Massart, P. Rates of convergence for minimum contrast estimators. Probability Theory and Related Fields, 97:113-150, 1993.

Bradley, R. A. and Terry, M. E. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324-345, 1952.

Cramér, H. Mathematical Methods of Statistics, volume 9 of Princeton Mathematical Series. Princeton University Press, Princeton, NJ, 1946.

Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

Dvoretzky, A., Kiefer, J., and Wolfowitz, J. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. The Annals of Mathematical Statistics, 27(3):642-669, 1956.

Fano, R. Class notes for transmission of information. In Course 6.574. MIT, Cambridge, MA, 1952.

Faury, L., Abeille, M., Calauzènes, C., and Fercoq, O. Improved optimistic algorithms for logistic bandits. In International Conference on Machine Learning, pp. 3052-3060. PMLR, 2020.

Feldman, V., Guruswami, V., Raghavendra, P., and Wu, Y. Agnostic learning of monomials by halfspaces is hard. SIAM Journal on Computing, 41(6):1558-1590, 2012. doi: 10.1137/120865094.

Ge, L., Juba, B., and Vorobeychik, Y. Learning linear utility functions from pairwise comparison queries. arXiv preprint arXiv:2405.02612, 2024.

Gilbert, E. N. A comparison of signalling alphabets. The Bell System Technical Journal, 31(3):504-522, 1952. doi: 10.1002/j.1538-7305.1952.tb01393.x.

Gorbatovski, A., Shaposhnikov, B., Sinii, V., Malakhov, A., and Gavrilov, D. The differences between direct alignment algorithms are a blur. arXiv preprint arXiv:2502.01237, 2025.

Hajek, B., Oh, S., and Xu, J. Minimax-optimal inference from partial rankings. Advances in Neural Information Processing Systems, 27, 2014.

Hong, J., Lee, N., and Thorne, J. ORPO: monolithic preference optimization without reference model. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 11170-11189, 2024.

Huang, A., Block, A., Foster, D. J., Rohatgi, D., Zhang, C., Simchowitz, M., Ash, J. T., and Krishnamurthy, A. Self-improvement in language models: the sharpening mechanism. In NeurIPS 2024 Workshop on Mathematics of Modern Machine Learning, 2024.

Huangfu, Q. and Hall, J. J. Parallelizing the dual revised simplex method. Mathematical Programming Computation, 10(1):119-142, 2018.

Hunter, D. R. MM algorithms for generalized Bradley-Terry models. The Annals of Statistics, 32(1):384-406, 2004.

Ibragimov, I. A. and Has'minskii, R. Z. Statistical estimation: asymptotic theory, volume 16. Springer Science & Business Media, 2013.

Ivison, H., Wang, Y., Liu, J., Wu, Z., Pyatkin, V., Lambert, N., Smith, N. A., Choi, Y., and Hajishirzi, H. Unpacking DPO and PPO: disentangling best practices for learning from preference feedback.
In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

Lambert, N. Reinforcement Learning from Human Feedback. Online, 2024. URL https://rlhfbook.com.

Le Cam, L. Convergence of estimates under dimensionality restrictions. The Annals of Statistics, pp. 38-53, 1973.

Lehmann, E. L. and Casella, G. Theory of point estimation. Springer Science & Business Media, 2006.

Lubin, M., Dowson, O., Garcia, J. D., Huchette, J., Legat, B., and Vielma, J. P. JuMP 1.0: recent improvements to a modeling language for mathematical optimization. Mathematical Programming Computation, 15(3):581-589, 2023.

Mao, C., Weed, J., and Rigollet, P. Minimax rates and efficient algorithms for noisy sorting. In Algorithmic Learning Theory, pp. 821-847. PMLR, 2018.

The mathlib Community. The Lean mathematical library. In Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs, POPL '20. ACM, January 2020. doi: 10.1145/3372885.3373824. URL http://dx.doi.org/10.1145/3372885.3373824.

Meng, Y., Xia, M., and Chen, D. SimPO: simple preference optimization with a reference-free reward. Advances in Neural Information Processing Systems, 37:124198-124235, 2024.

Munos, R., Valko, M., Calandriello, D., Azar, M. G., Rowland, M., Guo, Z. D., Tang, Y., Geist, M., Mesnard, T., Fiegel, C., et al. Nash learning from human feedback. In International Conference on Machine Learning, pp. 36743-36768. PMLR, 2024.

Negahban, S., Oh, S., and Shah, D. Iterative ranking from pair-wise comparisons. Advances in Neural Information Processing Systems, 25, 2012.

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730-27744, 2022.

Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., and Finn, C. Direct preference optimization: your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728-53741, 2023.

Rajkumar, A. and Agarwal, S. A statistical convergence perspective of algorithms for rank aggregation from pairwise data. In International Conference on Machine Learning, pp. 118-126. PMLR, 2014.

Rao, C. R. Information and the accuracy attainable in the estimation of statistical parameters. In Breakthroughs in Statistics: Foundations and Basic Theory, pp. 235-247. Springer, 1992.

Saha, A., Pacchiano, A., and Lee, J. Dueling RL: reinforcement learning with trajectory preferences. In International Conference on Artificial Intelligence and Statistics, pp. 6263-6289. PMLR, 2023.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Shah, N. B. and Wainwright, M. J. Simple, robust and optimal ranking from pairwise comparisons. Journal of Machine Learning Research, 18(199):1-38, 2018.

Shah, N. B., Balakrishnan, S., Bradley, J., Parekh, A., Ramchandran, K., Wainwright, M. J., et al. Estimation from pairwise comparisons: sharp minimax bounds with topology dependence. Journal of Machine Learning Research, 17(58):1-47, 2016.

Spokoiny, V. Parametric estimation. Finite sample theory. The Annals of Statistics, 40(6):2877-2909, 2012.

Swamy, G., Dann, C., Kidambi, R., Wu, S., and Agarwal, A. A minimaximalist approach to reinforcement learning from human feedback.
In Forty-first International Conference on Machine Learning, 2024.

Swamy, G., Choudhury, S., Sun, W., Wu, Z. S., and Bagnell, J. A. All roads lead to likelihood: the value of reinforcement learning in fine-tuning. arXiv preprint arXiv:2503.01067, 2025.

Tang, Y., Guo, Z. D., Zheng, Z., Calandriello, D., Munos, R., Rowland, M., Richemond, P. H., Valko, M., Avila Pires, B., and Piot, B. Generalized preference optimization: a unified approach to offline alignment. In Salakhutdinov, R., Kolter, Z., Heller, K., Weller, A., Oliver, N., Scarlett, J., and Berkenkamp, F. (eds.), Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pp. 47725-47742. PMLR, 21-27 Jul 2024a.

Tang, Y., Guo, Z. D., Zheng, Z., Calandriello, D., Munos, R., Rowland, M., Richemond, P. H., Valko, M., Pires, B. Á., and Piot, B. Generalized preference optimization: a unified approach to offline alignment. arXiv preprint arXiv:2402.05749, 2024b.

The Sage Developers. SageMath, the Sage Mathematics Software System (Version 9.7), 2022. https://www.sagemath.org.

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

Tsybakov, A. B. Nonparametric estimators. Introduction to Nonparametric Estimation, pp. 1-76, 2009.

Van der Vaart, A. W. Asymptotic statistics, volume 3. Cambridge University Press, 2000.

Vershynin, R. High-dimensional probability: an introduction with applications in data science, volume 47. Cambridge University Press, 2018.

Wächter, A. and Biegler, L. T. On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Mathematical Programming, 106:25-57, 2006.

Wainwright, M. J. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.

Wang, R., Sun, J., Hua, S., and Fang, Q. ASFT: aligned supervised fine-tuning through absolute likelihood. arXiv preprint arXiv:2409.10571, 2024.

Wasserman, L. All of statistics: a concise course in statistical inference. Springer Science & Business Media, 2013.

Yao, Y., He, L., and Gastpar, M. Leveraging sparsity for sample-efficient preference learning: a theoretical perspective. arXiv preprint arXiv:2501.18282, 2025.

Zhu, B., Jordan, M., and Jiao, J. Principled reinforcement learning with human feedback from pairwise or k-wise comparisons. In International Conference on Machine Learning, pp. 43037-43067. PMLR, 2023.

Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., and Irving, G. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.

## A. Outline

The appendices are organized as follows:

- In Appendix B, we provide detailed discussions on our assumptions, the sources of misspecification and other reward models.
- In Appendix C, we prove the general results presented in Section 3, such as Lemma 3.1.
- In Appendix D, we focus on Section 4 and detail the proofs of Lemma 4.6 and Theorem 4.8.
- In Appendix E, we prove the results presented in Section 5.
- In Appendix F, for $\mathcal{F}_{\mathcal{N},\Sigma}$ and preferences based on $r_\theta = \log p_\theta$, we prove all the assumptions introduced in this paper.
- In Appendix G, for $\mathcal{F}_{\mathrm{Lap},b}$ and preferences based on $r_\theta = \log p_\theta$, we prove all the assumptions introduced in this paper.
- In Appendix H, we provide supplementary experiments to support our theoretical findings.

## B. Extended Discussions

We provide detailed discussions on how to verify or weaken our assumptions (Appendix B.1), the sources of misspecification (Appendix B.2) and reward models other than the log-likelihood (Appendix B.3).

### B.1. Verifying or Weakening our Assumptions

Since our assumptions are restrictive, it is natural to wonder how they can be verified or weakened.

**Verifying our assumptions.** Even when closed-form definitions of $p_\theta$ and $\ell_\theta$ are given, our assumptions are challenging to verify; hence we suggest using a formal verifier (e.g., Lean (mathlib Community, 2020)) or computer algebra software (e.g., SageMath (The Sage Developers, 2022)). Empirically, they can be confirmed or rejected by sampling from $p^{\otimes 2}_{\theta^\star}$. Assumption 4.4 is rejected by exhibiting $(X_i, Y_i) \in \widetilde{\mathcal{D}}(\theta^\star, \theta) \setminus \mathcal{D}(\theta^\star, \theta)$. Assumptions 4.2 and 4.5 are confirmed by finding $(X_i, Y_i) \in \mathcal{D}(\theta^\star, \theta)$ and $(X_i, Y_i) \in \mathcal{G}_1(\theta^\star, u)$. The sampling complexity of those tests scales as the inverse of the corresponding event's probability. Using the Dvoretzky-Kiefer-Wolfowitz inequality (Dvoretzky et al., 1956), $F_{\theta^\star,u}$ can be estimated to verify that Assumption 4.7 holds.

**Restrictive assumptions.** When studying DP MLE only, we conjecture that the global Assumptions 4.2 and 4.4 can be weakened to local versions. Using time-uniform concentration results, we can build a sequence of shrinking confidence regions $(\mathcal{R}_n)_n$ around SO MLE that contain $\theta^\star$ for all times $n$ with high probability. Then, we modify DP MLE to be constrained to $\mathcal{R}_n \cap \mathcal{C}_n$. For $n$ large enough and with high probability, $\mathcal{R}_n \cap \mathcal{C}_n$ will be included in a local neighborhood of $\theta^\star$ on which the local Assumptions 4.2 and 4.4 are satisfied. Given that Assumption 4.4 amounts to ignoring the remainder term in a first-order Taylor expansion, assuming a local version is a significantly weaker requirement.

### B.2. Sources of Misspecification

There are several possible sources of misspecification not taken into account by our current analysis.

**Preference model.** The Bradley-Terry model that uses reward-based preferences has limited expressivity, as it doesn't allow for intransitive preferences. Even when individuals exhibit transitive preferences, their averaged preferences can be intransitive due to disagreements; see Munos et al. (2024) or Swamy et al. (2024).

**Parameter space.** When $\theta^\star \notin \Theta$, the deterministic preferences might not provide separability within $\Theta$. The definition of DP MLE should be modified to combine the cross-entropy loss and the 0-1 classification loss, i.e.,

$$\hat\theta^{\mathrm{DP}}_n \in \arg\min_{\theta \in \Theta}\ L^{\mathrm{SO}}_n(\theta) + \lambda \sum_{i \in [n]}\mathbb{1}\left(Z_i \ell_\theta(X_i, Y_i) < 0\right), \tag{5}$$

where $\lambda > 0$ trades off the two losses. Equation (5) is reminiscent of single-stage alignment procedures such as ORPO (Hong et al., 2024) and ASFT (Wang et al., 2024); see, e.g., Gorbatovski et al. (2025). Without separability, solving Equation (5) can be NP-hard. Under sufficient regularity, $\hat\theta^{\mathrm{DP}}_n$ converges to $\theta_0 \in \arg\min_{\theta \in \Theta}\{\mathrm{KL}(\theta^\star, \theta) + \lambda m(\theta)\}$, where $m(\theta) = \mathbb{P}_{p^{\otimes 2}_{\theta^\star}}(\mathcal{D}(\theta^\star, \theta))$ and, in general, $\theta_0 \neq \theta^\star$. As $\theta \mapsto m(\theta)$ can be non-convex, computing $\theta_0$ might be challenging. Deriving a tractable ELBO method for this optimization is an interesting direction to obtain tractable and robust estimators. As $\theta_0$ lies on the boundary of $\Theta$, we should control the maximal deviation with respect to $\theta_0$ for directions that point towards the interior of $\Theta$ to prove an accelerated rate. While parts of our analysis could be used, we believe that finer technical arguments are required.
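A minimal sketch (ours) of the regularized objective in Equation (5) for a misspecified scalar Gaussian, solved by grid search since the 0-1 term makes the objective non-convex:

```python
import numpy as np

rng = np.random.default_rng(8)
# Misspecified parameter space: theta* lies outside Theta = [0, 1]
theta_star, n, lam = 1.5, 500, 1.0
X, Y = rng.normal(theta_star, 1.0, n), rng.normal(theta_star, 1.0, n)
S = (X + Y) / 2.0
Z = np.sign((X - Y) * (theta_star - S))

def objective(theta: float) -> float:
    nll = np.sum((X - theta)**2 + (Y - theta)**2) / 2.0  # L_n^SO up to constants
    zero_one = np.sum(Z * (X - Y) * (theta - S) < 0)     # preference misclassifications
    return nll + lam * zero_one

grid = np.linspace(0.0, 1.0, 2001)                        # Theta = [0, 1]
theta_dp = grid[np.argmin([objective(t) for t in grid])]
print("regularized DP quasi-MLE:", theta_dp)              # accumulates at the boundary of Theta
```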
**Parametric model.** The true distribution $p^\star$ of the observations might not even be a member of our class of distributions $\mathcal{F}$, i.e., $p^\star \notin \mathcal{F}$. These situations occur when $\mathcal{F}$ doesn't contain the true structure, e.g., for other parametric or non-parametric classes of distributions. Then, $L^{\mathrm{SO}}_n(\theta)$ can be interpreted as a quasi-log-likelihood term. Let us denote by SO quasi-MLE the estimator based on SO MLE for this quasi-log-likelihood. Under sufficient regularity, SO quasi-MLE converges towards $\theta_0 \in \arg\min_{\theta \in \Theta}\mathrm{KL}(p^\star, p_\theta)$, where $p_{\theta_0} \in \mathcal{F}$ differs from $p^\star$ in general. Without the separability granted by well-specified deterministic preferences, we define DP quasi-MLE as in Equation (5). Under sufficient regularity, DP quasi-MLE converges towards the minimizer of a similar optimization problem combining the KL term and a misspecified equivalent of $m(\theta)$.

### B.3. Reward Models

Except for Theorem 4.3, all the derivations in Section 4 hold for general (hence reward-based) preference models provided Assumptions 4.2, 4.4, 4.5 and 4.7 hold. Characterizing the expressivity of parametric rewards satisfying those assumptions is interesting, yet challenging. We provide two positive examples and one negative example.

**Positive: monotonic reward.** Suppose that $\tilde\ell_\theta(x, y) = f(p_\theta(x)) - f(p_\theta(y))$, where $f$ is increasing on $[0, 1]$. Since $\mathrm{sign}(\tilde\ell_\theta) = \mathrm{sign}(\ell_\theta)$ for the log-likelihood preference $\ell_\theta$, the parameters with zero classification loss, and hence our estimators, are the same. Therefore, our results hold for this class of rewards whenever our assumptions hold for the log-likelihood reward. When $f$ is decreasing, the preferences are reversed, and similar arguments can be made. This example includes (1) normalization by a multiplicative constant (e.g., a temperature $\beta$) and (2) the odds-ratio reward-based preference based on $f(x) = \log(x/(1-x))$ used by ORPO in Hong et al. (2024).

**Positive: margin with Gaussian.** Suppose that $\tilde\ell_\theta = \ell_\theta + c$, where $c$ is a constant and $\ell_\theta$ is the Gaussian log-likelihood preference. By extending our computations from Appendix F, Assumptions 4.2, 4.4, 4.5 and 4.7 hold with $c$-dependent positive constants. Margins are used by SimPO from Meng et al. (2024) and IPO from Azar et al. (2024).

**Negative: reference model with Gaussian.** Suppose that $\tilde\ell_\theta = \ell_\theta - \ell_{\theta_0}$, where $\theta_0$ is known and $\ell_\theta$ is the Gaussian log-likelihood preference. Since $\tilde\ell_\theta(x, y) = \langle x - y, \theta - \theta_0\rangle$ and $\nabla_\theta \tilde\ell_\theta(x, y) = x - y$, Assumption 4.5 is violated for $u = (\theta^\star - \theta_0)/\|\theta^\star - \theta_0\|$, i.e., $\mathbb{P}_{p^{\otimes 2}_{\theta^\star}}(\mathcal{G}_1(\theta^\star, u)) = 0$. Not all direct alignment algorithms rely on a reference model; see, e.g., SimPO and ORPO.

## C. Proofs of Section 3

### C.1. Stochastic Preferences

Under enough regularity, by swapping the integration and differentiation operators, we can show that

$$\mathbb{E}_{q_{\theta,h_{\mathrm{sto}}}}[\nabla_\theta \log q_{\theta,h_{\mathrm{sto}}}] = \nabla_\theta 1 = 0_k \quad \text{and} \quad \mathbb{E}_{q_{\theta,h_{\mathrm{sto}}}}[-\nabla^2_\theta \log q_{\theta,h_{\mathrm{sto}}}] = \mathbb{E}_{q_{\theta,h_{\mathrm{sto}}}}[\nabla_\theta \log q_{\theta,h_{\mathrm{sto}}} \nabla_\theta \log q_{\theta,h_{\mathrm{sto}}}^{\mathsf{T}}].$$

Below, we detail the proof of Lemma 3.1.

*Proof.* Direct computation yields

$$\nabla_\theta \log q_{\theta^\star}(x, y, z) = \nabla_\theta \log p^{\otimes 2}_{\theta^\star}(x, y) + z\,\sigma(-z\ell_{\theta^\star}(x, y))\,\nabla_\theta \ell_{\theta^\star}(x, y),$$

$$\nabla^2_\theta \log q_{\theta^\star}(x, y, z) = \nabla^2_\theta \log p^{\otimes 2}_{\theta^\star}(x, y) - \sigma(\ell_{\theta^\star}(x, y))\sigma(-\ell_{\theta^\star}(x, y))\,\nabla_\theta \ell_{\theta^\star}(x, y)\nabla_\theta \ell_{\theta^\star}(x, y)^{\mathsf{T}} + z\,\sigma(-z\ell_{\theta^\star}(x, y))\,\nabla^2_\theta \ell_{\theta^\star}(x, y),$$

where we used that $g'(x) = \sigma(-x)$ and $g''(x) = -\sigma'(-x) = -\sigma(x)\sigma(-x)$ with $g(x) = \log\sigma(x)$. By definition of $h_{\mathrm{sto}}$,

$$\mathbb{E}_{Z\mid(X,Y)}\left[Z\,\sigma(-Z\ell_{\theta^\star}(X, Y))\right]\nabla^2_\theta \ell_{\theta^\star}(X, Y) = \big(\sigma(\ell_{\theta^\star}(X, Y))\sigma(-\ell_{\theta^\star}(X, Y)) - \sigma(-\ell_{\theta^\star}(X, Y))\sigma(\ell_{\theta^\star}(X, Y))\big)\nabla^2_\theta \ell_{\theta^\star}(X, Y) = 0_{d \times d}.$$

Therefore, we have $\mathcal{I}(q_{\theta^\star,h_{\mathrm{sto}}}) = \mathcal{I}(p^{\otimes 2}_{\theta^\star}) + \Sigma^{\mathrm{SP}}_{\theta^\star}$ with $\Sigma^{\mathrm{SP}}_{\theta^\star} = \mathbb{E}_{p^{\otimes 2}_{\theta^\star}}[\sigma(\ell_{\theta^\star})\sigma(-\ell_{\theta^\star})\nabla_\theta \ell_{\theta^\star}\nabla_\theta \ell_{\theta^\star}^{\mathsf{T}}]$.
C.2. Deterministic Preferences

Consistency of SPdet. Let $M(\theta) := \mathbb E_{p^{\otimes 2}_{\theta^\star}}[\log q_{\theta,h_{\mathrm{sto}}}(X, Y, \mathrm{sign}(\ell_{\theta^\star}(X, Y)))]$. Under enough regularity, we obtain $\mathbb E_{p^{\otimes 2}_{\theta^\star}}[\nabla_\theta \log p^{\otimes 2}_{\theta^\star}] = 0_k$ and
$$\nabla_\theta M(\theta^\star) = \mathbb E_{p^{\otimes 2}_{\theta^\star}}\left[\mathrm{sign}(\ell_{\theta^\star}(X, Y))\,\sigma(-|\ell_{\theta^\star}(X, Y)|)\,\nabla_\theta \ell_{\theta^\star}(X, Y)\right].$$
Therefore, $\theta^\star$ is the unique maximizer of $M(\theta)$ if $\nabla_\theta M(\theta^\star) = 0_k$, i.e., if (3) holds true.

Asymptotic normality of SPdet. Provided (3) holds, under enough regularity, the theory of M-estimators yields that
$$\sqrt n\,(\widehat\theta^{\mathrm{SPdet}}_n - \theta^\star) \xrightarrow[n \to +\infty]{} \mathcal N\left(0_k,\ V_{1,\theta^\star}^{-1} V_{2,\theta^\star} V_{1,\theta^\star}^{-1}\right),$$
$$V_{1,\theta^\star} = -\mathbb E_{(X,Y)\sim p^{\otimes 2}_{\theta^\star}}\left[\nabla^2_\theta \log q_{\theta^\star,h_{\mathrm{sto}}}(X, Y, \mathrm{sign}(\ell_{\theta^\star}(X, Y)))\right],$$
$$V_{2,\theta^\star} = \mathbb E_{(X,Y)\sim p^{\otimes 2}_{\theta^\star}}\left[\nabla_\theta \log q_{\theta^\star,h_{\mathrm{sto}}}(X, Y, \mathrm{sign}(\ell_{\theta^\star}(X, Y)))\,\nabla_\theta \log q_{\theta^\star,h_{\mathrm{sto}}}(X, Y, \mathrm{sign}(\ell_{\theta^\star}(X, Y)))^{\mathsf T}\right].$$
Below, we detail the proof of Lemma 3.2.

Proof. Combining $z = \mathrm{sign}(\ell_{\theta^\star}(x, y))$ with the same manipulations as above yields
$$\nabla_\theta \log q_{\theta^\star,h_{\mathrm{sto}}}(x, y, z) = \nabla_\theta \log p^{\otimes 2}_{\theta^\star}(x, y) + \mathrm{sign}(\ell_{\theta^\star}(x, y))\,\sigma(-|\ell_{\theta^\star}(x, y)|)\,\nabla_\theta \ell_{\theta^\star}(x, y),$$
$$\nabla_\theta \log q_{\theta^\star,h_{\mathrm{sto}}}\,\nabla_\theta \log q_{\theta^\star,h_{\mathrm{sto}}}^{\mathsf T} = \nabla_\theta \log p^{\otimes 2}_{\theta^\star}\,(\nabla_\theta \log p^{\otimes 2}_{\theta^\star})^{\mathsf T} + \sigma(-|\ell_{\theta^\star}|)^2\,\nabla_\theta \ell_{\theta^\star}(\nabla_\theta \ell_{\theta^\star})^{\mathsf T} + \mathrm{sign}(\ell_{\theta^\star})\,\sigma(-|\ell_{\theta^\star}|)\left(\nabla_\theta \log p^{\otimes 2}_{\theta^\star}(\nabla_\theta \ell_{\theta^\star})^{\mathsf T} + \nabla_\theta \ell_{\theta^\star}(\nabla_\theta \log p^{\otimes 2}_{\theta^\star})^{\mathsf T}\right),$$
$$\nabla^2_\theta \log q_{\theta^\star,h_{\mathrm{sto}}}(x, y, z) = \nabla^2_\theta \log p^{\otimes 2}_{\theta^\star}(x, y) - \sigma(\ell_{\theta^\star}(x, y))\sigma(-\ell_{\theta^\star}(x, y))\,\nabla_\theta \ell_{\theta^\star}(\nabla_\theta \ell_{\theta^\star})^{\mathsf T} + \mathrm{sign}(\ell_{\theta^\star}(x, y))\,\sigma(-|\ell_{\theta^\star}(x, y)|)\,\nabla^2_\theta \ell_{\theta^\star}(x, y),$$
where we used that $g'(x) = \sigma(-x)$ and $g''(x) = -\sigma'(-x) = -\sigma(x)\sigma(-x)$ with $g(x) = \log\sigma(x)$. Using that $\sigma(-|\ell_{\theta^\star}|)^2 = \sigma(-|\ell_{\theta^\star}|) - \sigma(-\ell_{\theta^\star})\sigma(\ell_{\theta^\star})$, we have
$$V_{1,\theta^\star} = I(p^{\otimes 2}_{\theta^\star}) + \Sigma^{\mathrm{SP}}_{\theta^\star} - H^{\mathrm{SPdet}}_{\theta^\star} \quad \text{and} \quad V_{2,\theta^\star} = I(p^{\otimes 2}_{\theta^\star}) + M_{2,\theta^\star} - \Sigma^{\mathrm{SP}}_{\theta^\star} - R^{\mathrm{SPdet}}_{\theta^\star},$$
where $\Sigma^{\mathrm{SP}}_{\theta^\star} = \mathbb E_{p^{\otimes 2}_{\theta^\star}}[\sigma(\ell_{\theta^\star})\sigma(-\ell_{\theta^\star})\,\nabla_\theta \ell_{\theta^\star}(\nabla_\theta \ell_{\theta^\star})^{\mathsf T}]$ as in Lemma 3.1, and we define
$$H^{\mathrm{SPdet}}_{\theta^\star} = \mathbb E_{p^{\otimes 2}_{\theta^\star}}\left[\mathrm{sign}(\ell_{\theta^\star})\,\sigma(-|\ell_{\theta^\star}|)\,\nabla^2_\theta \ell_{\theta^\star}\right], \quad M_{2,\theta^\star} = \mathbb E_{p^{\otimes 2}_{\theta^\star}}\left[\sigma(-|\ell_{\theta^\star}|)\,\nabla_\theta \ell_{\theta^\star}(\nabla_\theta \ell_{\theta^\star})^{\mathsf T}\right],$$
$$R^{\mathrm{SPdet}}_{\theta^\star} = -\mathbb E_{p^{\otimes 2}_{\theta^\star}}\left[\mathrm{sign}(\ell_{\theta^\star})\,\sigma(-|\ell_{\theta^\star}|)\left(\nabla_\theta \log p^{\otimes 2}_{\theta^\star}(\nabla_\theta \ell_{\theta^\star})^{\mathsf T} + \nabla_\theta \ell_{\theta^\star}(\nabla_\theta \log p^{\otimes 2}_{\theta^\star})^{\mathsf T}\right)\right].$$
Using that $I(q_{\theta^\star,h_{\mathrm{sto}}}) = I(p^{\otimes 2}_{\theta^\star}) + \Sigma^{\mathrm{SP}}_{\theta^\star}$ (Lemma 3.1), SPdet is asymptotically better than SP if and only if $V_{1,\theta^\star}^{-1}V_{2,\theta^\star}V_{1,\theta^\star}^{-1} \preceq I(q_{\theta^\star,h_{\mathrm{sto}}})^{-1}$. This condition can be rewritten as
$$I(p^{\otimes 2}_{\theta^\star}) + M_{2,\theta^\star} - \Sigma^{\mathrm{SP}}_{\theta^\star} - R^{\mathrm{SPdet}}_{\theta^\star} \preceq \left(I(q_{\theta^\star,h_{\mathrm{sto}}}) - H^{\mathrm{SPdet}}_{\theta^\star}\right)I(q_{\theta^\star,h_{\mathrm{sto}}})^{-1}\left(I(q_{\theta^\star,h_{\mathrm{sto}}}) - H^{\mathrm{SPdet}}_{\theta^\star}\right). \qquad (6)$$
The condition (6) heavily depends on the geometry of $p^{\otimes 2}_{\theta^\star}$ and $\ell_{\theta^\star}$, hence it is unreasonable to assume it in all generality.

In the following, we consider the special case where $H^{\mathrm{SPdet}}_{\theta^\star} = 0_{d \times d}$. This occurs when $\theta \mapsto \ell_\theta$ is linear, e.g., for $\mathcal F_{\mathcal N,\Sigma}$ and $\mathcal F_{\mathrm{Lap},b}$ with preferences based on $r_\theta = \log p_\theta$. Then, the condition (6) rewrites as $R^{\mathrm{SPdet}}_{\theta^\star} + \Sigma^{\mathrm{SPdet}}_{\theta^\star} \succeq 0_{d \times d}$ with $\Sigma^{\mathrm{SPdet}}_{\theta^\star} := 2\Sigma^{\mathrm{SP}}_{\theta^\star} - M_{2,\theta^\star}$. Using that $\min_{x \in \mathbb R}\sigma(|x|) = 1/2$, achieved only at $x = 0$, we directly have, for all $x \in \mathbb R^k$,
$$x^{\mathsf T}\Sigma^{\mathrm{SPdet}}_{\theta^\star}x = \|x\|^2\,\mathbb E_{p^{\otimes 2}_{\theta^\star}}\left[(2\sigma(|\ell_{\theta^\star}|) - 1)\,\sigma(-|\ell_{\theta^\star}|)\,\langle x/\|x\|, \nabla_\theta \ell_{\theta^\star}\rangle^2\right] \ge 0.$$
It is direct to see that this inequality is strict except if $\mathbb P_{p^{\otimes 2}_{\theta^\star}}(\ell_{\theta^\star}\langle x/\|x\|, \nabla_\theta \ell_{\theta^\star}\rangle = 0) = 1$. Therefore, $\Sigma^{\mathrm{SPdet}}_{\theta^\star}$ is a positive definite matrix if $\mathbb P_{p^{\otimes 2}_{\theta^\star}}(|\ell_{\theta^\star}\langle u, \nabla_\theta \ell_{\theta^\star}\rangle| > 0) > 0$ for all $u \in \mathcal S^{k-1}$. Then, a sufficient condition for (6) to hold is that $R^{\mathrm{SPdet}}_{\theta^\star}$ is a p.s.d. matrix, i.e., $R^{\mathrm{SPdet}}_{\theta^\star} \succeq 0_{d \times d}$.

In summary, we have derived sufficient conditions for SPdet to be asymptotically better than SP, namely $H^{\mathrm{SPdet}}_{\theta^\star} = 0_{d \times d}$, $R^{\mathrm{SPdet}}_{\theta^\star} \succeq 0_{d \times d}$ and $\mathbb P_{p^{\otimes 2}_{\theta^\star}}(|\ell_{\theta^\star}\langle u, \nabla_\theta \ell_{\theta^\star}\rangle| > 0) > 0$ for all $u \in \mathcal S^{k-1}$. Note that this last condition is implied by Assumption 4.5.
D. Proofs of Section 4

D.1. Proof of Lemma 4.1

Proof. For Gaussian distributions, this is a direct consequence of the following facts: $\theta^\star \in C_n$, $\widehat\theta^{\mathrm{DP}}_n \in \arg\min_{\theta \in C_n}\|\theta - \widehat\theta^{\mathrm{SO}}_n\|^2_\Sigma$ and $C_n$ is convex. Indeed, $\widehat\theta^{\mathrm{DP}}_n$ is then the $\|\cdot\|_\Sigma$-projection of $\widehat\theta^{\mathrm{SO}}_n$ onto a convex set containing $\theta^\star$, and such projections are non-expansive, hence $\|\widehat\theta^{\mathrm{DP}}_n - \theta^\star\|_\Sigma \le \|\widehat\theta^{\mathrm{SO}}_n - \theta^\star\|_\Sigma$.

D.2. Proof of Lemma 4.6

Proof. Let $u \in \mathcal S^{k-1}$. Let $\alpha_{\theta^\star,u} = \mathbb P_{p^{\otimes 2}_{\theta^\star}}(G_1(\theta^\star, u))$ and $N_{\theta^\star,u} = \sum_{i \in [n]}\mathbb 1((X_i, Y_i) \in G_1(\theta^\star, u)) \sim \mathrm{Bin}(n, \alpha_{\theta^\star,u})$. Let $\widetilde F_{\theta^\star,u}$ be the c.d.f. of $V_{\theta^\star,u}(X, Y)$ when $(X, Y) \sim (p^{\otimes 2}_{\theta^\star})_{|G_1(\theta^\star,u)}$, i.e., $p^{\otimes 2}_{\theta^\star}$ truncated to $G_1(\theta^\star, u)$. Then, $\widetilde F_{\theta^\star,u}(\varepsilon) = F_{\theta^\star,u}(\varepsilon)/\alpha_{\theta^\star,u}$. Let $\widetilde R_{n,u} = \min_{i \in [n],\,(X_i,Y_i)\in G_1(\theta^\star,u)} V_{\theta^\star,u}(X_i, Y_i)$. Using the derivation in Section 4.1, we have $R_{n,u} \le \widetilde R_{n,u}$. Let $\varepsilon > 0$. Conditioned on $N_{\theta^\star,u}$, it is direct to see that
$$\mathbb P(\widetilde R_{n,u} > \varepsilon \mid N_{\theta^\star,u}) = \left(1 - \widetilde F_{\theta^\star,u}(\varepsilon)\right)^{N_{\theta^\star,u}}.$$
Using that $N_{\theta^\star,u} \sim \mathrm{Bin}(n, \alpha_{\theta^\star,u})$, $\mathbb E_{X\sim\mathrm{Bin}(n,p)}[s^X] = (1 - p + ps)^n$ and $1 - x \le \exp(-x)$, we obtain that
$$\mathbb P(R_{n,u} > \varepsilon) \le \mathbb P(\widetilde R_{n,u} > \varepsilon) = \left(1 - \alpha_{\theta^\star,u}\widetilde F_{\theta^\star,u}(\varepsilon)\right)^n \le \exp\left(-n F_{\theta^\star,u}(\varepsilon)\right).$$
Taking $\varepsilon = F^{-1}_{\theta^\star,u}(\min\{1, \log(1/\delta)/n\})$ concludes the proof.

D.3. Proof of Theorem 4.8

Proof. For all $u \in \mathcal S^{k-1}$, let $(A_{\theta^\star}, B_{\theta^\star}, C_{\theta^\star})$ be defined as in Theorem 4.8. Since $C_n \subseteq \widetilde C_n$ under Assumption 4.4, we obtain that $\max_{\theta \in C_n}\|\theta - \theta^\star\| \le \max_{\theta \in \widetilde C_n}\|\theta - \theta^\star\|$.

Case $k = 1$. Since $|\mathcal S^0| = 2$, using Lemma 4.6 with a union bound yields that, with probability at least $1 - \delta$,
$$\max_{\theta \in \widetilde C_n}|\theta - \theta^\star| \le \max_{u \in \mathcal S^0} R_{n,u} \le \max_{u \in \mathcal S^0} F^{-1}_{\theta^\star,u}(\min\{1, \log(2/\delta)/n\}).$$
Under Assumption 4.7, for $n \ge B_{\theta^\star}\log(2/\delta)$, we can conclude the proof since
$$n\max_{\theta \in C_n}|\theta - \theta^\star| \le n\max_{\theta \in \widetilde C_n}|\theta - \theta^\star| \le A_{\theta^\star}\log(2/\delta) + C_{\theta^\star}\log(2/\delta)^2/n.$$

Case $k > 1$. Let $N(\gamma)$ be the $\gamma$-covering number of $\Theta$ for the norm $\|\cdot\|$. Let $\{\theta_j\}_{j \in [N(\gamma)]}$ be such a $\gamma$-covering. For all $j \in [N(\gamma)]$, let $\varepsilon_j = \|\theta_j - \theta^\star\|$ and $u_j = (\theta_j - \theta^\star)/\varepsilon_j$. Using the triangle inequality, we obtain
$$\max_{\theta \in \widetilde C_n}\|\theta - \theta^\star\| \le \gamma + \max_{j \in [N(\gamma)],\,\theta_j \in \widetilde C_n}\|\theta_j - \theta^\star\| \le \gamma + \max_{j \in [N(\gamma)]}\mathbb 1\left(\theta^\star + \varepsilon_j u_j \in \widetilde C_n\right)\varepsilon_j \le \gamma + \max_{j \in [N(\gamma)]} R_{n,u_j}.$$
Using Lemma 4.6 with a union bound yields that, with probability at least $1 - \delta$,
$$n\max_{\theta \in C_n}\|\theta - \theta^\star\| \le n\gamma + \max_{j \in [N(\gamma)]} n F^{-1}_{\theta^\star,u_j}\left(\log(N(\gamma)/\delta)/n\right) \le n\gamma + A_{\theta^\star}\log(N(\gamma)/\delta) + C_{\theta^\star}\log(N(\gamma)/\delta)^2/n,$$
where the last inequality relies on Assumption 4.7 for $n \ge B_{\theta^\star}\log(N(\gamma)/\delta)$.
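The proof mechanics of Lemma 4.6 are easy to visualize numerically: for $\mathcal F_{\mathcal N,1}$ and $u = 1$, $R_{n,u}$ is the smallest positive margin $(X_i + Y_i)/2 - \theta^\star$, and the bound $\mathbb P(R_{n,u} > \varepsilon) \le \exp(-nF_{\theta^\star,u}(\varepsilon))$ translates into $R_{n,u} = O(1/n)$. A minimal Julia sketch, with helper names and parameter values ours:

```julia
using Random

# For N(θ⋆, 1) with u = +1: G1(θ⋆, 1) = {(x, y) : (x + y)/2 > θ⋆} (a.s.) and
# V_{θ⋆,1}(x, y) = (x + y)/2 − θ⋆, so R_{n,1} is the smallest positive margin.
function smallest_margin(θ, n; rng = Random.default_rng())
    r = Inf
    for _ in 1:n
        v = (θ + randn(rng) + θ + randn(rng)) / 2 - θ   # (X + Y)/2 − θ⋆
        v > 0 && (r = min(r, v))
    end
    r
end

# Empirical check that n ⋅ R_{n,1} stays bounded, i.e., R_{n,1} = O(1/n).
rng = Random.MersenneTwister(0)
for n in (10^2, 10^3, 10^4)
    rs = [smallest_margin(0.0, n; rng = rng) for _ in 1:200]
    println((n, n * sum(rs) / length(rs)))   # roughly constant across n
end
```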
E. Proofs of Section 5

E.1. Proof of Lemma 5.1

Proof. It is direct to see that
$$q_{\theta^\star}(x, y, z)\,q_\theta(x, y, z) = \begin{cases} 0 & \text{if } (x, y) \in D(\theta^\star, \theta) \cup G_0(\theta^\star)^{\mathsf c} \cup G_0(\theta)^{\mathsf c}, \\ p_{\theta^\star}(x)p_{\theta^\star}(y)p_\theta(x)p_\theta(y) & \text{otherwise}. \end{cases}$$
Therefore, we have
$$\mathrm{BC}(q_{\theta^\star}, q_\theta) = \int_{(x,y)\notin D(\theta^\star,\theta)\cup G_0(\theta^\star)^{\mathsf c}\cup G_0(\theta)^{\mathsf c}} \sqrt{p_{\theta^\star}(x)p_{\theta^\star}(y)p_\theta(x)p_\theta(y)}\,\mathrm dx\,\mathrm dy = \mathrm{BC}(p^{\otimes 2}_{\theta^\star}, p^{\otimes 2}_\theta) - \widetilde{\mathrm{BC}}(\theta^\star, \theta).$$
Using that $H^2(P, Q) = 1 - \mathrm{BC}(P, Q)$, we conclude the proof.

E.2. Proof of Theorem 5.3

Consider the hypercube $\Theta_\delta = \{\theta_b = \delta b : b \in \{0,1\}^k\} \subseteq \Theta$. Note that $\|\theta_b - \theta_{b'}\| = \delta\sqrt{d_H(b, b')}$, where $d_H(b, b')$ denotes the Hamming distance between $b$ and $b'$. Then, using Assouad's lemma, we have
$$\max_{b \in \{0,1\}^k}\mathbb E_{q^{\otimes n}_{\theta_b}}\left[\|\widehat\theta_n - \theta_b\|\right] \ge \frac{k\delta}{2}\left(1 - \max_{d_H(b,b')=1}\mathrm{TV}\left(q^{\otimes n}_{\theta_b}, q^{\otimes n}_{\theta_{b'}}\right)\right).$$
Upper bounding the TV distance with the Hellinger distance, Lemma 5.1 yields $\mathrm{TV}(q^{\otimes n}_{\theta_b}, q^{\otimes n}_{\theta_{b'}}) \le \sqrt{n(\widetilde{\mathrm{BC}}(\theta_b, \theta_{b'}) + H^2(p^{\otimes 2}_{\theta_b}, p^{\otimes 2}_{\theta_{b'}}))}$. Then, Assumption 5.2 implies
$$\mathrm{TV}\left(q^{\otimes n}_{\theta_b}, q^{\otimes n}_{\theta_{b'}}\right) \le \sqrt{n\left(c_1\alpha_{\mathcal F}(k)\,\delta + 2c_2\delta^2\right)}.$$
Picking $\delta = \frac{1}{2(c_1 + 2c_2)}\min\{\frac{1}{\alpha_{\mathcal F}(k)\,n}, \frac{1}{\sqrt n}\}$ ensures that the term in parentheses is always greater than $1/2$, hence
$$\max_{b \in \{0,1\}^k}\mathbb E_{q^{\otimes n}_{\theta_b}}\left[\|\widehat\theta_n - \theta_b\|\right] \ge \frac{k}{8(c_1 + 2c_2)}\min\left\{\frac{1}{\alpha_{\mathcal F}(k)\,n}, \frac{1}{\sqrt n}\right\}.$$

F. Multivariate Gaussian with Known Covariance

In the following, $\theta^\star = \Sigma^{-1}\mu^\star$ denotes the natural parameter of a multivariate Gaussian with known covariance matrix $\Sigma$. We have $\mathcal X = \mathbb R^d$ and $k = d$. Let $\theta^\star \in \Theta$ and $u$ be a unit vector for the norm $\|\cdot\|_\Sigma$, i.e., $\|u\|_\Sigma = 1$. Let $\mathcal S_{2,d-1} = \{x \in \mathbb R^d \mid \|x\|_2 = 1\}$. It is direct to see that
$$\ell_\theta(x, y) = \log\frac{p_\theta(x)}{p_\theta(y)} = \langle x - y, \theta - \Sigma^{-1}(x + y)/2\rangle \quad \text{and} \quad \nabla_\theta \ell_\theta(x, y) = x - y.$$
Therefore, we have
$$G_0(\theta^\star) = \{(x, y) \in (\mathbb R^d)^2 \mid |\langle x - y, \theta^\star - \Sigma^{-1}(x + y)/2\rangle| > 0\},$$
$$G_1(\theta^\star) = \{(x, y) \in G_0(\theta^\star) \mid \|x - y\| > 0\},$$
$$D(\theta^\star, \theta) = \{(x, y) \in (\mathbb R^d)^2 \mid \langle x - y, \theta^\star - \Sigma^{-1}(x + y)/2\rangle^2 + \langle x - y, \theta^\star - \Sigma^{-1}(x + y)/2\rangle\langle \theta - \theta^\star, x - y\rangle < 0\},$$
$$G_1(\theta^\star, u) = \{(x, y) \in (\mathbb R^d)^2 \mid \langle x - y, \theta^\star - \Sigma^{-1}(x + y)/2\rangle\langle u, x - y\rangle < 0\},$$
$$\forall (x, y) \in G_1(\theta^\star, u), \quad V_{\theta^\star,u}(x, y) = \frac{\langle x - y, \Sigma^{-1}(x + y)/2 - \theta^\star\rangle}{\langle u, x - y\rangle}.$$

Figure 3. Approximations of $\Sigma^{\mathrm{SP}}_{\mathcal N(0_d,I_d)}$, $\Sigma^{\mathrm{SPdet}}_{\mathcal N(0_d,I_d)}$ and $R^{\mathrm{SPdet}}_{\mathcal N(0_d,I_d)}$ by (a) $\alpha_d I_d$ and (b) the associated error for varying $d$. $N_{\mathrm{runs}} = 10^6$.

Proof that $\mathbb P_{p^{\otimes 2}_{\theta^\star}}(G_1(\theta^\star)) > 0$. It is direct to see that $\dim(G_0(\theta^\star)^{\mathsf c}) < 2d$ and $\dim(G_0(\theta^\star)\setminus G_1(\theta^\star)) < 2d$. Given that $p^{\otimes 2}_{\theta^\star}$ is a continuous distribution on $(\mathbb R^d)^2$, we obtain that $\mathbb P_{p^{\otimes 2}_{\theta^\star}}(G_1(\theta^\star)) = \mathbb P_{p^{\otimes 2}_{\theta^\star}}(G_0(\theta^\star)) = 1$.

Condition in Lemma 3.1. The condition of Lemma 3.1 is implied by Assumption 4.5, hence we refer to the proof of this result below. Therefore, we have $I(q_{\theta^\star,h_{\mathrm{sto}}}) \succ I(p^{\otimes 2}_{\theta^\star})$.

Consistency of SPdet. To study SPdet for $\mathcal F_{\mathcal N,\Sigma}$, we use the change of variables $D = \Sigma^{-1/2}(X - Y)/\sqrt 2$ and $S = \sqrt 2\,\Sigma^{-1/2}(\Sigma\theta^\star - (X + Y)/2)$. Then, we have $(D, S) \sim \mathcal N(0_{2d}, I_{2d})$ and
$$\ell_{\theta^\star}(X, Y) = \langle S, D\rangle, \quad \nabla_\theta \ell_{\theta^\star}(X, Y) = \sqrt 2\,\Sigma^{1/2}D, \quad \nabla_\theta \log p^{\otimes 2}_{\theta^\star}(X, Y) = -\sqrt 2\,\Sigma^{1/2}S.$$
Let $M(D, S) = \mathrm{sign}(\langle D, S\rangle)\,\sigma(-|\langle D, S\rangle|)\,D$. Then, $M(-(D, S)) = -M(D, S)$ for all $(D, S) \in \mathbb R^{2d}$. By integration of an odd function against a distribution that is symmetric around $0_{2d}$, we obtain $\mathbb E_{(D,S)\sim\mathcal N(0_{2d},I_{2d})}[M(D, S)] = 0_d$. Therefore, the condition (3) is satisfied and SPdet is a consistent estimator.

Asymptotic variance of SPdet. Let $H^{\mathrm{SPdet}}_{\theta^\star}$ and $R^{\mathrm{SPdet}}_{\theta^\star}$ be defined as in Lemma 3.2. Since $\ell_\theta(x, y) = \langle x - y, \theta - \Sigma^{-1}(x + y)/2\rangle$ is linear in $\theta$, we have $\nabla^2_\theta \ell_\theta = 0_{d\times d}$ and $H^{\mathrm{SPdet}}_{\theta^\star} = 0_{d\times d}$. The condition $\mathbb P_{p^{\otimes 2}_{\theta^\star}}(|\ell_{\theta^\star}\langle u, \nabla_\theta \ell_{\theta^\star}\rangle| > 0) > 0$ for all $u \in \mathcal S^{d-1}$ is implied by Assumption 4.5, hence we refer to the proof of this result below. Then, the condition $R^{\mathrm{SPdet}}_{\theta^\star} \succeq 0_{d\times d}$ is equivalent to $M_3 \succeq 0_{d\times d}$ where
$$M_3 = \mathbb E_{(D,S)\sim\mathcal N(0_{2d},I_{2d})}\left[\mathrm{sign}(\langle D, S\rangle)\,\sigma(-|\langle D, S\rangle|)\left(DS^{\mathsf T} + SD^{\mathsf T}\right)\right].$$
When $d = 1$, we have $M_3 = 2\,\mathbb E_{(D,S)\sim\mathcal N(0_2,I_2)}[\sigma(-|DS|)\,|DS|] > 0$. When $d > 1$, for all $u \in \mathcal S_{2,d-1}$, we have
$$u^{\mathsf T}M_3 u = 2\,\mathbb E_{(D,S)\sim\mathcal N(0_{2d},I_{2d})}\left[\mathrm{sign}(\langle D, S\rangle)\,\sigma(-|\langle D, S\rangle|)\,\langle u, D\rangle\langle u, S\rangle\right].$$
By rotational symmetry of $\mathcal N(0_{2d}, I_{2d})$ and of the function to be integrated, showing that $\min_{u}u^{\mathsf T}M_3 u \ge 0$ is equivalent to showing that $e_1^{\mathsf T}M_3 e_1 \ge 0$, i.e., $\mathbb E_{(D,S)\sim\mathcal N(0_{2d},I_{2d})}[\mathrm{sign}(\langle D, S\rangle)\,\sigma(-|\langle D, S\rangle|)\,D_1 S_1] \ge 0$. By symmetry, we conjecture that $\Sigma^{\mathrm{SP}}_{\mathcal N(0_d,I_d)}$, $\Sigma^{\mathrm{SPdet}}_{\mathcal N(0_d,I_d)}$ and $R^{\mathrm{SPdet}}_{\mathcal N(0_d,I_d)}$ are of the form $\alpha_d I_d$ where $\alpha_d > 0$ is decreasing in $d$. Figure 3 validates this conjecture numerically. Using the sufficient condition derived in Appendix C.2, we have shown that SPdet is asymptotically better than SP.
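The positive-definiteness conjecture behind Figure 3 can be probed with a few lines of Monte Carlo. The following minimal Julia sketch estimates $M_3$ (recall that $R^{\mathrm{SPdet}}_{\mathcal N(0_d,I_d)} \succeq 0_{d\times d}$ is equivalent to $M_3 \succeq 0_{d\times d}$), with function names ours and the number of runs reduced for brevity:

```julia
using LinearAlgebra, Random

σ(x) = 1 / (1 + exp(-x))

# Monte Carlo estimate of M3 = E[sign(⟨D,S⟩) σ(−|⟨D,S⟩|) (DSᵀ + SDᵀ)]
# for (D, S) ~ N(0_{2d}, I_{2d}).
function estimate_M3(d; n = 10^5, rng = Random.default_rng())
    M = zeros(d, d)
    for _ in 1:n
        D, S = randn(rng, d), randn(rng, d)
        c = sign(dot(D, S)) * σ(-abs(dot(D, S)))
        M .+= c .* (D * S' .+ S * D')
    end
    M ./ n
end

for d in (1, 2, 5)
    M3 = estimate_M3(d)
    # diagonal ≈ α_d > 0 and decreasing in d; off-diagonal entries ≈ 0
    println((d, M3[1, 1], minimum(eigvals(Symmetric(M3)))))
end
```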
Proof of Assumption 4.4. Since $\ell_\theta(x, y) = \langle x - y, \theta - \Sigma^{-1}(x + y)/2\rangle$ is linear in $\theta$, we have $D(\theta^\star, \theta) = \widetilde D(\theta^\star, \theta)$.

Proof of Assumption 4.5. For $(X, Y) \sim p^{\otimes 2}_{\theta^\star}$, let $D = \Sigma^{-1/2}(X - Y)/\sqrt 2$ and $S = \sqrt 2\,\Sigma^{-1/2}(\Sigma\theta^\star - (X + Y)/2)$. Then, we have $(D, S) \sim \mathcal N(0_{2d}, I_{2d})$. Defining $U = D/\|D\|_2$, we have $U \sim \mathcal U(\mathcal S_{2,d-1})$ independent of $S$. Since $\theta^\star - \Sigma^{-1}(X + Y)/2 = \Sigma^{-1/2}S/\sqrt 2$, we obtain
$$\mathbb P_{(X,Y)\sim p^{\otimes 2}_{\theta^\star}}\left((X, Y) \in G_1(\theta^\star, u)\right) = \mathbb P_{(U,S)\sim\mathcal U(\mathcal S_{2,d-1})\otimes\mathcal N(0_d,I_d)}\left(\langle U, S\rangle\,\langle \Sigma^{1/2}u, U\rangle < 0\right) = \frac12,$$
where we used that, conditioned on $U$, $\langle U, S\rangle \sim \mathcal N(0, 1)$ and $\mathbb P_{X\sim\mathcal N(0,1)}(X < 0) = \mathbb P_{X\sim\mathcal N(0,1)}(X > 0) = 1/2$, together with $\mathbb P_U(\langle \Sigma^{1/2}u, U\rangle = 0) = 0$ by symmetry of the uniform distribution. Therefore, we have shown that $\mathbb P_{p^{\otimes 2}_{\theta^\star}}(G_1(\theta^\star, u)) = 1/2$ for all unit vectors $u$.

Proof of Assumption 4.7. Let us define $v = \Sigma^{1/2}u$, a unit vector for $\|\cdot\|_2$. Let $\Phi$ denote the c.d.f. of $\mathcal N(0, 1)$ and $\operatorname{erf}(x) = 2\Phi(x\sqrt 2) - 1$ be the error function. Let $\varepsilon > 0$. Similarly as above, by conditioning on $U$, we obtain that
$$F_{\theta^\star,u}(\varepsilon) = \mathbb P_{(X,Y)\sim p^{\otimes 2}_{\theta^\star}}\left(0 < V_{\theta^\star,u}(X, Y) \le \varepsilon\right) = \frac12\,\mathbb E_{U\sim\mathcal U(\mathcal S_{2,d-1})}\left[2\Phi\left(\sqrt 2\,\varepsilon\,|\langle v, U\rangle|\right) - 1\right] = \frac12\,\mathbb E_U\left[\operatorname{erf}\left(\varepsilon\,|\langle v, U\rangle|\right)\right].$$
By a change of variables, we obtain that
$$F_{\theta^\star,u}(\varepsilon) = \frac{1}{\sqrt\pi}\,\mathbb E_U\left[\int_0^{\varepsilon|\langle v,U\rangle|} e^{-t^2}\,\mathrm dt\right] = \frac{\varepsilon}{\sqrt\pi}\,\mathbb E_U\left[|\langle v, U\rangle|\int_0^1 e^{-x^2\varepsilon^2\langle v,U\rangle^2}\,\mathrm dx\right].$$
Using that $1 - x^2 \le e^{-x^2} \le 1$, we obtain
$$\frac{\varepsilon}{\sqrt\pi}\left(\mathbb E_U[|\langle v, U\rangle|] - \frac{\varepsilon^2}{3}\,\mathbb E_U[|\langle v, U\rangle|^3]\right) \le F_{\theta^\star,u}(\varepsilon) \le \frac{\varepsilon}{\sqrt\pi}\,\mathbb E_U[|\langle v, U\rangle|].$$
Using that
$$\int_0^1 x\left(-2x\varepsilon^2\langle v, U\rangle^2\right)e^{-x^2\varepsilon^2\langle v,U\rangle^2}\,\mathrm dx = e^{-\varepsilon^2\langle v,U\rangle^2} - \int_0^1 e^{-x^2\varepsilon^2\langle v,U\rangle^2}\,\mathrm dx,$$
we obtain
$$F'_{\theta^\star,u}(\varepsilon) = \frac{1}{\sqrt\pi}\,\mathbb E_U\left[|\langle v, U\rangle|\,e^{-\varepsilon^2\langle v,U\rangle^2}\right] \quad \text{and} \quad F''_{\theta^\star,u}(\varepsilon) = -\frac{2\varepsilon}{\sqrt\pi}\,\mathbb E_U\left[|\langle v, U\rangle|^3\,e^{-\varepsilon^2\langle v,U\rangle^2}\right].$$
Therefore, using Lemma F.1, we have
$$F'_{\theta^\star,u}(0) = \frac{1}{\sqrt\pi}\,\mathbb E_U[|\langle v, U\rangle|] = \frac{2}{d - 1}\frac{\Gamma(d/2)}{\pi\,\Gamma((d-1)/2)} = \frac{\sqrt 2}{\pi\sqrt d}\left(1 + O(1/d)\right).$$
Let us define
$$\varepsilon_{\theta^\star,u} = \sqrt{\frac{\mathbb E_U[|\langle v, U\rangle|]}{2\,\mathbb E_U[|\langle v, U\rangle|^3]}} \quad \text{and} \quad M_{\theta^\star,u} = 4\pi\sqrt{\frac{\mathbb E_U[|\langle v, U\rangle|^3]}{\mathbb E_U[|\langle v, U\rangle|]^5}}.$$
Then, for all $\varepsilon \in (0, \varepsilon_{\theta^\star,u}]$, we obtain that
$$\frac{|F''_{\theta^\star,u}(\varepsilon)|}{F'_{\theta^\star,u}(\varepsilon)^3} = 2\pi\varepsilon\,\frac{\mathbb E_U[|\langle v, U\rangle|^3 e^{-\varepsilon^2\langle v,U\rangle^2}]}{\mathbb E_U[|\langle v, U\rangle| e^{-\varepsilon^2\langle v,U\rangle^2}]^3} \le 2\pi\varepsilon\,\frac{\mathbb E_U[|\langle v, U\rangle|^3]}{\left(\mathbb E_U[|\langle v, U\rangle|] - \varepsilon^2\,\mathbb E_U[|\langle v, U\rangle|^3]\right)^3} \le M_{\theta^\star,u}.$$
Since we have
$$(F^{-1}_{\theta^\star,u})''(x) = -\frac{F''_{\theta^\star,u}(F^{-1}_{\theta^\star,u}(x))}{F'_{\theta^\star,u}(F^{-1}_{\theta^\star,u}(x))^3},$$
we obtain $\sup_{x\in(0,x_{\theta^\star,u}]}|(F^{-1}_{\theta^\star,u})''(x)| \le M_{\theta^\star,u}$ where
$$x_{\theta^\star,u} = F_{\theta^\star,u}(\varepsilon_{\theta^\star,u}) \ge \sqrt{\frac{\mathbb E_U[|\langle \Sigma^{1/2}u, U\rangle|]^3}{2\pi\,\mathbb E_U[|\langle \Sigma^{1/2}u, U\rangle|^3]}}.$$

Proof of Assumption 4.2. Let $\varepsilon = \|\theta - \theta^\star\|_\Sigma$ and $u = (\theta - \theta^\star)/\varepsilon$. Then, we have
$$\mathbb P_{p^{\otimes 2}_{\theta^\star}}(D(\theta^\star, \theta)) = \mathbb P_{p^{\otimes 2}_{\theta^\star}}(D(\theta^\star, \theta^\star + \varepsilon u)) \ge \mathbb P_{p^{\otimes 2}_{\theta^\star}}(D(\theta^\star, \theta^\star + \varepsilon u) \cap G_1(\theta^\star, u)) = \mathbb P_{(X,Y)\sim p^{\otimes 2}_{\theta^\star}}(0 < V_{\theta^\star,u}(X, Y) < \varepsilon).$$
Using the above computation, we obtain that $\mathbb P_{(X,Y)\sim p^{\otimes 2}_{\theta^\star}}(0 < V_{\theta^\star,u}(X, Y) < \varepsilon) > 0$, hence $\mathbb P_{p^{\otimes 2}_{\theta^\star}}(D(\theta^\star, \theta)) > 0$.

Proof of Assumption 5.2. Using that $1 - e^{-x} \le x$, we obtain
$$H^2(p_{\theta^\star}, p_\theta) = 1 - \exp\left(-\frac18\|\theta - \theta^\star\|^2_\Sigma\right) \le \frac18\|\theta - \theta^\star\|^2_\Sigma.$$
First, we notice that $\dim(G_0(\theta^\star)^{\mathsf c} \cup G_0(\theta)^{\mathsf c}) < 2d$, hence we can show that
$$\int_{(x,y)\in G_0(\theta^\star)^{\mathsf c}\cup G_0(\theta)^{\mathsf c}}\sqrt{p_{\theta^\star}(x)p_{\theta^\star}(y)p_\theta(x)p_\theta(y)}\,\mathrm dx\,\mathrm dy = 0.$$
Second, we see that
$$\|x - \Sigma\theta^\star\|^2_{\Sigma^{-1}} + \|y - \Sigma\theta^\star\|^2_{\Sigma^{-1}} + \|x - \Sigma\theta\|^2_{\Sigma^{-1}} + \|y - \Sigma\theta\|^2_{\Sigma^{-1}} = \|x - y\|^2_{\Sigma^{-1}} + \|\theta - \theta^\star\|^2_\Sigma + \|x + y - \Sigma(\theta + \theta^\star)\|^2_{\Sigma^{-1}}.$$
Then, we consider the change of variables $u = \Sigma^{-1/2}(x - y)$ and $v = \Sigma^{-1/2}(x + y)$, whose Jacobian has $\det(\Sigma)\,2^{-d}$ as the absolute value of its determinant, followed by the change of variables $\tilde u = u/\sqrt 2$ and $\tilde v = (v - \Sigma^{1/2}(\theta + \theta^\star))/\sqrt 2$, whose Jacobian has determinant $2^d$. Re-using the computations done previously with $\varepsilon = \|\theta - \theta^\star\|_\Sigma$ and $u = (\theta - \theta^\star)/\varepsilon$, this yields
$$e^{\frac14\|\theta - \theta^\star\|^2_\Sigma}\,\widetilde{\mathrm{BC}}(\theta^\star, \theta) = \mathbb E_{U\sim\mathcal U(\mathcal S_{2,d-1})}\left[\operatorname{erf}\left(\varepsilon\,|\langle \Sigma^{1/2}u, U\rangle|\right)\right] = 2F_{\theta^\star,u}(\varepsilon).$$
Using Lemma F.1 and the above upper bound on $F_{\theta^\star,u}(\varepsilon)$, we obtain
$$\widetilde{\mathrm{BC}}(\theta^\star, \theta) = 2e^{-\varepsilon^2/4}F_{\theta^\star,u}(\varepsilon) \le \frac{4}{d - 1}\frac{\Gamma(d/2)}{\pi\,\Gamma((d-1)/2)}\,\|\theta - \theta^\star\|_\Sigma.$$

Lemma F.1. Let $\Gamma$ denote the Gamma function. Then, for all $u \in \mathcal S_{2,d-1}$,
$$\mathbb E_{U\sim\mathcal U(\mathcal S_{2,d-1})}[|\langle u, U\rangle|] = \frac{2}{d - 1}\frac{\Gamma(d/2)}{\sqrt\pi\,\Gamma((d-1)/2)} = \sqrt{\frac{2}{\pi d}}\left(1 + O(1/d)\right).$$

Proof. Due to the rotational symmetry of the distribution, for any unit vector $u$,
$$\mathbb E_U[|\langle u, U\rangle|] = \mathbb E_U[|\langle e_1, U\rangle|] = \mathbb E_U[|U_1|].$$
The density of $U_1$ is given by $f_{U_1}(x) = \frac{\Gamma(d/2)}{\sqrt\pi\,\Gamma((d-1)/2)}(1 - x^2)^{\frac{d-3}{2}}$ for $x \in [-1, 1]$, and the expectation can be computed as
$$\mathbb E|U_1| = \int_{-1}^1 |x|\,f_{U_1}(x)\,\mathrm dx = \frac{2\Gamma(d/2)}{\sqrt\pi\,\Gamma((d-1)/2)}\int_0^1 x(1 - x^2)^{\frac{d-3}{2}}\,\mathrm dx = \frac{\Gamma(d/2)}{\sqrt\pi\,\Gamma((d-1)/2)}\int_0^1 (1 - w)^{\frac{d-3}{2}}\,\mathrm dw = \frac{2}{d - 1}\frac{\Gamma(d/2)}{\sqrt\pi\,\Gamma((d-1)/2)},$$
using the substitution $w = x^2$. Therefore, for large $d$, $\mathbb E|U_1| = \sqrt{2/(\pi d)}\,(1 + O(1/d))$.
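The closed form of Lemma F.1 is convenient to sanity-check by simulation; below is a minimal Julia sketch, assuming the SpecialFunctions.jl package for the Gamma function, with helper names ours:

```julia
using LinearAlgebra, Random, SpecialFunctions

# Monte Carlo estimate of E|⟨u, U⟩| for U uniform on the unit sphere S_{2,d−1}.
function mc_abs_coordinate(d; n = 10^6, rng = Random.default_rng())
    acc = 0.0
    for _ in 1:n
        x = randn(rng, d)
        acc += abs(x[1]) / norm(x)   # U = x/‖x‖ is uniform on the sphere
    end
    acc / n
end

closed_form(d) = 2 / (d - 1) * gamma(d / 2) / (sqrt(π) * gamma((d - 1) / 2))

for d in (2, 5, 20, 100)
    println((d, mc_abs_coordinate(d), closed_form(d), sqrt(2 / (π * d))))
end
```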
G. Laplace with Known Scale

In the following, $\theta^\star$ denotes the mean parameter of a Laplace distribution with known scale $b$. We have $\mathcal X = \mathbb R$ and $k = d = 1$. Let $\theta^\star \in \Theta$ and $u \in \{\pm 1\}$. It is direct to see that
$$\ell_\theta(x, y) = \log\frac{p_\theta(x)}{p_\theta(y)} = \frac{|y - \theta| - |x - \theta|}{b} = \frac1b\begin{cases} y - x & \text{if } \theta < \min\{x, y\}, \\ x - y & \text{if } \theta > \max\{x, y\}, \\ (2\theta - (x + y))\,\mathrm{sign}(x - y) & \text{if } \theta \in [\min\{x, y\}, \max\{x, y\}], \end{cases}$$
and
$$\frac{\mathrm d\ell_\theta}{\mathrm d\theta}(x, y) = \begin{cases} 0 & \text{if } \theta < \min\{x, y\} \text{ or } \theta > \max\{x, y\}, \\ \frac2b\,\mathrm{sign}(x - y) & \text{if } \theta \in (\min\{x, y\}, \max\{x, y\}). \end{cases}$$
Therefore, we have
$$G_0(\theta^\star) = \{(x, y) \in \mathbb R^2 \mid ||y - \theta^\star| - |x - \theta^\star|| > 0\},$$
$$G_1(\theta^\star) = \{(x, y) \in \mathbb R^2 \mid \theta^\star \in [\min\{x, y\}, \max\{x, y\}]\},$$
$$G_1(\theta^\star, u) = \{(x, y) \in \mathbb R^2 \mid \theta^\star \in [\min\{x, y\}, \max\{x, y\}] \wedge u((x + y)/2 - \theta^\star) > 0\},$$
$$\widetilde D(\theta^\star, \theta) = \{(x, y) \in \mathbb R^2 \mid \theta^\star \in [\min\{x, y\}, \max\{x, y\}] \wedge 0 < \mathrm{sign}(\theta - \theta^\star)((x + y)/2 - \theta^\star) < |\theta - \theta^\star|\},$$
$$\forall (x, y) \in G_1(\theta^\star, u), \quad V_{\theta^\star,u}(x, y) = u((x + y)/2 - \theta^\star).$$
When $\theta > \theta^\star$, we have
$$D(\theta^\star, \theta) = \{(x, y) \mid \{\theta^\star, \theta\} \subset [\min\{x, y\}, \max\{x, y\}] \wedge \theta^\star < (x + y)/2 < \theta\} \cup \{(x, y) \mid \theta^\star < \min\{x, y\} \wedge \theta \in ((x + y)/2, \max\{x, y\}]\}$$
$$\cup\ \{(x, y) \mid \theta^\star < \min\{x, y\} \wedge \theta > \max\{x, y\}\} \cup \{(x, y) \mid \theta > \max\{x, y\} \wedge \theta^\star \in [\min\{x, y\}, (x + y)/2)\}.$$
When $\theta < \theta^\star$, we have
$$D(\theta^\star, \theta) = \{(x, y) \mid \{\theta^\star, \theta\} \subset [\min\{x, y\}, \max\{x, y\}] \wedge \theta < (x + y)/2 < \theta^\star\} \cup \{(x, y) \mid \theta^\star > \max\{x, y\} \wedge \theta \in [\min\{x, y\}, (x + y)/2)\}$$
$$\cup\ \{(x, y) \mid \theta < \min\{x, y\} \wedge \theta^\star > \max\{x, y\}\} \cup \{(x, y) \mid \theta < \min\{x, y\} \wedge \theta^\star \in ((x + y)/2, \max\{x, y\}]\}.$$

Proof that $\mathbb P_{p^{\otimes 2}_{\theta^\star}}(G_1(\theta^\star)) > 0$. It is direct to see that $\dim(G_0(\theta^\star)^{\mathsf c}) < 2$. Given that $p^{\otimes 2}_{\theta^\star}$ is a continuous distribution on $\mathbb R^2$, we obtain that $\mathbb P_{p^{\otimes 2}_{\theta^\star}}(G_0(\theta^\star)) = 1$. Using the symmetry of the Laplace distribution around its mean, we have that
$$\mathbb P_{p^{\otimes 2}_{\theta^\star}}(G_1(\theta^\star)) = \mathbb P_{p^{\otimes 2}_{\theta^\star}}\left((-\infty, \theta^\star) \times (\theta^\star, +\infty)\right) + \mathbb P_{p^{\otimes 2}_{\theta^\star}}\left((\theta^\star, +\infty) \times (-\infty, \theta^\star)\right) = 1/2.$$

Condition in Lemma 3.1. The condition of Lemma 3.1 is implied by Assumption 4.5, hence we refer to the proof of this result below. Therefore, we have $I(q_{\theta^\star,h_{\mathrm{sto}}}) \succ I(p^{\otimes 2}_{\theta^\star})$.

Consistency of SPdet. To study SPdet for $\mathcal F_{\mathrm{Lap},b}$, we use the change of variables $D = \theta^\star - X$ and $S = \theta^\star - Y$, and write $G_1(0) = \{(d, s) \mid ds \le 0\}$, so that $(X, Y) \in G_1(\theta^\star)$ if and only if $(D, S) \in G_1(0)$. For all $(D, S) \in G_1(0)$, we have
$$\ell_{\theta^\star}(X, Y) = \frac1b(D + S)\,\mathrm{sign}(S - D), \quad \partial_\theta \ell_{\theta^\star}(X, Y) = \frac2b\,\mathrm{sign}(S - D), \quad \partial_\theta \log p^{\otimes 2}_{\theta^\star}(X, Y) = -\frac{\mathrm{sign}(D) + \mathrm{sign}(S)}{b} = 0.$$
For all $(D, S) \notin G_1(0)$, we have $\partial_\theta \ell_{\theta^\star}(X, Y) = 0$. Let $M(D, S) = \mathbb 1((D, S) \in G_1(0))\,\sigma(-|D + S|/b)\,\mathrm{sign}(D + S)$. Then, $M(-(D, S)) = -M(D, S)$ for all $(D, S) \in \mathbb R^2$. By integration of an odd function against a distribution that is symmetric around $0_2$, we obtain $\mathbb E_{(D,S)\sim\mathrm{Lap}(0,b)^{\otimes 2}}[M(D, S)] = 0$. Therefore, the condition (3) is satisfied and SPdet is a consistent estimator.

Asymptotic variance of SPdet. Let $H^{\mathrm{SPdet}}_{\theta^\star}$ and $R^{\mathrm{SPdet}}_{\theta^\star}$ be defined as in Lemma 3.2. By definition of $\ell_\theta$, we obtain $\partial^2_\theta \ell_\theta = 0$ almost surely and $H^{\mathrm{SPdet}}_{\theta^\star} = 0$. Moreover, using the above formulas, we have $\partial_\theta \ell_{\theta^\star}(X, Y)\,\partial_\theta \log p^{\otimes 2}_{\theta^\star}(X, Y) = 0$ for all $(D, S) \in \mathbb R^2$, hence we obtain $R^{\mathrm{SPdet}}_{\theta^\star} = 0$. The condition $\mathbb P_{p^{\otimes 2}_{\theta^\star}}(|\ell_{\theta^\star}\,u\,\partial_\theta \ell_{\theta^\star}| > 0) > 0$ for all $u \in \{\pm 1\}$ is implied by Assumption 4.5, hence we refer to the proof of this result below. Using the sufficient condition derived in Appendix C.2, we have shown that SPdet is asymptotically better than SP.

Proof of Assumption 4.4. Using that $\widetilde D(\theta^\star, \theta) \subseteq G_1(\theta^\star)$, we simply need to show that $\widetilde D(\theta^\star, \theta) = G_1(\theta^\star) \cap D(\theta^\star, \theta) \subseteq D(\theta^\star, \theta)$. Let us consider the case $\theta > \theta^\star$. Then, we have
$$\widetilde D(\theta^\star, \theta) = \{(x, y) \in \mathbb R^2 \mid \theta^\star \in [\min\{x, y\}, \max\{x, y\}] \wedge \theta^\star < (x + y)/2 < \theta\}$$
$$= \{(x, y) \mid \{\theta^\star, \theta\} \subset [\min\{x, y\}, \max\{x, y\}] \wedge \theta^\star < (x + y)/2 < \theta\} \cup \{(x, y) \mid \theta > \max\{x, y\} \wedge \theta^\star \in [\min\{x, y\}, (x + y)/2)\}$$
$$= G_1(\theta^\star) \cap D(\theta^\star, \theta).$$
The same result follows when $\theta < \theta^\star$ by the same argument. In summary, we have shown that $\widetilde D(\theta^\star, \theta) = G_1(\theta^\star) \cap D(\theta^\star, \theta) \subseteq D(\theta^\star, \theta)$.
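The piecewise expression of $\ell_\theta$ above translates directly into code; below is a minimal Julia sketch (function names ours) that also sanity-checks it against the defining expression $(|y - \theta| - |x - \theta|)/b$:

```julia
# Log-likelihood reward ℓθ(x, y) = (|y − θ| − |x − θ|)/b for Lap(θ, b),
# written in the piecewise form derived above.
function ℓ(θ, x, y; b = 1.0)
    lo, hi = minmax(x, y)
    θ < lo && return (y - x) / b
    θ > hi && return (x - y) / b
    return (2θ - (x + y)) * sign(x - y) / b
end

# Sanity check on random inputs (b = 1): the two expressions agree.
using Random
rng = Random.MersenneTwister(0)
for _ in 1:10^5
    θ, x, y = randn(rng), randn(rng), randn(rng)
    @assert ℓ(θ, x, y) ≈ abs(y - θ) - abs(x - θ)
end
```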
Proof of Assumption 4.5. Using the symmetry of the Laplace distribution around its mean, we have $\mathbb P_{p^{\otimes 2}_{\theta^\star}}(G_1(\theta^\star, u)) = \mathbb P_{p^{\otimes 2}_{\theta^\star}}(G_1(\theta^\star, 1))$ for all $u \in \{\pm 1\}$. Then, by integrating over $x < y$, we obtain
$$\mathbb P_{p^{\otimes 2}_{\theta^\star}}(G_1(\theta^\star, 1)) = \frac{1}{2b^2}\int_{x\in(-\infty,\theta^\star)} e^{(x - \theta^\star)/b}\int_{y\in(2\theta^\star - x, +\infty)} e^{-(y - \theta^\star)/b}\,\mathrm dy\,\mathrm dx = \frac{1}{2b}\int_{x\in(-\infty,\theta^\star)} e^{(2x - 2\theta^\star)/b}\,\mathrm dx = \frac14.$$

Proof of Assumption 4.7. Let $\varepsilon > 0$. Using the symmetry of the Laplace distribution around its mean, we have $F_{\theta^\star,u}(\varepsilon) = F_{\theta^\star,1}(\varepsilon)$ for all $u \in \{\pm 1\}$. Similarly as above, by integrating over $x < y$, we obtain that
$$F_{\theta^\star,1}(\varepsilon) = \mathbb P_{(X,Y)\sim p^{\otimes 2}_{\theta^\star}}\left(0 < V_{\theta^\star,1}(X, Y) \le \varepsilon\right) = \frac{1}{2b^2}\int_{x\in(-\infty,\theta^\star)} e^{(x - \theta^\star)/b}\int_{y\in(2\theta^\star - x,\,2\varepsilon + 2\theta^\star - x)} e^{-(y - \theta^\star)/b}\,\mathrm dy\,\mathrm dx$$
$$= \frac{1}{2b}\left(\int_{x\in(-\infty,\theta^\star)} e^{(2x - 2\theta^\star)/b}\,\mathrm dx - \int_{x\in(-\infty,\theta^\star)} e^{(2x - 2\theta^\star - 2\varepsilon)/b}\,\mathrm dx\right) = \frac{1 - e^{-2\varepsilon/b}}{4}.$$
Therefore, we have
$$F'_{\theta^\star,u}(\varepsilon) = \frac{1}{2b}e^{-2\varepsilon/b}, \quad F^{-1}_{\theta^\star,u}(x) = -\frac b2\log(1 - 4x) \quad \text{and} \quad (F^{-1}_{\theta^\star,u})''(x) = \frac{8b}{(1 - 4x)^2}.$$
Then, we obtain $F'_{\theta^\star,u}(0) = \frac{1}{2b}$, and we can take $x_{\theta^\star,u} = 1/8$ and $M_{\theta^\star,u} = 32b$.

Proof of Assumption 4.2. Let $\varepsilon = |\theta - \theta^\star|$ and $u = \mathrm{sign}(\theta - \theta^\star)$. Using the above computation, we have
$$\mathbb P_{p^{\otimes 2}_{\theta^\star}}(D(\theta^\star, \theta)) \ge \mathbb P_{p^{\otimes 2}_{\theta^\star}}\left(\widetilde D(\theta^\star, \theta^\star + \varepsilon u)\right) \ge \mathbb P_{p^{\otimes 2}_{\theta^\star}}\left(\widetilde D(\theta^\star, \theta^\star + \varepsilon u) \cap G_1(\theta^\star, u)\right) = \mathbb P_{(X,Y)\sim p^{\otimes 2}_{\theta^\star}}(0 < V_{\theta^\star,u}(X, Y) < \varepsilon).$$
Using the above computation, we obtain that $\mathbb P_{(X,Y)\sim p^{\otimes 2}_{\theta^\star}}(0 < V_{\theta^\star,u}(X, Y) < \varepsilon) > 0$, hence $\mathbb P_{p^{\otimes 2}_{\theta^\star}}(D(\theta^\star, \theta)) > 0$.

Proof of Assumption 5.2. Using that $f(x) = x^2/2 - 1 + (1 + x)e^{-x}$ is positive on $\mathbb R_+$, we obtain
$$H^2(p_{\theta^\star}, p_\theta) = 1 - \left(1 + \frac{|\theta - \theta^\star|}{2b}\right)e^{-|\theta - \theta^\star|/(2b)} \le \frac{|\theta - \theta^\star|^2}{8b^2}.$$
First, we notice that $\dim(G_0(\theta^\star)^{\mathsf c} \cup G_0(\theta)^{\mathsf c}) < 2$, hence we can show that
$$\int_{(x,y)\in G_0(\theta^\star)^{\mathsf c}\cup G_0(\theta)^{\mathsf c}}\sqrt{p_{\theta^\star}(x)p_{\theta^\star}(y)p_\theta(x)p_\theta(y)}\,\mathrm dx\,\mathrm dy = 0.$$
We consider the case $\theta^\star < \theta$, since the case $\theta^\star > \theta$ is handled similarly as $\widetilde{\mathrm{BC}}(\theta^\star, \theta) = \widetilde{\mathrm{BC}}(\theta, \theta^\star)$. Let $\varepsilon = \theta - \theta^\star$. By integrating over $x < y$, $\widetilde{\mathrm{BC}}(\theta^\star, \theta)$ decomposes into one term per piece of $D(\theta^\star, \theta)$. Direct computation of the corresponding terms yields, e.g.,
$$\int_{x\in(\theta^\star - \varepsilon,\theta^\star)} e^{x/b}\int_{y\in(2\theta^\star - x,\,\theta^\star + \varepsilon)}\mathrm dy\,\mathrm dx = \int_{x\in(\theta^\star - \varepsilon,\theta^\star)} e^{x/b}(x + \varepsilon - \theta^\star)\,\mathrm dx = e^{(\theta^\star - \varepsilon)/b}\int_{u\in(0,\varepsilon)} u e^{u/b}\,\mathrm du,$$
$$\int_{u\in(0,\varepsilon)} u e^{u/b}\,\mathrm du = b\left(e^{\varepsilon/b}(\varepsilon - b) + b\right), \qquad \iint \mathbb 1(\theta^\star < x < y < \theta^\star + \varepsilon)\,\mathrm dy\,\mathrm dx = \frac{\varepsilon^2}{2},$$
$$\int_{y\in(\theta^\star + \varepsilon,\theta^\star + 2\varepsilon)} e^{-y/b}(\theta^\star + 2\varepsilon - y)\,\mathrm dy = e^{-(\theta^\star + 2\varepsilon)/b}\int_{u\in(0,\varepsilon)} u e^{u/b}\,\mathrm du.$$
Collecting all terms, we obtain a closed-form expression of $\widetilde{\mathrm{BC}}(\theta^\star, \theta^\star + \varepsilon)$, and by symmetry $\widetilde{\mathrm{BC}}(\theta^\star, \theta^\star - \varepsilon) = \widetilde{\mathrm{BC}}(\theta^\star - \varepsilon, \theta^\star)$. Using that $f(x) = 1 - x + x^2/2 - e^{-x}$ is positive on $\mathbb R_+$, we can conclude that
$$\widetilde{\mathrm{BC}}(\theta^\star, \theta) \le \frac{|\theta - \theta^\star|}{b}\left(1 + e^{-2\min\{\theta^\star,\theta\}/b}\right).$$
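The closed form $F_{\theta^\star,1}(\varepsilon) = (1 - e^{-2\varepsilon/b})/4$ derived above can be validated by simulation; below is a minimal Julia sketch, assuming the Distributions.jl package, with helper names and parameter values ours:

```julia
using Distributions, Random

# Empirical check of F_{θ⋆,1}(ε) = P(0 < V ≤ ε) for Lap(θ⋆, b), where
# V = (X + Y)/2 − θ⋆ on G1(θ⋆, 1) = {θ⋆ between X and Y, (X + Y)/2 > θ⋆}.
function empirical_F(ε; θ = 0.0, b = 1.0, n = 10^6, rng = Random.default_rng())
    lap = Laplace(θ, b)
    hits = 0
    for _ in 1:n
        x, y = rand(rng, lap), rand(rng, lap)
        inG1 = min(x, y) ≤ θ ≤ max(x, y)
        v = (x + y) / 2 - θ
        hits += (inG1 && 0 < v ≤ ε)
    end
    hits / n
end

closed_F(ε; b = 1.0) = (1 - exp(-2ε / b)) / 4
println((empirical_F(0.5), closed_F(0.5)))   # both ≈ 0.158
```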
H. Supplementary Experiments

Using the same empirical setup as in Section 6, we conduct additional experiments to support our theoretical claims for other distributions (Appendix H.1), other estimators for Gaussian distributions based on $C_n$ (Appendix H.2), other convex surrogates of the 0-1 loss (Appendix H.3) and normalized/regularized versions of the logistic loss (Appendix H.4).

Reproducibility. Code for reproducing our empirical results is available at https://github.com/tml-epfl/learning-parametric-distributions-from-samples-and-preferences. Our code is implemented in Julia (Bezanson et al., 2017), version 1.11.5. The plots are generated with StatsPlots. The optimization problems defining some of our estimators are solved numerically with JuMP (Lubin et al., 2023), using the Ipopt (Wächter & Biegler, 2006) and HiGHS (Huangfu & Hall, 2018) solvers. Other dependencies are listed in the Readme.md, which provides detailed Julia instructions to reproduce our experiments, as well as a script.sh to run them all at once. Our experiments are conducted on a 12-core Intel(R) Core(TM) Ultra 7 165U 4.90GHz CPU.

Gaussian distribution with known variance. For $\mathcal F_{\mathcal N,1}$, the SPdet and SP estimators are computed with the Ipopt solver. For $\mathcal F_{\mathcal N,I_d}$, the SPdet, SP, DP and WE estimators are computed with the Ipopt solver, and the AE estimator uses the HiGHS solver.

H.1. Accelerated Rates for Other Distributions

H.1.1. Laplace Distribution with Known Scale

Estimators. For $\mathcal F_{\mathrm{Lap},1}$ (Appendix G), we have $\widehat\theta^{\mathrm{SO}}_n = \mathrm{median}(\{X_i\}_{i\in[n]} \cup \{Y_i\}_{i\in[n]})$ and
$$C_n = \{\theta \mid \forall i \in [n],\ Z_i(|Y_i - \theta| - |X_i - \theta|) \ge 0\}.$$
The estimators based on $C_n$ are $\widehat\theta^{\mathrm{AE}}_n \in C_n$, $\widehat\theta^{\mathrm{WE}}_n := \arg\max_{\theta\in C_n}|\theta - \theta^\star|$ and
$$\widehat\theta^{\mathrm{DP}}_n \in \arg\min_{\theta\in C_n}\sum_{i\in[n]}\left(|Y_i - \theta| + |X_i - \theta|\right).$$
Those three estimators are computed with the Ipopt solver.

Figure 4. Estimation errors for (a) $\mathrm{Lap}(\theta^\star, 1)$ where $\theta^\star \sim \mathcal U([1, 2])$ with $N_{\mathrm{runs}} = 10$ and (b) $\mathrm{Rayleigh}(\sqrt{\theta^\star})$ where $\theta^\star \sim \mathcal U([1, 2])$ with $N_{\mathrm{runs}} = 10^2$.

Experiments. Figure 4(a) empirically confirms the difference in estimation rate between the M-estimators (SO MLE), obtaining $O(1/\sqrt n)$, and our estimators based on $C_n$, achieving $O(1/n)$. Moreover, AE and WE perform on par with DP MLE.

H.1.2. Rayleigh Distribution

Let $\sigma > 0$ be the scale parameter characterizing a Rayleigh distribution. In the following, let $\theta = -\frac{1}{2\sigma^2} < 0$ denote the natural parameter of the Rayleigh distribution. We have $\Theta \subseteq \mathbb R_-$, $\mathcal X = \mathbb R_+$ and $k = d = 1$. The probability density function is defined as
$$\forall x \in \mathbb R_+, \quad p_\theta(x) = \exp\left(x^2\theta + \log(x) + \log(-2\theta)\right).$$
Let $\theta^\star \in \Theta$ and $u \in \{\pm 1\}$. It is direct to see that, for all $(x, y) \in \mathbb R^2_+$,
$$\ell_\theta(x, y) = \log\frac{p_\theta(x)}{p_\theta(y)} = (x^2 - y^2)\theta + \log(x/y) \quad \text{and} \quad \frac{\mathrm d\ell_\theta}{\mathrm d\theta}(x, y) = x^2 - y^2 = (x - y)(x + y).$$
Therefore, we have
$$G_0(\theta^\star) = \{(x, y) \in \mathbb R^2_+ \mid |(x^2 - y^2)\theta^\star + \log(x/y)| > 0\},$$
$$G_1(\theta^\star) = \{(x, y) \in \mathbb R^2_+ \mid |x - y| > 0\},$$
$$G_1(\theta^\star, u) = \{(x, y) \in \mathbb R^2_+ \mid u\left((x^2 - y^2)^2\theta^\star + (x^2 - y^2)\log(x/y)\right) < 0\},$$
$$D(\theta^\star, \theta) = \{(x, y) \in \mathbb R^2_+ \mid \left((x^2 - y^2)\theta^\star + \log(x/y)\right)^2 + (x^2 - y^2)(\theta - \theta^\star)\left((x^2 - y^2)\theta^\star + \log(x/y)\right) < 0\},$$
$$\forall (x, y) \in G_1(\theta^\star, u), \quad V_{\theta^\star,u}(x, y) = -u\left(\theta^\star + \frac{\log(x) - \log(y)}{(x - y)(x + y)}\right).$$

Proof that $\mathbb P_{p^{\otimes 2}_{\theta^\star}}(G_1(\theta^\star)) > 0$. It is direct to see that $\dim(G_0(\theta^\star)^{\mathsf c}) < 2$ and $\dim(G_0(\theta^\star)\setminus G_1(\theta^\star)) < 2$. Given that $p^{\otimes 2}_{\theta^\star}$ is a continuous distribution on $\mathbb R^2_+$, we obtain that $\mathbb P_{p^{\otimes 2}_{\theta^\star}}(G_0(\theta^\star)) = \mathbb P_{p^{\otimes 2}_{\theta^\star}}(G_1(\theta^\star)) = 1$.

Proofs of Assumptions 4.4 and 4.5. Since $\ell_\theta(x, y) = (x^2 - y^2)\theta + \log(x/y)$ is linear in $\theta$, we have $D(\theta^\star, \theta) = \widetilde D(\theta^\star, \theta)$. Let $(X, Y) \sim p^{\otimes 2}_{\theta^\star}$. Then, we have
$$\mathbb P_{p^{\otimes 2}_{\theta^\star}}(G_1(\theta^\star, 1)) = \mathbb P\left(\theta^\star < \frac{1}{X^2}\frac{\log(Y/X)}{1 - (Y/X)^2}\right) > 0 \quad \text{and} \quad \mathbb P_{p^{\otimes 2}_{\theta^\star}}(G_1(\theta^\star, -1)) = \mathbb P\left(\theta^\star > \frac{1}{X^2}\frac{\log(Y/X)}{1 - (Y/X)^2}\right) > 0,$$
since this random variable has positive density on a neighborhood of $\theta^\star$.

Estimators. We have $\widehat\theta^{\mathrm{SO}}_n = -2n/\sum_{i\in[n]}(X_i^2 + Y_i^2)$ and
$$C_n = \{\theta \mid \forall i \in [n],\ Z_i\left((X_i^2 - Y_i^2)\theta + \log(X_i/Y_i)\right) \ge 0\}.$$
The estimators based on $C_n$ are $\widehat\theta^{\mathrm{AE}}_n \in C_n$ and $\widehat\theta^{\mathrm{WE}}_n := \arg\max_{\theta\in C_n}|\theta - \theta^\star|$. Those two estimators are computed with the Ipopt solver.

Experiments. Figure 4(b) empirically confirms the difference in estimation rate between the M-estimators (SO MLE), obtaining $O(1/\sqrt n)$, and our estimators based on $C_n$, achieving $O(1/n)$. Moreover, AE and WE perform similarly.
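As an illustration of the Laplace construction of Appendix H.1.1, the feasible set $C_n$ is an interval whose endpoints are extreme midpoints, since the constraint $Z_i(|Y_i - \theta| - |X_i - \theta|) \ge 0$ is equivalent to $Z_i(Y_i - X_i)(\theta - M_i) \le 0$ with $M_i = (X_i + Y_i)/2$. A minimal Julia sketch computing it, assuming Distributions.jl and with helper names ours:

```julia
using Distributions, Random

# Feasible interval Cn for F_{Lap,1}: each preference constrains θ to one
# side of the midpoint Mi = (Xi + Yi)/2, so Cn = [max lower, min upper].
function feasible_interval(θstar; n = 1000, b = 1.0, rng = Random.default_rng())
    lap = Laplace(θstar, b)
    lo, hi = -Inf, Inf
    for _ in 1:n
        x, y = rand(rng, lap), rand(rng, lap)
        z = sign(abs(y - θstar) - abs(x - θstar))   # deterministic preference
        m = (x + y) / 2
        if z * (y - x) > 0      # constraint θ ≤ m
            hi = min(hi, m)
        elseif z * (y - x) < 0  # constraint θ ≥ m
            lo = max(lo, m)
        end
    end
    (lo, hi)
end

lo, hi = feasible_interval(1.5; n = 10^4)
println((hi - lo, lo ≤ 1.5 ≤ hi))   # length O(1/n), contains θ⋆
```

The interval length empirically decays as $O(1/n)$, matching Figure 4(a).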
H.2. Other Estimators for Gaussian Distributions

To better understand the surprising performance of the RU estimator, we consider other estimators that disentangle the effect of RU's randomness from its mean behavior.

Univariate Gaussian. The center estimator (CE) returns the center of the interval $C_n$. The truncated Gaussian estimator (TrG) returns a realization from a Gaussian distribution with mean CE and variance $4/n$, truncated to $C_n$. The truncated MLE (TrMLE) returns the average of the observations $(\{X_i\}_{i\in[n]} \cup \{Y_i\}_{i\in[n]}) \cap C_n$.

Figure 5. Estimation errors for $\mathcal N(\theta^\star, I_d)$ where $\theta^\star \sim \mathcal U([1, 2]^d)$ with $N_{\mathrm{runs}} = 10^2$ for (a) $d = 1$ and (b) $d = 20$.

Figure 5(a) reveals that TrG performs on par with RU, yet CE and TrMLE outperform both TrG and RU. This suggests that being far away from the boundary of $C_n$ improves performance compared to DP MLE, which lies on the boundary of $C_n$ (as observed empirically). Moreover, randomization on $C_n$ worsens performance compared to CE. Using the derivation in the introduction for the univariate Gaussian, it is coherent that CE improves on DP MLE by a multiplicative constant: the average of those two (non-independent) random variables decreases faster. Formally, this could be proven by refining the proof of Lemma 4.6 to account for the property that $n = N_{\theta^\star,-1} + N_{\theta^\star,1}$.

Multivariate Gaussian. For $d > 1$, multiple centers exist. We use the Chebyshev center estimator (CCE) of $C_n$. Figures 5(b) and 6 show that CCE outperforms AE by a constant margin. It only outperforms DP MLE in the regime of large $n$ compared to $d$, and performs worse than SO MLE for small $n$. Geometrically, for small $n$ and large $d$, we conjecture that the random polytope $C_n$ is more likely to be spiky along some directions. Due to those distant vertices, the center becomes a worse estimator than DP MLE, since the average is intuitively less robust to outliers. In contrast, DP MLE dominates SO MLE statistically (Lemma 4.1), hence it achieves the rate $O(\sqrt{d/n})$ when $n$ is small compared to $d$.

Figure 6. Estimation errors as a function of $d$ for $\mathcal N(\theta^\star, I_d)$ where $\theta^\star \sim \mathcal U([1, 2]^d)$, for $n = 10^4$ and $N_{\mathrm{runs}} = 10^2$.

H.3. Estimators Based on Convex Surrogates of the 0-1 Loss

While DP MLE minimizes an objective involving the 0-1 loss, SP MLE minimizes an objective involving the logistic loss $f_{\mathrm{Log}}(x) = \log(1 + \exp(-x))$. As in Tang et al. (2024b), we can generalize this approach to any convex surrogate $f$ of the 0-1 loss, see Figure 7(a). For example, we consider the hinge loss (Hin), i.e., $f_{\mathrm{Hin}}(x) := \max\{0, 1 - x\}$, the square loss (Squ), i.e., $f_{\mathrm{Squ}}(x) := (1 - x)^2$, the truncated square loss (TrS), i.e., $f_{\mathrm{TrS}}(x) := \max\{0, 1 - x\}^2$, the Savage loss (Sav), i.e., $f_{\mathrm{Sav}}(x) := (1 + \exp(x))^{-2}$, and the exponential loss (Exp), i.e., $f_{\mathrm{Exp}}(x) := \exp(-x)$. Given $(X_i, Y_i, Z_i)_{i\in[n]} \sim q^{\otimes n}_{\theta^\star,h_{\mathrm{det}}}$ and a loss $f$, we consider the estimator
$$\widehat\theta^{f}_n \in \arg\min_{\theta\in\Theta}\ L^{\mathrm{SO}}_n(\theta) + \sum_{i\in[n]} f\left(Z_i\ell_\theta(X_i, Y_i)\right).$$
All those estimators are computed with the Ipopt solver. Figure 7(b) shows that all estimators perform on par with SP MLE, i.e., the one based on the logistic loss.
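As an illustration, for $\mathcal F_{\mathcal N,1}$ the surrogate-based objective can be minimized with any generic univariate optimizer rather than the JuMP/Ipopt setup used in our experiments. Below is a minimal Julia sketch, assuming the Optim.jl package, with loss and helper names ours:

```julia
using Optim, Random

loss_logistic(x) = log1p(exp(-x))
loss_hinge(x) = max(0.0, 1.0 - x)

# Objective L^SO_n(θ) + Σ_i f(Zi ℓθ(Xi, Yi)) for N(θ, 1), where
# ℓθ(x, y) = (x − y)(θ − (x + y)/2) and L^SO_n is the Gaussian NLL
# up to constants: Σ_i ((Xi − θ)² + (Yi − θ)²)/2.
function surrogate_estimate(X, Y, Z, f)
    obj(θ) = sum(((x - θ)^2 + (y - θ)^2) / 2 + f(z * (x - y) * (θ - (x + y) / 2))
                 for (x, y, z) in zip(X, Y, Z))
    Optim.minimizer(optimize(obj, -10.0, 10.0))   # Brent's method on [−10, 10]
end

rng = Random.MersenneTwister(1)
θstar, n = 1.5, 1000
X, Y = θstar .+ randn(rng, n), θstar .+ randn(rng, n)
Z = sign.((X .- Y) .* (θstar .- (X .+ Y) ./ 2))   # deterministic preferences
println((surrogate_estimate(X, Y, Z, loss_logistic),
         surrogate_estimate(X, Y, Z, loss_hinge)))
```

Both losses yield estimates close to $\theta^\star$, consistent with the on-par behavior reported in Figure 7(b).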
H.4. Impact of Normalization and Regularization

The estimator defined in Appendix H.3 can be further generalized by introducing a regularization parameter $\lambda \ge 0$ and a normalization parameter $\beta > 0$, see, e.g., Gorbatovski et al. (2025). Given $(X_i, Y_i, Z_i)_{i\in[n]} \sim q^{\otimes n}_{\theta^\star,h_{\mathrm{det}}}$, a loss $f$ and regularization/normalization parameters $(\lambda, \beta)$, we consider the estimator
$$\widehat\theta^{f,\lambda,\beta}_n \in \arg\min_{\theta\in\Theta}\ L^{\mathrm{SO}}_n(\theta) + \lambda\sum_{i\in[n]} f\left(\beta Z_i\ell_\theta(X_i, Y_i)\right).$$
While similar modifications could be made for other losses, we focus on the logistic loss $f_{\mathrm{Log}}(x) = \log(1 + \exp(-x))$. In particular, we recover SP MLE by taking $\lambda = \beta = 1$. Figures 8(a) and (b) showcase the mild impact of normalization and regularization.

Figure 7. (a) Figure 2 in Tang et al. (2024b): notable examples of binary classification loss functions (logistic, squared, hinge, exponential, truncated squared, Savage and 0-1 losses). (b) Estimation errors when minimizing the empirical losses for $\mathcal N(\theta^\star, 1)$ where $\theta^\star \sim \mathcal U([1, 2])$ with $N_{\mathrm{runs}} = 10$.

Figure 8. Estimation errors when minimizing the empirical losses for $\mathcal N(\theta^\star, 1)$ where $\theta^\star \sim \mathcal U([1, 2])$ with $N_{\mathrm{runs}} = 10^2$ when (a) normalizing by $\beta$ with regularization $\lambda = 1$ and (b) regularizing by $\lambda$ with normalization $\beta = 1$.