# Learning Social Welfare Functions

Kanad Shrikar Pardeshi (Carnegie Mellon University, kpardesh@andrew.cmu.edu) · Itai Shapira (Harvard University, itaishapira@g.harvard.edu) · Ariel D. Procaccia (Harvard University, arielpro@seas.harvard.edu) · Aarti Singh (Carnegie Mellon University, aarti@andrew.cmu.edu)

**Abstract.** Is it possible to understand or imitate a policy maker's rationale by looking at past decisions they made? We formalize this question as the problem of learning social welfare functions belonging to the well-studied family of power mean functions. We focus on two learning tasks: in the first, the input is vectors of utilities of an action (decision or policy) for individuals in a group and their associated social welfare as judged by a policy maker, whereas in the second, the input is pairwise comparisons between the welfares associated with a given pair of utility vectors. We show that power mean functions are learnable with polynomial sample complexity in both cases, even if the social welfare information is noisy. Finally, we design practical algorithms for these tasks and evaluate their performance.

## 1 Introduction

Consider a standard decision-making setting that includes a set of possible actions (decisions or policies), and a set of individuals who assign utilities to the actions. A social welfare function aggregates the utilities into a single number, providing a measure for the evaluation of actions with respect to the entire group. Utilitarian social welfare, for example, is the sum of utilities, whereas egalitarian social welfare is the minimum utility. Given two actions that induce the utility vectors (3, 0) and (1, 1) for two individuals, the former is preferred when measured by utilitarian social welfare, whereas the latter is preferred according to egalitarian social welfare.

When competent decision makers adopt policies that affect groups or even entire societies, they may have a social welfare function in mind, but it is typically implicit. Our goal is to learn a social welfare function that is consistent with the decision maker's rationale. This learned social welfare function has at least two compelling applications: first, understanding the decision maker's priorities and ideas of fairness, and second, potentially imitating a successful decision maker's policy choices in future dilemmas or in other domains.

As a motivating example, consider the thousands of decisions made by public health officials in the United States during the Covid-19 pandemic: opening and closing schools, restaurants, and gyms, requirements for masking and social distancing, lockdown recommendations, and so on. Each decision induces utilities for individuals in the population; closing schools, for instance, provides higher utility to medically vulnerable individuals compared to opening them, but arguably has much lower utility for students and parents. Assuming that healthcare officials were acting in the public interest and (approximately) optimizing a social welfare function, which one did they have in mind? Our goal is to answer such questions by learning from example decisions.

Another example we consider in this paper is that of allocating food resources in a community by a US-based nonprofit to hundreds of recipient organizations. Working with a dataset of utilities of 18
different stakeholders such as donors, volunteers, dispatchers, and recipient organizations [11], we consider the task of learning the social welfare function implicit in the decisions that may be made by the nonprofit.

Table 1: A summary of our results regarding the sample complexity of the various tasks. Here, $\xi = u_{\max}(u_{\max} - u_{\min})$ and $\kappa = \log(u_{\max}/u_{\min})$, with all $d$ individual utilities assumed to lie in the range $[u_{\min}, u_{\max}]$; $\rho \in [0, 1/2)$ is the probability of mislabeling in the i.i.d. noise model, and $\tau_{\max}$ is the maximum temperature of the logistic noise model.

| Social Welfare Information | Loss | Known Weights | Unknown Weights |
| --- | --- | --- | --- |
| Cardinal values | $\ell_2$ | $O(\xi^2)$ | $O(\xi^2 d \log d)$ |
| Pairwise comparisons | 0-1 | $O(\log d)$ | $O(d \log d)$ |
| Pairwise comparisons with i.i.d. noise | 0-1 | $O\!\left(\frac{\log d}{(1-2\rho)^2}\right)$ | $O\!\left(\frac{d \log d}{(1-2\rho)^2}\right)$ |
| Pairwise comparisons with logistic noise estimation | Logistic | $O(\tau_{\max}^2 \kappa^2)$ | $O(\tau_{\max}^2 \kappa^2 d \log d)$ |

In order to formalize this problem, there are two issues we need to address. First, to facilitate sample-efficient learnability, we need to make some structural assumptions on the class of social welfare functions. We focus on the class of weighted power mean functions, which includes the most prominent social welfare functions: the aforementioned utilitarian and egalitarian welfare, as well as Nash welfare (the product of utilities). This class is a natural choice, as it is the only class of functions feasible under a set of reasonable social choice axioms such as monotonicity, symmetry, and scale invariance [20, 7]. Second, we need to specify the input to our learning problem. There are two natural options, and we explore both: utility vectors coupled with their values under a target social welfare function, or pairwise comparisons between utility vectors. We demonstrate sample complexity bounds for both types of inputs, where the social welfare values or comparisons can be noiseless or corrupted by noise. We note that estimating the utility vector associated with any particular decision or policy is ostensibly challenging, but in fact this has been done in prior work and we have access to relevant data, as we discuss in Section 6.

**Our contributions.** Learning weighted power mean functions is a non-standard regression or classification problem due to the complex, highly nonlinear dependence on the power parameter $p$, which is the parameter of interest. While one can invoke standard hyperparameter selection approaches such as cross-validation to select $p$ from a grid of values, the infinite domain of $p$ does not allow one to demonstrate a polynomial sample complexity without deriving an appropriate cover. We derive statistical complexity measures such as the pseudo-dimension, covering number, VC dimension, and Rademacher complexity for this function class, under both cardinal and ordinal observations of the social welfare function. Our sample complexity bounds are summarized in Table 1. These results may be of interest for other problems where weighted power mean functions are used, such as fairness in federated learning [12].

We highlight some key contributions of this paper. We first establish the statistical learnability of widely used social welfare functions belonging to the weighted power mean family. We derive a polynomial sample complexity of $O(1)$ (with respect to the number of individuals $d$) for learning from cardinal social welfare values under the $\ell_2$ loss, and of $O(\log d)$ for learning from comparisons under the 0-1 loss, in the unweighted/known-weights setting. The upper bounds leverage the monotonicity of the target functions in $p$ in the cardinal case, and the restricted number of roots in the ordinal setting.
We also prove matching lower bounds for the ordinal case. We further establish a polynomial sample complexity of $O(d \log d)$ for both the cardinal and the ordinal task in the setting where the individual weights are unknown. This result is intuitive, as learning an additional $d$ weight parameters incurs a proportional increase in the sample requirement. We then analyze the sample complexity for the more practical ordinal task under different noise models (i.i.d. and logistic noise) and characterize the effect of noise on learning. In the i.i.d. noise setting, the sample complexity increases with the noise level $\rho$, converging to the noiseless complexity as $\rho \to 0$. Unlike the i.i.d. case, where $\rho$ is known, in the logistic noise model we also consider estimating the noise level $\tau$ and evaluate the likelihood with respect to the noisy distribution. As $\tau$ increases, estimating the noise becomes more challenging, leading to a higher sample complexity. Finally, despite the problem's non-convexity, we demonstrate a practical algorithm for learning weighted power mean functions across tasks, using simulated data and a real-world food resource allocation dataset from Lee et al. [11]. Our empirical results validate the theoretical bounds and highlight the algorithm's performance across parameter settings.

**Related work.** Conceptually, our work is related to that of Procaccia et al. [18], who study the learnability of decision rules that aggregate individual utilities. In their work, however, individual utilities are represented as rankings over a set of alternatives (rather than cardinal utilities, as in our case), and the rule to be learned is a voting rule that maps input rankings to a winning alternative. They provide sample complexity results for two families of voting rules: positional scoring rules and voting trees. Basu and Echenique [1] derive VC dimension bounds for additive, Choquet, and max-min expected utility for decision-making under uncertainty, bounding the number of pairwise comparisons needed to falsify a candidate decision rule and establishing learnability for these classes. Their work addresses decision rules operating on probability distributions rather than utility vectors, resulting in technical distinctions from ours; for instance, the max-min rule is not learnable in their setting (infinite VC dimension), whereas it is learnable in ours. Kalai [10] studies the learnability of choice functions and establishes PAC guarantees. Choice functions are defined with respect to a fixed and finite set of alternatives $X$, with each sample being a subset of $X$ and the choice over this subset. In contrast, our approach involves learning a function over an infinite action space where utilities are known. Pellegrini et al. [17] conduct experiments on learning aggregation functions that are assumed to be compositions of $L_p$ means, observing that they perform favorably in various tasks such as scalar aggregation, set expansion, and graph tasks. Our work provides a more theoretical analysis, proving the sample-efficient learnability of weighted power mean aggregation functions. Melnikov and Hüllermeier [14] consider learning from actions with feature vectors and their global scores, with the local scores of individuals unavailable for learning. They learn both local and global score functions, and consider the ordered weighted averaging operator for aggregating local scores. While we assume that each individual's local score is given, our aggregation function belongs to a richer function family motivated by social choice theory.
## 2 Problem Setup

We assume that the decision-making process concerns $d$ individuals. In the decision-making setting we consider, each action is associated with a positive utility vector $u \in [u_{\min}, u_{\max}]^d \subseteq \mathbb{R}^d_+$, which describes the utilities derived by the $d$ individuals. We encode the impact of each individual $i \in [d]$ on the decision-making process through a weight $w_i \ge 0$ such that $\sum_{i=1}^d w_i = 1$; these weights together form a weight vector $w \in \Delta_{d-1}$. The weight vector might be a known or an unknown quantity. A common instance in which the weight vector is known is when all agents are assumed to have an equal say, in which case $w = \mathbf{1}_d/d$. For all settings we consider, we provide PAC guarantees for both known weights and unknown weights.

We assume that the decision-making process assigns a cardinal social welfare value to each action. However, this social welfare value can be latent and need not be available to us as data. For the first task, concerned with cardinal decision values, the social welfare values are available and can be used for learning. For the second task, both actions in a pair have latent social welfare values that are not available to us; however, which action in the pair is preferred is known to us. We consider learning bounds with the empirical risk minimization (ERM) algorithm for all the losses in this work, with $\hat{p}$ being learned when the weights are known, and $(\hat{w}, \hat{p})$ being learned when the weights are unknown.

**Power Mean.** The (weighted) power mean is defined for $p \in \mathbb{R} \cup \{-\infty, +\infty\}$, and for $u \in \mathbb{R}^d_+$, $w \in \Delta_{d-1}$, it is

$$M(u; w, p) = \begin{cases} \left(\sum_{i=1}^d w_i u_i^p\right)^{1/p} & p \neq 0, \\ \prod_{i=1}^d u_i^{w_i} & p = 0. \end{cases}$$

It is sometimes more convenient to use the (natural) log power mean $\log M(u; w, p)$ than the power mean itself. Since $\sum_{i=1}^d w_i = 1$, in effect we have $d$ free parameters, $w_1, \ldots, w_{d-1}$ and $p$. We refer to the weighted power mean family with known weight $w$ as $M_{w,d} = \{M(\cdot\,; w, p) \mid p \in \mathbb{R}\}$. If the weight is unknown, the weighted power mean family is denoted by $M_d = \{M(\cdot\,; w, p) \mid p \in \mathbb{R},\ w \in \Delta_{d-1}\}$.

The power mean family is a natural representation for social welfare functions. Cousins [7, 6] puts forward a set of axioms under which the set of possible welfare functions is precisely the weighted power mean family. An unweighted version of these functions results in the family of constant elasticity of substitution (CES) welfare functions [8], which are widely studied in econometrics. To show the generality of this family of functions, we list a few illustrative cases:

- $M(u; w, p = -\infty) = \min_{i \in [d]} u_i$, which corresponds to egalitarian social welfare.
- $M(u; w, p = 0) = \prod_{i=1}^d u_i^{w_i}$, which corresponds to a weighted version of Nash social welfare.
- $M(u; w, p = 1) = \sum_{i=1}^d w_i u_i$, which corresponds to weighted utilitarian welfare.
- $M(u; w, p = +\infty) = \max_{i \in [d]} u_i$, which corresponds to the maximum individual utility.

We note that for $p = \pm\infty$, the welfare value is independent of $w$. With $w_i = 1/d$ for all $i \in [d]$, we get the conventional interpretations of the welfare notions mentioned above.

The power mean family has some useful properties. An obvious one is that $M(u; w, p) \in [u_{(1)}, u_{(d)}]$, where $u_{(1)}$ and $u_{(d)}$ denote the first and $d$-th order statistics of $u = (u_1, \ldots, u_d)$; $u_{(1)}$ is attained at $p = -\infty$, and $u_{(d)}$ is attained at $p = +\infty$. A more general observation is the following:

**Lemma 2.1.** (a) $M(u; w, p)$ is nondecreasing with respect to $p$ for all $u \in [u_{\min}, u_{\max}]^d$, $w \in \Delta_{d-1}$. (b) $M(u; w, p)$ is monotonic with respect to $w_i$ for each $i \in [d-1]$, for all $u \in [u_{\min}, u_{\max}]^d$, $p \in \mathbb{R}$. (c) $M(u; w, p)$ and $\log M(u; w, p) - \log M(v; w, p)$ are quasilinear with respect to $w$ when $p$ is fixed.
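To make the definition concrete, here is a minimal numerical sketch of the weighted power mean (our own illustration, not from the paper; the function name and the log-space evaluation are our choices). It handles the limiting cases $p = 0$ and $p = \pm\infty$ explicitly:

```python
import numpy as np

def power_mean(u, w, p):
    """Weighted power mean M(u; w, p) for positive utilities u and weights w
    (assumed strictly positive here) on the simplex. Handles the limiting
    cases p = 0 (weighted geometric mean, i.e., Nash welfare) and
    p = +/- infinity (max / min utility, independent of w)."""
    u = np.asarray(u, dtype=float)
    w = np.asarray(w, dtype=float)
    if p == 0:
        # Weighted geometric mean, computed in log space for stability.
        return np.exp(np.dot(w, np.log(u)))
    if np.isinf(p):
        return u.max() if p > 0 else u.min()
    # Generic case: (sum_i w_i u_i^p)^(1/p), evaluated in log space
    # (log-sum-exp) to avoid overflow for large |p|.
    z = p * np.log(u) + np.log(w)
    return np.exp(np.logaddexp.reduce(z) / p)

u = np.array([3.0, 1.0, 2.0])
w = np.ones(3) / 3
for p in [-np.inf, -1.0, 0.0, 1.0, np.inf]:
    print(p, power_mean(u, w, p))  # values are nondecreasing in p (Lemma 2.1(a))
```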
This monotonicity of the power mean in $w$ and $p$ was also noted by Qi et al. [19]. A proof of the above lemma is provided in Appendix A.1.

## 3 Cardinal Social Welfare

We first consider the case where we know the cardinal value of the social welfare associated with each action; learning in this setting thus corresponds to regression. Formally, we assume an underlying distribution $D$ over $[u_{\min}, u_{\max}]^d \times [u_{\min}, u_{\max}]$, jointly over utilities and social welfare values. We receive i.i.d. samples $\{(u_i, y_i)\}_{i=1}^n \sim D^n$, with $u_i$ the utility vector and $y_i \in [u_{i(1)}, u_{i(d)}]$ the social welfare value associated with action $i$. We consider the $\ell_2$ loss between $M(u_i; w, p)$ and $y_i$. The true risk in this case is

$$R(w, p) = \mathbb{E}_{(u, y) \sim D}\left[\left(M(u; w, p) - y\right)^2\right].$$

To analyze the PAC learnability of this setting, we first provide bounds on the pseudo-dimensions¹ of $M_{w,d}$ and $M_d$. We begin by noting that $M(u; w, p) = u_{(d)}\, M(r; w, p)$, where $r \in (0, 1]^d$ and $r_i = u_i / u_{(d)}$. Since $M(r; w, p) \in [0, 1]$, we can determine the pseudo-dimensions of this function class. We now define the function classes

$$S_{w,d} = \left\{f(u) = M(r; w, p) \mid p \in \mathbb{R}\right\}, \qquad S_d = \left\{f(u) = M(r; w, p) \mid (w, p) \in \Delta_{d-1} \times \mathbb{R}\right\}.$$

We then have the following bounds on pseudo-dimensions:

**Lemma 3.1.** (a) If $w$ is known, then $\mathrm{Pdim}(S_{w,d}) = 1$. (b) If $w$ is not known, then $\mathrm{Pdim}(S_d) < 8(d \log_2 d + 1)$.

A detailed proof is provided in Appendix A.3. We highlight the fact that $p$ and $w$ are the parameters of the log power mean function family, which calls for the novel bounds provided in this work. These bounds on the pseudo-dimensions can now be used to obtain PAC bounds:

**Theorem 3.2.** Given a set of samples $\{(u_i, y_i)\}_{i=1}^n$ drawn from a distribution $D^n$, for any $\delta > 0$, the following holds with probability at least $1 - \delta$ with respect to the $\ell_2$ loss function, where $\xi = u_{\max}(u_{\max} - u_{\min})$ and $c > 0$ is the constant of Lemma A.6:
(a) If $w$ is known, then
$$R(w, \hat{p}) - \inf_{p \in \mathbb{R}} R(w, p) \le 16\xi\sqrt{\frac{2\log 2 + 2\log n}{n}} + \frac{16\xi c}{\sqrt{n}} + 6\sqrt{\frac{\log(2/\delta)}{2n}}.$$
(b) If $w$ is unknown, then
$$R(\hat{w}, \hat{p}) - \inf_{(w, p) \in \Delta_{d-1} \times \mathbb{R}} R(w, p) \le 16\xi\sqrt{\frac{2\log 2 + 16(d\log_2 d + 1)\log n}{n}} + \frac{16\xi c}{\sqrt{n}} + 6\sqrt{\frac{\log(2/\delta)}{2n}}.$$

For the complete proof, see Appendix A.5. Below, we provide a proof sketch.

*Proof sketch.* We first use the pseudo-dimensions found above to bound the Rademacher complexity of $M_d$ and $M_{w,d}$ in Lemma A.6. Since $M(u_i; w, p) \in [u_{\min}, u_{\max}]$ and $y_i \in [u_{\min}, u_{\max}]$, the residual $M(u_i; w, p) - y_i$ lies in $[u_{\min} - u_{\max}, u_{\max} - u_{\min}]$, on which the $\ell_2$ loss is Lipschitz continuous with Lipschitz constant $2(u_{\max} - u_{\min})$. Using Lemma A.6 and Talagrand's contraction lemma, we obtain the bounds $\hat{R}(\ell \circ M_{w,d}) \le 2(u_{\max} - u_{\min})\,\hat{R}(M_{w,d})$ and $\hat{R}(\ell \circ M_d) \le 2(u_{\max} - u_{\min})\,\hat{R}(M_d)$. These Rademacher complexity bounds are then used to obtain the uniform convergence bounds above.

These bounds are distribution-free, the only assumption being that all utilities and social welfare values are in the range $[u_{\min}, u_{\max}]$. They also imply an $O(1)$ and an $O(d \log d)$ dependence of the sample complexity on $d$ for known and unknown weights, respectively. Moreover, we observe the dependence of the upper bound on $u_{\max} - u_{\min}$ for the $\ell_2$ loss. We note that when $u_{\max} = u_{\min} = u_0$, all utilities and social welfare values equal $u_0$; in this case, the Rademacher complexity bound is also zero, which is expected.

Computationally, $M(u; w, p)$ is non-convex in $w$ and $p$, which means that the $\ell_2$ loss is also non-convex. However, we observe from Lemma 2.1 (c) that $M(u; w, p)$ is quasilinear with respect to $w$ for fixed $p$, which makes the $\ell_2$ loss function quasiconvex for every single sample $(u, y)$.² We use this fact to construct a practical algorithm.

¹ For a formal definition of pseudo-dimension, refer to Definition A.2. A comprehensive review can be found in Vidyasagar [22].

² A detailed explanation of the quasiconvexity of the $\ell_2$ loss is provided in Appendix A.2.1.
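For intuition on the known-weights cardinal task, which by Lemma 3.1 (a) is effectively one-dimensional in $p$, here is a minimal illustrative ERM-by-grid-search sketch (our own construction, not the paper's algorithm; it reuses `power_mean` from the sketch in Section 2, and the grid and data are hypothetical):

```python
import numpy as np

def fit_p_cardinal(U, y, w, p_grid):
    """ERM over a grid of p values for the l2 loss with known weights w."""
    risks = [np.mean([(power_mean(u, w, p) - yi) ** 2 for u, yi in zip(U, y)])
             for p in p_grid]
    return p_grid[int(np.argmin(risks))]

rng = np.random.default_rng(0)
U = rng.uniform(1.0, 10.0, size=(200, 5))              # 200 actions, d = 5
w_true = np.ones(5) / 5
y = np.array([power_mean(u, w_true, 1.5) for u in U])  # noiseless welfare, p* = 1.5
p_grid = np.concatenate([-np.logspace(2, -2, 25), [0.0], np.logspace(-2, 2, 25)])
print(fit_p_cardinal(U, y, w_true, p_grid))            # picks a grid point near 1.5
```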
A shortcoming of this setting is that decision makers are required to provide a social welfare value for each action. A more natural setting might be one where decision makers only provide their preferences between actions (potentially just their revealed preferences, i.e., the choices they have made in the past), and we address this case next.

## 4 Pairwise Preference Between Actions

For this setting, we assume an underlying distribution $D$ over $[u_{\min}, u_{\max}]^d \times [u_{\min}, u_{\max}]^d \times \{-1, +1\}$. We obtain i.i.d. samples $\{((u_i, v_i), y_i)\}_{i=1}^n \sim D^n$, where $(u_i, v_i)$ are the utilities for the $i$-th pair of actions, and $y_i$ is a comparison between their (latent) social welfare values. We encode the comparison function as $C : [u_{\min}, u_{\max}]^d \times [u_{\min}, u_{\max}]^d \to \{-1, +1\}$, with

$$C((u, v); w, p) = \mathrm{sign}\left(\log M(u; w, p) - \log M(v; w, p)\right).$$

We denote the family of the above functions by $C_{w,d} = \{C(\cdot\,; w, p) \mid p \in \mathbb{R}\}$ when the weights are known, and $C_d = \{C(\cdot\,; w, p) \mid p \in \mathbb{R},\ w \in \Delta_{d-1}\}$ when the weights are unknown. We consider learning with the 0-1 loss between $C((u_i, v_i); w, p)$ and $y_i$. The true risk in this case is

$$R(w, p) = \mathbb{E}_{((u, v), y) \sim D}\left[\frac{1 - y\, C((u, v); w, p)}{2}\right].$$

To provide convergence guarantees for the above setting, we bound the VC dimension of the comparison-based function classes mentioned above:

**Lemma 4.1.** (a) If $w$ is known, then $\mathrm{VC}(C_{w,d}) < 2(\log_2 d + 1)$. (b) If $w$ is unknown, then $\mathrm{VC}(C_d) < 8(d \log_2 d + 1)$. (c) (Lower bounds) $\mathrm{VC}(C_{w,d}) \ge \log_2 d + 1$, and $\mathrm{VC}(C_d) \ge d - 1$.

The detailed proof of the above lemma is provided in Appendix A.6. We find the asymptotically tight lower bound for the known weights case rather surprising, as it is a priori unclear that the correct bound should be superconstant and scale with $d$. The finiteness of the VC dimension guarantees PAC learnability, and we get uniform convergence bounds using the VC theorem.

**Theorem 4.2.** Given samples $\{((u_i, v_i), y_i)\}_{i=1}^n \sim D^n$, for the 0-1 loss and any $\delta > 0$, with probability at least $1 - \delta$:
(a) If $w$ is known, then
$$R(w, \hat{p}) - \inf_{p \in \mathbb{R}} R(w, p) \le 16\sqrt{\frac{2(\log_2 d + 1)\log(n+1) + \log(8/\delta)}{n}}.$$
(b) If $w$ is unknown, then
$$R(\hat{w}, \hat{p}) - \inf_{(w, p) \in \Delta_{d-1} \times \mathbb{R}} R(w, p) \le 16\sqrt{\frac{8(d \log_2 d + 1)\log(n+1) + \log(8/\delta)}{n}}.$$

We note that unlike the bounds on the $\ell_2$ loss of Theorem 3.2, these bounds on the 0-1 loss are independent of the range of utility values and only depend on $d$. They provide sample complexity bounds which depend on $d$ as $O(\log d)$ and $O(d \log d)$ for known and unknown weights, respectively. Despite these PAC guarantees, empirical risk minimization can be particularly difficult in this case, since the loss function as well as the function class $\log M(u; w, p) - \log M(v; w, p)$ can be non-convex. To illustrate this non-convexity, we plot the value of the above function for two pairs of utility vectors with respect to $p$ in Figure 6, with $d = 6$ and $w = \mathbf{1}_d/d$. However, the quasilinearity of $\log M(u; w, p) - \log M(v; w, p)$ in $w$ for fixed $p$ can be used to design efficient algorithms.

### 4.1 Convergence Bounds Under I.I.D. Noise

Decision making can be especially challenging if two actions are difficult to compare, and the preference data we obtain can potentially be noisy. We first consider each comparison to be mislabeled in an i.i.d. manner with known probability $\rho \in [0, 1/2)$. We make use of the framework developed by Natarajan et al. [16], and we consider convergence guarantees under the 0-1 loss. Specifically, the unbiased estimator of $\ell_{0\text{-}1}$ is

$$\tilde{\ell}_{0\text{-}1}(t, y) = \frac{(1 - \rho)\,\ell_{0\text{-}1}(t, y) - \rho\,\ell_{0\text{-}1}(t, -y)}{1 - 2\rho}.$$

We conduct ERM with respect to $\tilde{\ell}_{0\text{-}1}$ to obtain $(\hat{w}, \hat{p}) \in \Delta_{d-1} \times \mathbb{R}$ (learning only $\hat{p}$ if the weights are known).
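To make the estimator concrete, the following sketch (our illustration; `power_mean` is reused from the Section 2 sketch) implements the comparison function $C$ and the unbiased surrogate $\tilde{\ell}_{0\text{-}1}$ of Natarajan et al. [16]; taking the expectation over the label flips recovers the noiseless 0-1 loss exactly:

```python
import numpy as np

def compare(u, v, w, p):
    """C((u, v); w, p) = sign(log M(u; w, p) - log M(v; w, p))."""
    return np.sign(np.log(power_mean(u, w, p)) - np.log(power_mean(v, w, p)))

def loss_01(t, y):
    """Plain 0-1 loss between a predicted sign t and a label y in {-1, +1}."""
    return 0.5 * (1.0 - t * y)

def unbiased_loss_01(t, y, rho):
    """Unbiased estimator of the clean 0-1 loss when the observed label y has
    been flipped i.i.d. with known probability rho < 1/2 (Natarajan et al. [16])."""
    return ((1 - rho) * loss_01(t, y) - rho * loss_01(t, -y)) / (1 - 2 * rho)
```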
We observe that $\ell_{0\text{-}1}(t, y) = (1 - ty)/2$ is $1/2$-Lipschitz in $t$ for every $y \in \{-1, +1\}$. Using Theorem 3 of Natarajan et al. [16], we get the following convergence bounds:

**Theorem 4.3.** Given samples $\{((u_i, v_i), y_i)\}_{i=1}^n \sim D^n$, for any $\delta > 0$ and any $\rho \in [0, 1/2)$, with probability at least $1 - \delta$ with respect to the 0-1 loss:
(a) If $w$ is known, then
$$R(w, \hat{p}) - \inf_{p \in \mathbb{R}} R(w, p) \le \frac{8}{1 - 2\rho}\sqrt{\frac{(\log_2 d + 1)\log(n+1)}{n}} + O\!\left(\frac{1}{1 - 2\rho}\sqrt{\frac{\log(1/\delta)}{n}}\right).$$
(b) If $w$ is unknown, then
$$R(\hat{w}, \hat{p}) - \inf_{(w, p) \in \Delta_{d-1} \times \mathbb{R}} R(w, p) \le \frac{16}{1 - 2\rho}\sqrt{\frac{(d \log_2 d + 1)\log(n+1)}{n}} + O\!\left(\frac{1}{1 - 2\rho}\sqrt{\frac{\log(1/\delta)}{n}}\right).$$

A detailed proof of the above theorem is provided in Appendix A.7. We note that although ERM is conducted with respect to $\tilde{\ell}_{0\text{-}1}$ on the noisy distribution, the risks are defined on the underlying noiseless distribution. This gives $O(\log d/(1-2\rho)^2)$ and $O(d \log d/(1-2\rho)^2)$ sample complexities for the known and unknown weights cases, respectively. We note that when $\rho = 0$, the above bounds reduce to the noiseless bounds of Theorem 4.2. Since the noise level $\rho$ is usually not known to us, it can be estimated using cross-validation, as suggested by Natarajan et al. [16]. However, conducting ERM on $\tilde{\ell}_{0\text{-}1}$ might be prohibitively difficult due to the non-convex nature of the function. An i.i.d. noise model might also be inappropriate in certain settings; we next consider a more natural noise model.

## 5 Pairwise Preference With Logistic Noise

Intuitively, we expect that two actions would be harder to compare if their social welfare values are closer to each other. We formalize this intuition in the form of a noise model inspired by the BTL noise model [4, 13]. Let $w^*$ and $p^*$ be the true power mean parameters, and let $\tau^* \in [0, \tau_{\max}]$ be a temperature parameter. For an action pair $(u, v)$, we assume that the probability of $u$ being preferred to $v$ is

$$P\left(y = 1 \mid (u, v); w^*, p^*, \tau^*\right) = \frac{1}{1 + \exp\left(-\tau^*\left(\log M(u; w^*, p^*) - \log M(v; w^*, p^*)\right)\right)}. \tag{1}$$

We see that a larger difference between the log power means of $u$ and $v$ translates to a higher probability of $u$ being preferred. If $u$ and $v$ lie on the same level set of $\log M(\cdot\,; w^*, p^*)$, the probability becomes $0.5$, which matches the intuition of both actions being equally preferred. We also note the dependence of the probability on $\tau^*$: a higher $\tau^*$ corresponds to more confidence in the preferences, with $\tau^* = 0$ meaning indifference between all pairs of actions. The mislabeling probability is also invariant to scaling of $u$ and $v$.

Our learning task now becomes estimating $w$, $p$, and $\tau$ given data. We denote the function family in this case by $T_{w,d} = \{(u, v) \mapsto \tau(\log M(u; w, p) - \log M(v; w, p)) \mid \tau \in [0, \tau_{\max}],\ p \in \mathbb{R}\}$ when the weights are known, and by $T_d = \{(u, v) \mapsto \tau(\log M(u; w, p) - \log M(v; w, p)) \mid \tau \in [0, \tau_{\max}],\ p \in \mathbb{R},\ w \in \Delta_{d-1}\}$ when the weights are unknown. A natural loss function to consider in this case is the negative log-likelihood, and we consider PAC learnability with this loss. Using the framework developed in Section 3, we obtain the following PAC bounds:

**Theorem 5.1.** Given samples $\{((u_i, v_i), y_i)\}_{i=1}^n \sim D^n$, for the negative log-likelihood loss and any $\delta > 0$, with probability at least $1 - \delta$, where $\kappa = \log(u_{\max}/u_{\min})$ and $c > 0$ is a constant:
(a) If $w$ is known, then
$$R(w, \hat{p}) - \inf_{p \in \mathbb{R}} R(w, p) \le 16\tau_{\max}\kappa\sqrt{\frac{2\log 2 + 2\log n}{n}} + \frac{32\tau_{\max}\kappa\, c}{\sqrt{n}} + 3\sqrt{\frac{\log(2/\delta)}{2n}}.$$
(b) If $w$ is unknown, then
$$R(\hat{w}, \hat{p}) - \inf_{(w, p) \in \Delta_{d-1} \times \mathbb{R}} R(w, p) \le 16\tau_{\max}\kappa\sqrt{\frac{2\log 2 + 16(d \log_2 d + 1)\log n}{n}} + \frac{32\tau_{\max}\kappa\, c}{\sqrt{n}} + 3\sqrt{\frac{\log(2/\delta)}{2n}}.$$

We derive this result in detail in Appendix A.8. This gives us sample complexity bounds of $O(1)$ and $O(d \log d)$ with respect to $d$ for the known and unknown weights cases respectively, thus establishing PAC learnability.
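As an illustration of the noise model in Equation (1) and of the resulting negative log-likelihood objective, consider the following sketch (our own; it reuses `power_mean` from the Section 2 sketch, and the function names are hypothetical):

```python
import numpy as np

def noisy_comparison(u, v, w, p, tau, rng):
    """Sample a label y in {-1, +1} for the pair (u, v) under Equation (1):
    P(y = +1) = sigmoid(tau * (log M(u; w, p) - log M(v; w, p)))."""
    delta = np.log(power_mean(u, w, p)) - np.log(power_mean(v, w, p))
    return 1 if rng.random() < 1.0 / (1.0 + np.exp(-tau * delta)) else -1

def neg_log_likelihood(pairs, labels, w, p, tau):
    """Average negative log-likelihood of (w, p, tau) on labeled pairs;
    note that -log sigmoid(x) = log(1 + exp(-x))."""
    nll = 0.0
    for (u, v), y in zip(pairs, labels):
        delta = np.log(power_mean(u, w, p)) - np.log(power_mean(v, w, p))
        nll += np.log1p(np.exp(-y * tau * delta))
    return nll / len(labels)
```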
An important distinction between Theorem 4.3 and the above theorem is that Theorem 4.3 bounds risk with respect to the 0-1 loss, while the above theorem bounds risk with respect to the logistic loss, which is continuous and hence easier to control. Moreover, in the logistic case we estimate the noise level $\tau$ along with $w$ and $p$, whereas Theorem 4.3 is concerned with estimating $w$ and $p$ alone. As with the previous cases, non-convexity in this setting also makes global optimization with respect to $w$ and $p$ (and hence ERM) difficult. We observe that the logistic loss is quasilinear in $w$ for fixed $p$,³ and this observation can be used to construct an effective algorithm.

³ A detailed explanation of this fact is provided in Appendix A.2.2.

## 6 Empirical Results

We conduct several simulations on semi-synthetic data to gain additional insight into sample complexity and to demonstrate an empirically effective algorithm. The implementation also serves to demonstrate the practicability of our approach, including the availability of individual utility functions.

**Data.** The dataset we rely on (which is not publicly available) comes from the work of Lee et al. [11] with a US-based nonprofit that operates an on-demand donation transportation service supported by volunteers. WeBuildAI is a participatory framework that enables stakeholders, including donors, volunteers, recipient organizations, and nonprofit staff, to collaboratively design algorithmic policies for allocating donations. Donors provide food donations, volunteers transport the donations, recipient organizations receive and distribute the food, and dispatchers (nonprofit staff) manage the allocation and logistics. The actions are hundreds of recipient organizations that may receive an incoming donation. As part of this framework, Lee et al. [11] learned a (verifiably realistic) utility function over the actions for each of 18 stakeholders from the different groups, based on 8 features: travel time between donors and recipients, recipient organization size, USDA-defined food access levels in recipient neighborhoods, median household income, poverty rates, the number of weeks since the last donation, the total number of donations received in the last three months, and the type of donation (common or uncommon). In our simulations, we use the values of these stakeholder utility functions learned by Lee et al. [11] as the utility vectors. We fix a $p^*$ and a weight vector $w^*$ to generate the social welfare values $M(u; w^*, p^*)$. We use noisy versions of these social welfare values in the cardinal case, whereas noisy pairwise comparisons between random pairs of utility vectors are used in the ordinal case.

**Algorithm.** As noted in previous sections, the $\ell_2$ and logistic losses are quasiconvex with respect to $w$ for single samples when $p$ is fixed. Although a sum of quasiconvex functions is not guaranteed to be quasiconvex, we empirically observe that gradient descent on the loss function applied to the data can still converge to a minimum whose empirical risk is comparable to that of the true parameters. As our simulations show, this minimum increasingly resembles $w^*$ (the real weight vector) with decreasing noise. Thus, our algorithm consists of performing a grid search on $p$ and conducting gradient descent on $w$ for each $p$. We provide more details about the algorithm in Appendix B.
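The following is a condensed, hypothetical sketch of this procedure: a grid search over $p$, with projected gradient descent over $w$ on the simplex for each candidate $p$ (the projection follows Condat [5]). The learning rate, iteration count, and finite-difference gradient are illustrative choices of ours; the actual implementation details are in Appendix B.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex (cf. Condat [5])."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    ks = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / ks > 0)[0][-1] + 1
    return np.maximum(v - css[rho - 1] / rho, 0.0)

def numerical_grad(f, x, eps=1e-6):
    """Central finite-difference gradient; stands in for autodiff in this sketch."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

def fit(p_grid, empirical_risk, d, steps=300, lr=0.05):
    """Grid search over p; projected gradient descent over w for each p.

    `empirical_risk(w, p)` is the training l2 or logistic risk; returns the
    (w, p) pair with the lowest empirical risk over the grid."""
    best_w, best_p, best_risk = None, None, np.inf
    for p in p_grid:
        w = np.ones(d) / d  # start from equal weights
        for _ in range(steps):
            g = numerical_grad(lambda x: empirical_risk(x, p), w)
            w = project_to_simplex(w - lr * g)
        risk = empirical_risk(w, p)
        if risk < best_risk:
            best_w, best_p, best_risk = w, p, risk
    return best_w, best_p
```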
**Cardinal case.** We consider $p^* = 2.72$ and a random weight vector $w^*$. We then add Gaussian noise with standard deviation $(u_{i(d)} - u_{i(1)})\,\nu$ to each sample, where $\nu$ corresponds to the noise level; the Gaussian noise is clipped so that each noisy value stays within $[u_{i(1)}, u_{i(d)}]$. Finally, we learn $p$ and $w$ using our algorithm, and we present the results in Figure 1.

Figure 1: Results for the cardinal case as the number of samples grows (50 to 800): (a) test loss, (b) KL divergence $\mathrm{KL}(w^* \| \hat{w})$, (c) learned $p$. Different lines show results for different values of the added noise $\nu \in \{0.0, 0.001, 0.01, 0.1, 0.167\}$. Solid lines correspond to values for learned parameters, whereas dotted lines correspond to values for real parameters.

In Figure 1a, we observe that the test loss for the learned parameters decreases with decreasing noise and an increasing number of samples, and that it closely matches the test loss for the real parameters. In Figure 1b, we observe that the KL divergence between the true and learned weights decreases uniformly with decreasing noise and an increasing number of samples. This supports the fact that our algorithm is indeed able to find the correct minimum. We also plot the trend of the mean learned $p$ in Figure 1c, and we observe that the learned $p$ increasingly resembles the real $p$ with lower noise and a greater number of samples. Plots for the train loss and the loss on noiseless test data are provided in Appendix C.

**Ordinal case.** We consider $p^* = -1.62$ and a random weight vector $w^*$. We compare each sample in the considered training data with 10 other randomly chosen samples, with the comparisons noised according to the logistic noise model in Equation (1). We then learn $w$ and $\tau$ for each $p$ in the chosen grid and choose the best $p$. Our results are shown in Figure 2.
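For concreteness, the comparison-generation step just described can be sketched as follows (our illustration; `noisy_comparison` is the sampler from the Section 5 sketch):

```python
import numpy as np

def make_ordinal_dataset(U, w_true, p_true, tau, pairs_per_sample=10, seed=0):
    """Pair each utility vector with `pairs_per_sample` random others, labeling
    each pair via the logistic noise model of Equation (1)."""
    rng = np.random.default_rng(seed)
    n = len(U)
    pairs, labels = [], []
    for i in range(n):
        for j in rng.choice(n, size=pairs_per_sample, replace=False):
            if j == i:
                continue
            pairs.append((U[i], U[j]))
            labels.append(noisy_comparison(U[i], U[j], w_true, p_true, tau, rng))
    return pairs, labels
```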
Figure 2: Results for the ordinal case as the number of samples grows (500 to 8000): (a) test loss, (b) KL divergence $\mathrm{KL}(w^* \| \hat{w})$, (c) test accuracy on noiseless data. Different lines show results for different values of the noise level $\tau \in \{0.1, 1.0, 10.0, 40.0, 100.0\}$. Solid lines correspond to values for learned parameters, whereas dotted lines correspond to values for real parameters.

In Figure 2a, we observe that the test loss for the learned parameters matches that for the real parameters for small $\tau$ and a large number of training samples. The relative deviation between test losses progressively increases for smaller numbers of samples and smaller $\tau$. We note that small $\tau$ corresponds to more noise in the comparisons, which results in higher losses. However, the deviation between the learned loss and the true loss is smaller, as we are also estimating the noise parameter, which is easier to estimate for small $\tau$, since the logistic function has a larger gradient. We observe a uniform decrease in the KL divergence between $w^*$ and $\hat{w}$ for a larger number of samples and smaller $\tau$, again pointing to the effectiveness of the algorithm. We also observe that test accuracy on noiseless data increases with more samples and higher $\tau$. Interestingly, for $\tau = 0.1$ and $\tau = 1$, the test accuracy on noiseless data (Figure 2c) is significantly higher than that on (noisy) test data, another indicator of effective ERM being conducted by the algorithm. In Figure 4b in Appendix C, we observe greater variation in the learned $p$ compared to the cardinal case. A possible reason behind this is that changes in $p$ result in smaller changes in losses for negative $p$ than for positive $p$. This hypothesis is supported by simulations for the ordinal case conducted for $p^* = 1.62$, with results presented in Figure 5. In Figure 5e, we observe that the learned $p$ is much more consistent with the real $p$ as $\tau$ decreases. We also conduct simulations on fully synthetic data to study the effect of $d$, and we present the results in Appendix E. We verify the theoretical $O(d \log d)$ scaling of the error with unknown weights for the ordinal case in Figure 8.

## 7 Discussion

Our work has (at least) several limitations, which can inspire future work. First, as seen in Section 6, we are able to gain access to realistic utility vectors, in this case ones based on models that were learned from pairwise comparisons. Utilities are also routinely estimated for other economically motivated algorithms, say, Stackelberg security games [21]. However, these estimates are of course not completely accurate. It is an interesting direction for future work to extend our results to the setting where the utility vectors need to be estimated, either by an outside expert or using input from the individuals themselves. Although our experiments demonstrate convergence of the algorithm to the correct minimum, rigorous theoretical analysis of the nature of minima for the $\ell_2$ and logistic loss functions is still needed and could lead to algorithmic improvements. One issue is that scaling the algorithm to the national scale, $d = 10^8$, say, can be prohibitively expensive. Finally, our work only applies to weighted power mean functions. While we have argued that this family is both expressive and natural, it would be exciting to obtain results for even broader, potentially non-parametric families of social welfare functions.

The ability to learn social welfare functions can enable us to understand a decision maker's priorities and ideas of fairness, based on past decisions they have made. This has direct societal impact, as these notions can be used both to understand biases and to inform the design of improved fairness metrics. A second potential application is to imitate a successful decision maker's policy choices in future dilemmas or in other domains. This may pose some ethical questions if the learning model is misspecified; however, the restriction of the function class to weighted power means, which is inspired by natural social choice theory axioms, mitigates this risk.

## Acknowledgements

This research was partially supported by the National Science Foundation under grants IIS-2147187, IIS-2229881, and CCF-2007080; by the Office of Naval Research under grants N00014-20-1-2488, N00014-24-1-2704, and N00014-22-1-2181; and by the USDA under grant 2021-67021-35329.

## References

[1] Pathikrit Basu and Federico Echenique. On the falsifiability and learnability of decision theories. Theoretical Economics, 15(4):1279–1305, 2020.
[2] Girish S. Bhat and Carla D. Savage. Balanced Gray codes. The Electronic Journal of Combinatorics, 3(1):R25, 1996.
[3] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[4] Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39:324, 1952.
[5] Laurent Condat. Fast projection onto the simplex and the ℓ1 ball. Mathematical Programming, 158(1):575–585, 2016.
[6] Cyrus Cousins. An axiomatic theory of provably-fair welfare-centric machine learning. Advances in Neural Information Processing Systems, 34:16610–16621, 2021.
[7] Cyrus Cousins. Revisiting fair PAC learning and the axioms of cardinal welfare.
In International Conference on Artificial Intelligence and Statistics, pages 6422–6442, 2023.
[8] Ashish Goel, Reyna Hulett, and Benjamin Plaut. Markets beyond Nash welfare for Leontief utilities. In 15th International Conference on Web and Internet Economics, 2019.
[9] Graham J. O. Jameson. Counting zeros of generalised polynomials: Descartes' rule of signs and Laguerre's extensions. The Mathematical Gazette, 90(518):223–234, 2006.
[10] Gil Kalai. Statistical learnability and rationality of choice. Technical report, Institute of Mathematics and Center for Rationality, The Hebrew University of Jerusalem, 2001.
[11] Min Kyung Lee, Daniel Kusbit, Anson Kahng, Ji Tae Kim, Xinran Yuan, Allissa Chan, Daniel See, Ritesh Noothigattu, Siheon Lee, Alexandros Psomas, et al. WeBuildAI: Participatory framework for algorithmic governance. Proceedings of the ACM on Human-Computer Interaction, 3(CSCW):1–35, 2019.
[12] Tian Li, Maziar Sanjabi, Ahmad Beirami, and Virginia Smith. Fair resource allocation in federated learning. In 8th International Conference on Learning Representations, 2020.
[13] R. Duncan Luce. Individual Choice Behavior: A Theoretical Analysis. Dover, 2005.
[14] Vitalik Melnikov and Eyke Hüllermeier. Learning to aggregate: Tackling the aggregation/disaggregation problem for OWA. In Proceedings of The Eleventh Asian Conference on Machine Learning, volume 101 of Proceedings of Machine Learning Research, pages 1110–1125, 2019.
[15] Mehryar Mohri. Foundations of Machine Learning, 2018.
[16] Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep K. Ravikumar, and Ambuj Tewari. Learning with noisy labels. Advances in Neural Information Processing Systems, 26, 2013.
[17] Giovanni Pellegrini, Alessandro Tibo, Paolo Frasconi, Andrea Passerini, and Manfred Jaeger. Learning aggregation functions. In Proceedings of the 30th International Joint Conference on Artificial Intelligence, pages 2892–2898, 2021.
[18] Ariel D. Procaccia, Aviv Zohar, Yoni Peleg, and Jeffrey S. Rosenschein. The learnability of voting rules. Artificial Intelligence, 173(12):1133–1149, 2009.
[19] Feng Qi, Jia-Qiang Mei, Da-Feng Xia, and Sen-Lin Xu. New proofs of weighted power mean inequalities and monotonicity for generalized weighted mean values. Mathematical Inequalities and Applications, 3:377, 2000.
[20] Kevin W. S. Roberts. Interpersonal comparability and social choice theory. The Review of Economic Studies, 47(2):421–439, 1980.
[21] Milind Tambe, editor. Security and Game Theory: Algorithms, Deployed Systems, Lessons Learned. Cambridge University Press, 2012.
[22] M. Vidyasagar. Vapnik-Chervonenkis, pseudo- and fat-shattering dimensions. In Learning and Generalisation: With Applications to Neural Networks, pages 115–147, 2003.
## A Deferred Proofs

### A.1 Proof of Lemma 2.1

*Proof of Lemma 2.1 (a).* Let $p > 0$ and define $t_i = u_i/u_{\max}$ for all $i \in [d]$. Since $t_i \le 1$ for all $i \in [d]$ and $\sum_{i=1}^d w_i = 1$, it follows that $\sum_{i=1}^d w_i t_i^p \le \sum_{i=1}^d w_i = 1$, and therefore $\log\left(\sum_{i=1}^d w_i t_i^p\right) \le 0$. Given $p > q > 0$, the convexity of $x \mapsto x^{p/q}$ and Jensen's inequality give $\sum_{i=1}^d w_i t_i^p \ge \left(\sum_{i=1}^d w_i t_i^q\right)^{p/q}$, so we obtain

$$\log M(u; w, p) = \log u_{\max} + \frac{1}{p}\log\left(\sum_{i=1}^d w_i t_i^p\right) \ge \log u_{\max} + \frac{1}{q}\log\left(\sum_{i=1}^d w_i t_i^q\right) = \log M(u; w, q).$$

This derivation similarly holds for the case $0 > p > q$, demonstrating the monotonicity of $\log M(u; w, p)$ with respect to $p$. The continuity of $\log M(u; w, p)$ with respect to $p$ at $p = 0$ ensures the monotonicity for all $p \in \mathbb{R}$. Since $\log$ is a strictly increasing function, this guarantees the monotonicity of $M(u; w, p)$.

*Proof of Lemma 2.1 (b).* Since $\sum_{i=1}^d w_i = 1$, we express $w_d = 1 - \sum_{i=1}^{d-1} w_i$ and consider $d - 1$ free variables. For $p \neq 0$, we have

$$\log M(u; w, p) = \frac{1}{p}\log\left(\sum_{i=1}^{d-1} w_i u_i^p + \left(1 - \sum_{i=1}^{d-1} w_i\right) u_d^p\right),$$

so that for each $i \in [d-1]$,

$$\frac{\partial}{\partial w_i}\log M(u; w, p) = \frac{1}{p}\cdot\frac{u_i^p - u_d^p}{\sum_{j=1}^d w_j u_j^p}.$$

The denominator $\sum_{j=1}^d w_j u_j^p$ is positive, since it is a positive weighted sum of powers of utilities. We aim to show that $p^{-1}(u_i^p - u_d^p) > 0$ whenever $u_i > u_d$. If $p > 0$, then $u_i^p - u_d^p > 0$. Conversely, if $p < 0$, then $u_i^p - u_d^p < 0$, but since $p^{-1}$ is also negative, the product remains positive. Thus, if $u_i > u_d$, the log power mean increases with $w_i$. A similar argument shows that the log power mean decreases if $u_i < u_d$. For $p = 0$, we have

$$\frac{\partial}{\partial w_i}\log M(u; w, 0) = \frac{\partial}{\partial w_i}\left(\sum_{j=1}^{d-1} w_j \log u_j + \left(1 - \sum_{j=1}^{d-1} w_j\right)\log u_d\right) = \log u_i - \log u_d = \lim_{p \to 0}\frac{\partial}{\partial w_i}\log M(u; w, p).$$

This indicates that for $u_i > u_d$ the derivative is positive, implying an increase, and negative for $u_i < u_d$, implying a decrease. Thus, the function is monotonic in each $w_i$, $i \in [d-1]$. Since $\log$ is a strictly increasing function, this guarantees the monotonicity of $M(u; w, p)$.
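As a quick numerical sanity check of parts (a) and (b), independent of the proofs above, one can verify the claimed monotonicity on random instances (our own snippet, reusing `power_mean` from the Section 2 sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
u = rng.uniform(0.5, 5.0, size=6)
w = rng.dirichlet(np.ones(6))

# Lemma 2.1(a): M(u; w, p) is nondecreasing in p.
ps = np.concatenate([np.linspace(-25.0, -0.5, 50), [0.0], np.linspace(0.5, 25.0, 50)])
vals = [power_mean(u, w, p) for p in ps]
assert all(a <= b + 1e-9 for a, b in zip(vals, vals[1:]))

# Lemma 2.1(b): shifting weight from individual d to individual i increases
# M(u; w, p) iff u_i > u_d (checked here for i = 1, i.e., index 0).
w_eq = np.ones(6) / 6
w_shift = w_eq.copy()
w_shift[0] += 0.05
w_shift[-1] -= 0.05
assert (power_mean(u, w_shift, 1.7) > power_mean(u, w_eq, 1.7)) == (u[0] > u[-1])
```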
Proof of Lemma 2.1 (c). We prove that $\log M(u; w, p) - \log M(v; w, p)$ is quasilinear in $w$; the claim for $M(u; w, p)$ itself follows by setting $v = \mathbf{1}_d$. As noted in [3], the ratio of two linear functions is quasilinear when the denominator is strictly positive. Given that $\langle w, v^p\rangle > 0$ for all $w \in \Delta^{d-1}$, it follows that $f(w) = \langle w, u^p\rangle / \langle w, v^p\rangle$ is a quasilinear function. Furthermore, since $f(w) > 0$ for all $w \in \Delta^{d-1}$, and $x \mapsto x^{1/p}$ is monotone on $x > 0$ for $p \in \mathbb{R}\setminus\{0\}$, the expression $M(u; w, p)/M(v; w, p) = f(w)^{1/p}$ is also quasilinear. Because $g(x) = \log x$ is monotone, quasilinearity is preserved, which implies that $\log M(u; w, p) - \log M(v; w, p)$ is quasilinear as well.

A.2 Properties of Loss Functions

A.2.1 Quasiconvexity of the $\ell_2$ Loss

Since $M(u; w, p)$ is quasilinear in $w$ by Lemma 2.1 (c), $f(w) = M(u; w, p) - y$ is also quasilinear. For $w_1, w_2 \in \Delta^{d-1}$ and $\lambda \in [0, 1]$, we therefore have
$$\min\{f(w_1), f(w_2)\} \le f(\lambda w_1 + (1-\lambda) w_2) \le \max\{f(w_1), f(w_2)\},$$
which implies $f(\lambda w_1 + (1-\lambda) w_2)^2 \le \max\{f(w_1)^2, f(w_2)^2\}$. Thus, $f(w)^2 = (M(u; w, p) - y)^2$ is quasiconvex.

A.2.2 Quasilinearity of the Logistic Loss

By Lemma 2.1 (c), $x(w) = \log M(u; w, p) - \log M(v; w, p)$ is quasilinear. We consider two cases for the logistic loss. For $y = 1$, the loss is $-\log\sigma(x) = \log(1 + \exp(-x))$, a monotonic function of $x$, which preserves quasilinearity. For $y = 0$, the loss is $-\log(1 - \sigma(x)) = \log(1 + \exp(-x)) + x$, which is also monotonic and therefore preserves quasilinearity. Using these properties, we conclude that the logistic loss $-y\log\sigma(x) - (1-y)\log(1 - \sigma(x))$ is a quasilinear function of $w$.

A.3 Proof of Lemma 3.1

We differentiate between two representations of the empirical Rademacher complexity:
$$\hat R(\mathcal{F}) = \mathbb{E}_\epsilon\Big[\frac{1}{n}\sup_{f \in \mathcal{F}}\sum_{i=1}^n \epsilon_i f(x_i)\Big], \qquad \hat R_{\mathrm{abs}}(\mathcal{F}) = \mathbb{E}_\epsilon\Big[\frac{1}{n}\sup_{f \in \mathcal{F}}\Big|\sum_{i=1}^n \epsilon_i f(x_i)\Big|\Big].$$
It is clear that $\hat R(\mathcal{F}) \le \hat R_{\mathrm{abs}}(\mathcal{F})$. We now list a few results related to Pollard's pseudo-dimension.

Definition A.1 (Pseudo-shattering). Let $H$ be a set of real-valued functions on an input space $X$. We say $C = (x_1, \ldots, x_m)$ is pseudo-shattered if there exists a vector $r = (r_1, \ldots, r_m)$ such that for every $b = (b_1, \ldots, b_m) \in \{-1, 1\}^m$, there exists $h_b \in H$ with $\mathrm{sign}(h_b(x_i) - r_i) = b_i$.

Definition A.2 (Pseudo-dimension). The pseudo-dimension $\mathrm{Pdim}(H)$ is the cardinality of the largest set pseudo-shattered by $H$.

The following lemma connects pseudo-dimensions to VC dimensions:

Lemma A.3 (Mohri [15]). For every $h \in H$, define the binary function $B_h(x, r) = \mathrm{sign}(h(x) - r)$, and define $B_H = \{B_h : h \in H\}$. Then $VC(B_H) = \mathrm{Pdim}(H)$.

The following lemma bounds the covering number using the pseudo-dimension:

Lemma A.4 (Mohri [15]). For $\mathcal{F} \subseteq [0,1]^X$ with $\mathrm{Pdim}(\mathcal{F}) \le d$, $N(\epsilon, \mathcal{F}, d_n) \le (c/\epsilon)^{2d}$ for a constant $c > 0$, where $d_n(f, g) = \big(\frac{1}{n}\sum_{i=1}^n (f(x_i) - g(x_i))^2\big)^{1/2}$.

We also have the following covering number bound on the Rademacher complexity:

Lemma A.5 (Mohri [15]). For $\mathcal{F} \subseteq [0,1]^X$,
$$\hat R_{\mathrm{abs}}(\mathcal{F}) \le \inf_{\epsilon > 0}\left(\epsilon + \sqrt{\frac{2\log 2N(\epsilon, \mathcal{F}, d_n)}{n}}\right).$$

We now turn to bounding the complexity for the known and unknown weights cases.

Proof of Lemma 3.1 (a). The function class is $\mathcal{M}_{w,d} = \{M(\,\cdot\,; w, p) \mid p \in \mathbb{R}\}$, with $w$ fixed. From Lemma 2.1, for a fixed $u \in \mathbb{R}^d$, $M(u; w, p)$ is a non-decreasing function of $p$. Consequently, for any $y \in (u_{\min}, u_{\max})$ there exists a $p^* \in \mathbb{R} \cup \{\pm\infty\}$ such that $M(u; w, p) < y$ for all $p < p^*$, and $M(u; w, p) \ge y$ for all $p \ge p^*$. This implies that $B_M(u, y) = \mathrm{sign}(M(u; w, p) - y)$ changes its sign exactly once as $p$ increases. One point can therefore be pseudo-shattered (by choosing $p < p^*$ and $p > p^*$). However, for two points $u$ and $v$, a sign change occurs at most once for each point as $p$ increases, so at most 3 of the required 4 labelings can be achieved, and 2 points cannot be pseudo-shattered. Hence $\mathrm{Pdim}(\mathcal{M}_{w,d}) = 1$.
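As a quick empirical companion to this argument, the following sketch (our construction; it reuses the hypothetical `power_mean` helper from the sketch above, and the grid bounds are our choices) verifies the single-crossing property on random instances: the sign of $M(u; w, p) - y$ changes at most once along a grid of $p$ values.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
for _ in range(100):
    u = rng.uniform(1.0, 10.0, size=d)
    w = rng.dirichlet(np.ones(d))          # fixed, known weights
    y = rng.uniform(u.min() + 1e-6, u.max() - 1e-6)
    signs = np.sign([power_mean(u, w, p) - y for p in np.linspace(-20, 20, 2001)])
    # Count strict sign flips between consecutive grid points.
    changes = int(np.sum(signs[:-1] * signs[1:] < 0))
    assert changes <= 1  # sign(M(u; w, p) - y) crosses zero at most once in p
```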
Proof of Lemma 3.1 (b). The function class is $\mathcal{M}_d = \{M(\,\cdot\,; w, p) \mid (w, p) \in \Delta^{d-1} \times \mathbb{R}\}$. We note that
$$B_M(u, y) = \mathrm{sign}(M(u; w, p) - y) = \mathrm{sign}(\log M(u; w, p) - \log y) = \mathrm{sign}(\log M(u; w, p) - \log M(y\,\mathbf{1}_d; w, p)),$$
which is exactly the expression in the noiseless comparison-based setup for the unknown weights case. We show in Lemma 4.1 (b) that the VC dimension for this expression is upper bounded by $8(d\lceil\log_2 d\rceil + 1)$; by Lemma A.3, the same bound applies to the pseudo-dimension. Thus, our result is proved.

A.4 Rademacher Complexity Bound for the Cardinal Case

Lemma A.6. (a) If $w$ is known, then
$$\hat R(\mathcal{M}_{w,d}) \le u_{\max}\left(\sqrt{\frac{2\log 2 + 2\log n}{n}} + \frac{c}{\sqrt n}\right).$$
(b) If $w$ is unknown, then
$$\hat R(\mathcal{M}_d) \le u_{\max}\left(\sqrt{\frac{2\log 2 + 16(d\lceil\log_2 d\rceil + 1)\log n}{n}} + \frac{c}{\sqrt n}\right),$$
where $c > 0$ is a constant.

Proof. We prove the result for the case of unknown weights; the result for known weights follows by replacing the pseudo-dimension bound of $\mathcal{S}_d$ with that of $\mathcal{S}_{w,d}$ from Lemma 3.1. Let $d_p$ denote the pseudo-dimension in the unknown weights case, and let $r_i = u_i / u_{\max}$, so that $M(u_i; w, p) = u_{\max} M(r_i; w, p)$ and the normalized class $\mathcal{S}_d = \{M(r;\,\cdot\,,\cdot)\}$ takes values in $[0, 1]$. Then
$$\hat R(\mathcal{M}_d) = \mathbb{E}_\epsilon\Big[\frac{1}{n}\sup_{(w,p)}\sum_{i=1}^n \epsilon_i M(u_i; w, p)\Big] = u_{\max}\,\mathbb{E}_\epsilon\Big[\frac{1}{n}\sup_{(w,p)}\sum_{i=1}^n \epsilon_i M(r_i; w, p)\Big] \le u_{\max}\,\mathbb{E}_\epsilon\Big[\frac{1}{n}\sup_{(w,p)}\Big|\sum_{i=1}^n \epsilon_i M(r_i; w, p)\Big|\Big] = u_{\max}\,\hat R_{\mathrm{abs}}(\mathcal{S}_d).$$
From Lemma A.4 and Lemma A.5, and given that $M(r; w, p) \in [0, 1]$, we have $N(\epsilon, \mathcal{S}_d, d_n) \le (c/\epsilon)^{2d_p}$ and
$$\hat R_{\mathrm{abs}}(\mathcal{S}_d) \le \inf_{\epsilon > 0}\left(\epsilon + \sqrt{\frac{2\log 2 + 4 d_p \log(c/\epsilon)}{n}}\right) \le \sqrt{\frac{2\log 2 + 2 d_p \log n}{n}} + \frac{c}{\sqrt n} \qquad (\text{setting } \epsilon = c/\sqrt n),$$
thus giving
$$\hat R(\mathcal{M}_d) \le u_{\max}\left(\sqrt{\frac{2\log 2 + 2 d_p \log n}{n}} + \frac{c}{\sqrt n}\right).$$
Substituting $d_p = 8(d\lceil\log_2 d\rceil + 1)$ yields the required bound.

We observe that the above lemma provides bounds of $O(\sqrt{\log(n)/n})$ for known weights and $O(\sqrt{d\log(d)\log(n)/n})$ for unknown weights on the Rademacher complexity. Notably, these bounds scale with $u_{\max}$, indicating that the complexity of the function class grows as the maximum utility value increases.

A.5 Proof of Theorem 3.2

Proof. We prove the result for the case of unknown weights; the result for known weights follows a similar argument. For the $\ell_2$ loss, the function class is
$$\mathcal{L}_2 = \big\{\ell_2(M(u; w, p), y) = (y - M(u; w, p))^2 \mid (w, p) \in \Delta^{d-1} \times \mathbb{R}\big\}.$$
Since both $M(u; w, p)$ and $y$ lie within the range $[u_{(1)}, u_{(d)}] \subseteq [u_{\min}, u_{\max}]$, we have $y - M(u; w, p) \in [u_{\min} - u_{\max}, u_{\max} - u_{\min}]$. Over a bounded range $[-\gamma, \gamma]$, the function $\ell_2(t, y) = (t - y)^2$ is $2\gamma$-Lipschitz continuous with respect to $t$. Thus, using Talagrand's contraction lemma and Lemma A.6, we get $\hat R(\mathcal{L}_2) = \hat R(\ell_2 \circ \mathcal{M}_d) \le 2(u_{\max} - u_{\min})\,\hat R(\mathcal{M}_d)$. We then use the uniform convergence bounds for Rademacher complexity: with probability at least $1 - \delta$,
$$\sup_{(w,p) \in \Delta^{d-1}\times\mathbb{R}}\big|\hat R_n(w, p) - R(w, p)\big| \le 8(u_{\max} - u_{\min})\,\hat R(\mathcal{M}_d) + 3\sqrt{\frac{\log(2/\delta)}{2n}} =: \epsilon,$$
so that
$$R(\hat w, \hat p) - R(w^*, p^*) = \hat R_n(\hat w, \hat p) - \hat R_n(w^*, p^*) + \hat R_n(w^*, p^*) - R(w^*, p^*) + R(\hat w, \hat p) - \hat R_n(\hat w, \hat p) \le 0 + \epsilon + \epsilon = 2\epsilon = 16(u_{\max} - u_{\min})\,\hat R(\mathcal{M}_d) + 6\sqrt{\frac{\log(2/\delta)}{2n}}.$$
Finally, substituting $\hat R(\mathcal{M}_d)$ from Lemma A.6 provides the required bounds.
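As an aside, the quantity bounded in Lemma A.6 can be estimated directly. The sketch below (our construction, assuming NumPy and the `power_mean` helper from the earlier sketch) Monte Carlo-estimates the empirical Rademacher complexity with the supremum restricted to a finite grid of $(w, p)$ candidates; restricting the supremum to a grid only lower-bounds the true supremum, so this is a sanity check rather than a certified bound.

```python
import numpy as np

def empirical_rademacher(U, candidates, rng, n_draws=500):
    """Estimate E_eps[(1/n) sup_(w,p) sum_i eps_i * M(u_i; w, p)],
    with the supremum restricted to the finite candidate set."""
    n = len(U)
    # Precompute the n function values for each candidate (w, p).
    vals = np.array([[power_mean(u, w, p) for u in U] for (w, p) in candidates])
    draws = [np.max(vals @ rng.choice([-1.0, 1.0], size=n)) / n
             for _ in range(n_draws)]
    return float(np.mean(draws))

rng = np.random.default_rng(0)
d, n = 5, 50
U = rng.uniform(1.0, 10.0, size=(n, d))
candidates = [(rng.dirichlet(np.ones(d)), p) for p in np.linspace(-5.0, 5.0, 21)]
print(empirical_rademacher(U, candidates, rng))  # scales with u_max, shrinks as n grows
```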
A.6 Proof of Lemma 4.1

First, we state a lemma from Jameson [9]:

Lemma A.7 (Jameson [9], Theorem 4.6). Let $f : \mathbb{R} \to \mathbb{R}$ be defined as $f(x) = \sum_{i=1}^n a_i \exp(b_i x)$, where $b_1 > b_2 > \cdots > b_n$ and $\sum_{i=1}^n a_i = 0$. Define $A_j := \sum_{i=1}^j a_i$ and denote by $S(A_n)$ the number of sign changes in the sequence $\{A_i\}_{i=1}^n$. Then, the number of unique zeros of $f$ is at most $S(A_n) + 1$.

Consider the function $f(p) = \sum_{i=1}^d w_i u_i^p - \sum_{i=1}^d w_i v_i^p$ for $u, v \in \mathbb{R}^d$ with disjoint entries:
$$f(p) = \sum_{i=1}^d w_i \exp(p \log u_i) - \sum_{i=1}^d w_i \exp(p \log v_i).$$
Applying Lemma A.7, if $w_i = w$ for all $i$, the sequence $\{A_j\}$, consisting of partial sums of terms $\pm w$, can have at most $d - 1$ sign changes: a sign change at index $k$ implies $A_{k-1} = 0$, and the next sign change cannot occur before index $k + 2$. Therefore, $f(p)$ has at most $d$ zeros in this case. In the general case, where $w_i \neq w_j$ for some $i \neq j$, a sign change in $\{A_j\}$ can occur at any index except the first and the last. Thus, $f(p)$ can have at most $2d - 1$ roots, as sign changes are possible at all intermediate indices. We conclude that $M_p(u; w, p) - M_p(v; w, p)$, defined over $\mathbb{R} \cup \{\pm\infty\}$, can change sign as a function of $p$ at most $d - 1$ times if $w_i = \frac{1}{d}$, and up to $2d - 1$ times in the general case.

Lemma A.8. Let $r, q : \mathbb{R} \to \mathbb{R}$ be two polynomials such that $c := r(x) - q(x)$ is constant for all $x \in \mathbb{R}$, and the sets of roots $\{x_1, \ldots, x_d\}$ and $\{y_1, \ldots, y_d\}$ of $r$ and $q$, respectively, are disjoint, positive and of size $d$. Then, for $k = 0, 1, \ldots, d-1$, the $k$-th power sums of the roots are equal, i.e.,
$$\sum_{i=1}^d x_i^k = \sum_{i=1}^d y_i^k.$$

Proof. Given that $r - q = c$, both polynomials have the same non-constant coefficients. According to Vieta's formulas, the $k$-th elementary symmetric polynomial $e_k(x)$ is the sum of all products of $k$ distinct $x_i$'s, and similarly $e_k(y)$ for $y$; since $r - q = c$, we have $e_k(x) = e_k(y)$ for all $k \le d - 1$. Newton's identities relate the elementary symmetric polynomials and the power sums as follows:
$$k\,e_k(x) = \sum_{i=1}^k (-1)^{i-1} e_{k-i}(x)\,p_i(x), \qquad k\,e_k(y) = \sum_{i=1}^k (-1)^{i-1} e_{k-i}(y)\,p_i(y),$$
where $p_i(x)$ is the $i$-th power sum of the roots. Given that $e_k(x) = e_k(y)$, we can equate the right-hand sides of the above identities; by induction over $k$, using the recursive nature of Newton's identities and the equality of the elementary symmetric polynomials for all $k \le d - 1$, this yields $p_k(x) = p_k(y)$ for all $k = 0, \ldots, d-1$, which concludes the proof.

Given $f(p)$ as defined above, Lemma A.8 implies that there exist disjoint $u, v \in \mathbb{R}^d$ such that $M_p(u; w, p) = M_p(v; w, p)$ for $d - 1$ unique values of $p$ when $w_i = 1/d$. Moreover, suppose that for a set $\{p_i\}_{i \in [d]}$ there exist $u, v \in \mathbb{R}^d$ and $w \in \Delta^{d-1}$ such that $M_p(u; w, p_i) = M_p(v; w, p_i)$. Then, for any $\lambda > 0$, there exist $u', v' \in \mathbb{R}^d$ such that $M_p(u'; w, \lambda p_i) = M_p(v'; w, \lambda p_i)$.

Lemma A.9 (Jameson [9], Theorem 3.4). For any $k < d$ and $p_1 < \cdots < p_k \in \mathbb{R}$, there exist $u, v \in \mathbb{R}_+^d$ and $w \in \Delta^{d-1}$ such that $M_p(u; w, p_i) = M_p(v; w, p_i)$ for each $i \le k$. Furthermore, the difference $M_p(u; w, p) - M_p(v; w, p)$ does not change sign within any interval $(p_i, p_{i+1})$.

We now proceed to the proof of Lemma 4.1, which bounds the VC dimensions of the function classes $\mathcal{C}_{w,d}$ and $\mathcal{C}_d$.

Proof of Lemma 4.1 (a). As $f(p)$ has at most $2d - 1$ roots, there can be at most $2d - 1$ sign changes as $p$ varies from $-\infty$ to $\infty$. Consequently, the hypothesis class defined by all $p$ (denoted $\mathcal{M}_{w,d}$) is a subset of the hypothesis class consisting of all sign patterns with at most $2d - 1$ sign changes on the real line; this larger class is denoted $H_d$, and $VC(\mathcal{M}_{w,d}) \le VC(H_d)$. Consider $m$ samples $\{u_i, v_i\}_{i=1}^m$. For each sample, a sign change occurs at most $2d - 1$ times as $p$ increases, and hence the total number of changes in labeling over the entire real line is bounded by $(2d-1)m$ (each change in labeling corresponds to a change in sign for at least one of the samples). This implies that the total number of possible labelings is at most $(2d-1)m + 1$. If a set of $m$ samples is shattered, this upper bound must be at least as large as the number of labelings required, i.e., $(2d-1)m + 1 \ge 2^m$. We can show that $m = 2(\lceil\log_2 d\rceil + 1)$ points cannot be shattered:
$$
\begin{aligned}
2^m - (2d-1)m - 1 &= 2^{2(\lceil\log_2 d\rceil + 1)} - 2(2d-1)(\lceil\log_2 d\rceil + 1) - 1\\
&\ge 2^{2(\log_2 d + 1)} - 4d\lceil\log_2 d\rceil + 2\lceil\log_2 d\rceil - 4d + 1\\
&= 4d^2 - 4d\lceil\log_2 d\rceil + 2\lceil\log_2 d\rceil - 4d + 1\\
&= 4d(d - \lceil\log_2 d\rceil) - 4d + 2\lceil\log_2 d\rceil + 1\\
&\ge 4d - 4d + 2\lceil\log_2 d\rceil + 1 \qquad (\text{since } d - \lceil\log_2 d\rceil \ge 1 \text{ for } d \in \mathbb{N})\\
&= 2\lceil\log_2 d\rceil + 1 > 0.
\end{aligned}
$$
Thus, $m = 2(\lceil\log_2 d\rceil + 1)$ points cannot be shattered, meaning that $VC(H_d) < 2(\lceil\log_2 d\rceil + 1)$.

We now bound the VC dimension for the unknown weights case. Consider $p \neq 0$. In this case, a hypothesis $C((u, v); w, p)$ can be expressed as
$$\mathrm{sign}\Big(\tfrac{1}{p}\log\sum_{i=1}^d w_i u_i^p - \tfrac{1}{p}\log\sum_{i=1}^d w_i v_i^p\Big) = \mathrm{sign}(p)\,\mathrm{sign}\Big(\log\sum_{i=1}^d w_i u_i^p - \log\sum_{i=1}^d w_i v_i^p\Big) = \mathrm{sign}(p)\,\mathrm{sign}(\langle w, u^p - v^p\rangle) = \mathrm{sign}\big(\langle w, \mathrm{sign}(p)(u^p - v^p)\rangle\big),$$
where $u^p = (u_1^p, \ldots, u_d^p)^T$. Thus, for a fixed $p$, the set of viable $w$'s spans a halfspace. We note that each component of $\mathrm{sign}(p)(u^p - v^p)$ is continuous, which means that $\langle w, \mathrm{sign}(p)(u^p - v^p)\rangle$ is a continuous function of $w$ and $p$.

For $n > d$ samples $\{((u_i, v_i), y_i)\}_{i=1}^n$, we define $h_i(p) = \mathrm{sign}(p)(u_i^p - v_i^p)$ for $i \in [n]$ and $p \neq 0$. For a fixed $p$, the set of possible labelings for $w \in \Delta^{d-1}$ is a subset of the set of possible labelings for $w \in \mathbb{R}^d$, which in turn is the set of labelings generated by $n$ hyperplanes. Since this problem has VC dimension $d$, the number of possible labelings for a fixed $p$ is upper bounded by $(n+1)^d$. Let $B(p)$ denote the set of possible labelings for the hyperplanes defined by $\{h_i(p)\}_{i=1}^n$ at a particular $p$.

Lemma A.10. Let $p_1$ and $p_2$ have the same sign, with a labeling $\ell \in \{-1, 1\}^n$ such that $\ell \in B(p_1)$ but $\ell \notin B(p_2)$. Then there is a $p \in [p_1, p_2]$ at which there is a set of $d$ linearly dependent vectors $h^{(1)}(p), \ldots, h^{(d)}(p)$.

Proof. Let $\ell$ be the labeling which is in $B(p_1)$ but not in $B(p_2)$. Since the labeling is not in $B(p_2)$, for each $w$ there is some hyperplane $h_i(p_2)$ such that $\ell_i \langle w, h_i(p_2)\rangle < 0$. Since the labeling is in $B(p_1)$, there is some $w$ such that $\ell_i \langle w, h_i(p_1)\rangle \ge 0$ for every $i \in [n]$. Let $B \subset \mathbb{R}^d$ denote the unit hypersphere around the origin; since the labelings are invariant to the scale of $w$, the set of possible labelings for $w \in \mathbb{R}^d$ is exactly the set of possible labelings for $w \in B$. Consider the quantity
$$m(p) = \max_{w \in B}\,\min_{i \in [n]}\,\ell_i\langle w, h_i(p)\rangle.$$
If $m(p) < 0$, then for each $w$ there is some $i \in [n]$ such that $\ell_i\langle w, h_i(p)\rangle < 0$, i.e., the labeling is not attained at any point; on the other hand, if $m(p) \ge 0$, there is some $w$ at which the labeling is attained. Since $\ell_i\langle w, h_i(p)\rangle$ is continuous in $w$ and $p$ for every $i \in [n]$, so is $\min_i \ell_i\langle w, h_i(p)\rangle$, and thus $m(p)$ is a continuous function of $p$. Using this fact and the intermediate value theorem, there must be some $p \in [p_1, p_2]$ such that $m(p) = 0$.

Let $w \in B$ be a vector at which $m(p) = 0$ is attained; absorbing the labels into the vectors, we write $h_i(p)$ for $\ell_i h_i(p)$ in what follows. We now show that at this $p$, at least $d$ of the $n$ vectors $\{h_i(p)\}_{i=1}^n$ are linearly dependent. Suppose this were not the case, i.e., every set of $d$ vectors is linearly independent. Then at most $d - 1$ of the vectors lie on the hyperplane $\{x : \langle w, x\rangle = 0\}$. Let $h^{(1)}(p), \ldots, h^{(k)}(p)$ denote these vectors, with $k \le d - 1$; since $m(p) = 0$, there is at least one such vector. Let $H \in \mathbb{R}^{k \times d}$ be the matrix with these vectors as its rows. If these $k$ vectors are linearly dependent, we can add any of the remaining $n - k$ vectors to obtain a set of $d$ linearly dependent vectors, so consider the case where they are linearly independent. The system $\{x : Hx = \mathbf{1}_k\}$ is underdetermined, and its solution set is non-empty (by the linear independence of the $k$ vectors). Let $x_0$ be a vector in this set, and let $t > \max_{j \in [n]} \frac{\langle w, h_j(p)\rangle}{\langle x_0, h_j(p)\rangle}$. We have
$$\langle w + t x_0, h_i(p)\rangle = \langle w, h_i(p)\rangle + t\langle x_0, h_i(p)\rangle > \langle w, h_i(p)\rangle - \max_{j \in [n]}\frac{\langle w, h_j(p)\rangle}{\langle x_0, h_j(p)\rangle}\,\langle x_0, h_i(p)\rangle \ge \langle w, h_i(p)\rangle - \frac{\langle w, h_i(p)\rangle}{\langle x_0, h_i(p)\rangle}\,\langle x_0, h_i(p)\rangle = 0.$$
Thus, $w + t x_0$ is a point at which $\langle w + t x_0, h_i(p)\rangle > 0$ for all $i \in [n]$. This means that $m(p) > 0$, which is a contradiction.
Intuitively, this means that if every set of $d$ vectors in $\{h_i(p)\}_{i=1}^n$ is linearly independent, then $m(p) > 0$; thus, there must be a set of $d$ linearly dependent vectors $h^{(1)}(p), \ldots, h^{(d)}(p)$. From the above lemma, we conclude that any change in the set of labelings is accompanied by a $p$ at which some $d$ vectors are linearly dependent.

Proof of Lemma 4.1 (b). Using the lemma above, to bound the number of possible labelings we first bound the number of values of $p$ at which there are $d$ linearly dependent vectors. Consider a set of $d$ vectors $h^{(1)}(p), \ldots, h^{(d)}(p)$. If these vectors are linearly dependent, the determinant of the matrix with these vectors as rows must be zero:
$$\det\begin{pmatrix} \mathrm{sign}(p)(u_{11}^p - v_{11}^p) & \cdots & \mathrm{sign}(p)(u_{1d}^p - v_{1d}^p)\\ \vdots & \ddots & \vdots\\ \mathrm{sign}(p)(u_{d1}^p - v_{d1}^p) & \cdots & \mathrm{sign}(p)(u_{dd}^p - v_{dd}^p)\end{pmatrix} = 0.$$
Upon expanding the determinant, we get an equation of the form $\sum_{i=1}^m a_i \mu_i^p = 0$ with $m = 2^d d!$ terms. By Lemma A.7, this equation has at most $2^d d! - 1$ roots; counting the original configuration as well, we get at most $2^d d!$ possible configurations. The choice of the $d$ vectors can be made in $\binom{n}{d}$ ways, and hence the number of possible changes is bounded by $2^d d!\binom{n}{d}$. In the worst case, all labelings change at each such point, giving an upper bound of $(n+1)^d\, 2^d d!\binom{n}{d}$ on the total number of labelings. At the beginning of the proof, we carefully set aside $p = 0$; we now observe that $p = 0$ is a root of the above system of equations, so any possible changes at $p = 0$ are implicitly accounted for.

We can show that $n = 8(d\lceil\log_2 d\rceil + 1)$ points cannot be shattered. We have
$$(n+1)^d\, 2^d d!\binom{n}{d} = (n+1)^d\, 2^d\, n(n-1)\cdots(n-d+1) \le 2^{d-1} n^{2d}$$
for the relevant values of $n$ and $d$, so it suffices to show that $2^{d-1} n^{2d} < 2^n$, i.e., that $n - d - 2d\log_2 n + 1 > 0$, for $n = 8(d\lceil\log_2 d\rceil + 1)$. For $d \in \{1, 2\}$, this statement can be verified directly. Therefore, it suffices to show that
$$f(x) := 8(x\lceil\log_2 x\rceil + 1) - 2x\log_2\big[8(x\lceil\log_2 x\rceil + 1)\big] - x + 1 > 0$$
for any $x \ge 3$. We have
$$
\begin{aligned}
f(x) &> 8x\log_2 x - 2x\log_2(16x\log_2 x) - x + 9 \qquad (\text{using } x\lceil\log_2 x\rceil + 1 \le 2x\log_2 x \text{ for } x \ge 3)\\
&= 6x\log_2 x - 2x\log_2(\log_2 x) - 9x + 9\\
&> 4x\log_2 x - 9x + 9 =: g(x).
\end{aligned}
$$
We note that $g(3) = 12\log_2 3 - 18 > 1 > 0$. Moreover, $g'(x) = 4\log_2 x + 4/\ln 2 - 9$, with $g'(2) = 4/\ln 2 - 5 > 0.77 > 0$ and $g'$ increasing. Thus, $f(x) > g(x) > 0$ for all $x \ge 3$, which proves our bound. This implies that the VC dimension is bounded above by $8(d\lceil\log_2 d\rceil + 1)$.

Proof of Lemma 4.1 (c). Take $m = \lfloor\log_2 d\rfloor + 1$. Let $\sigma_1, \ldots, \sigma_{2^m} \in \{-1, 1\}^m$ be a Gray code ordering of $\{-1, 1\}^m$, such that two successive values have a Hamming distance of 1, and the number of changes in each bit position is at most $2^{m-2} \le d$ (for the existence of such an ordering, see [2]). For $i < 2^m$, denote by $s_i \in [m]$ the bit position in which $\sigma_i$ and $\sigma_{i+1}$ differ. Applying Lemma A.9, there exist a sample $\{u^j, v^j\}_{j=1}^m$ and $p_1 < \cdots < p_{2^m - 1}$ such that $M_p(u^j; w, p_i) = M_p(v^j; w, p_i)$ whenever $s_i = j$. Furthermore, define $p_0 = -\infty$, and note that each interval $(p_i, p_{i+1})$ for $0 \le i < 2^m$ corresponds to a unique labeling of $\{u^j, v^j\}_{j=1}^m$; hence all $2^m$ labelings are realized, and $m$ points are shattered.

When the weights are unknown, we observe that $p = 1$ is similar to the case of linear classification with the constraint of positive weights and no bias term, since
$$C((u, v); w, 1) = \mathrm{sign}(\log M(u; w, 1) - \log M(v; w, 1)) = \mathrm{sign}(M(u; w, 1) - M(v; w, 1)) = \mathrm{sign}\Big(\sum_{i=1}^d w_i(u_i - v_i)\Big).$$
Since linear classification with positive weights and no bias term has a VC dimension of $d - 1$, this is a lower bound for the VC dimension of $\mathcal{C}_d$.
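As an empirical companion to the root-counting bound used throughout this proof, the following sketch (our construction, assuming NumPy; the grid and ranges are our choices) counts sign changes of $f(p) = \langle w, u^p\rangle - \langle w, v^p\rangle$ over a fine grid for random instances with general weights, and checks that they never exceed $2d - 1$:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
ps = np.linspace(-15.0, 15.0, 30001)
for _ in range(200):
    u = rng.uniform(0.5, 2.0, size=d)   # disjoint positive entries (a.s.)
    v = rng.uniform(0.5, 2.0, size=d)
    w = rng.dirichlet(np.ones(d))       # general (unequal) weights
    # f(p) = sum_i w_i u_i^p - sum_i w_i v_i^p, evaluated on the grid.
    f = np.power.outer(u, ps).T @ w - np.power.outer(v, ps).T @ w
    signs = np.sign(f)
    changes = int(np.sum(signs[:-1] * signs[1:] < 0))
    assert changes <= 2 * d - 1
```

Most random draws exhibit far fewer crossings; the $2d - 1$ worst case is what drives the VC upper bound.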
A.7 Proof of Theorem 4.3

Proof. We prove the result for unknown weights, with the known weights result following similar steps. We consider the function class $\mathcal{C}_d$ as in Section 4, with the 0-1 loss $\ell_{0\text{-}1}(t, y) = (1 - ty)/2$ for $t, y \in \{-1, 1\}$. We observe that $\ell_{0\text{-}1}$ is $1/2$-Lipschitz with respect to $t$. Thus, applying Theorem 3 of Natarajan et al. [16], the excess risk with respect to $\ell_{0\text{-}1}$ on the noiseless data distribution satisfies, with probability at least $1 - \delta$,
$$R(\hat w, \hat p) - R(w^*, p^*) \le 4L_\rho\,\hat R(\mathcal{C}_d) + 2\sqrt{\frac{\log(1/\delta)}{2n}}, \tag{2}$$
where $L_\rho = (1 + |\rho_{+1} - \rho_{-1}|)\,L/(1 - \rho_{+1} - \rho_{-1})$. Here, $\rho_{+1}$ and $\rho_{-1}$ are the probabilities of mislabeling true positive and true negative examples, which in our case are both equal to $\rho$; thus $L_\rho = 1/(2(1 - 2\rho))$. We bound $\hat R(\mathcal{C}_d)$ using the VC bound on Rademacher complexity:
$$\hat R(\mathcal{C}_d) \le \sqrt{\frac{16(d\lceil\log_2 d\rceil + 1)\log(n+1)}{n}}.$$
Substituting this into Equation (2) concludes our proof.

A.8 Proof of Theorem 5.1

Proof. We prove the result for the case of unknown weights; the known weights case follows similar steps. First, we establish a bound on the Rademacher complexity of
$$\mathcal{T}_d = \big\{(u, v) \mapsto \tau\,(\log M(u; w, p) - \log M(v; w, p)) \mid (w, p, \tau) \in \Delta^{d-1} \times \mathbb{R} \times [0, \tau_{\max}]\big\}.$$
Each $\log M(u_i; w, p)$ can be written as $\log u_{i(1)}$ plus $\kappa$ times a normalized log power mean lying in $[0, 1]$ (a member of the class $\mathcal{S}_d$ from Appendix A.4), and similarly for $\log M(v_i; w, p)$; the additive constants contribute zero in expectation over the Rademacher variables. Splitting the supremum over the difference into two suprema and using the fact that $-\epsilon_i$ has the same distribution as $\epsilon_i$, we obtain
$$\hat R(\mathcal{T}_d) \le \mathbb{E}_\epsilon\Big[\frac{1}{n}\sup\sum_{i=1}^n \epsilon_i\,\tau\log M(u_i; w, p)\Big] + \mathbb{E}_\epsilon\Big[\frac{1}{n}\sup\sum_{i=1}^n \epsilon_i\,\tau\log M(v_i; w, p)\Big] \le 2\tau_{\max}\kappa\,\hat R_{\mathrm{abs}}(\mathcal{S}_d).$$
We now apply the bound on $\hat R_{\mathrm{abs}}(\mathcal{S}_d)$ from Appendix A.4 to obtain
$$\hat R(\mathcal{T}_d) \le 2\tau_{\max}\kappa\left(\sqrt{\frac{2\log 2 + 16(d\lceil\log_2 d\rceil + 1)\log n}{n}} + \frac{c}{\sqrt n}\right).$$
Finally, we use the uniform convergence bounds derived from Rademacher complexity to establish the PAC bound: with probability at least $1 - \delta$,
$$R(\hat w, \hat p) - R(w^*, p^*) = \hat R_n(\hat w, \hat p) - \hat R_n(w^*, p^*) + \hat R_n(w^*, p^*) - R(w^*, p^*) + R(\hat w, \hat p) - \hat R_n(\hat w, \hat p) \le 0 + \epsilon + \epsilon = 2\epsilon = 8\kappa\,\hat R(\mathcal{T}_d) + 6\sqrt{\frac{\log(2/\delta)}{2n}}.$$

B Algorithm

Our algorithm consists of two nested steps: the outer step chooses $p$, and the inner step runs gradient descent on $w$ (and, in the ordinal case, $\tau$) to obtain the empirically optimal values $\hat w$ and $\hat p$. In our experiments we choose $p$ using grid search, though the optimization over $p$ could also be carried out by other methods such as simulated annealing. We minimize the $\ell_2$ loss in the cardinal case with the weighted power mean, and the logistic loss in the ordinal case with the log weighted power mean. The algorithm's pseudocode is presented in Algorithm 1.

Algorithm 1 ERM algorithm for weighted power mean-based optimization
Require: $D = \{(x_i, y_i)\}_{i=1}^n$
  $\hat w \leftarrow \mathbf{1}_d / d$; $\hat p \leftarrow 0$; $v_{\text{best}} \leftarrow \infty$
  for $p \in \{p_{\text{lower}}, p_{\text{lower}} + \epsilon, \ldots, p_{\text{upper}} - \epsilon, p_{\text{upper}}\}$ do
    $w \leftarrow \arg\min_{w'} \frac{1}{n} \sum_{i=1}^n \ell(M(u_i; w', p), y_i)$
    $v \leftarrow \frac{1}{n} \sum_{i=1}^n \ell(M(u_i; w, p), y_i)$
    if $v < v_{\text{best}}$ then
      $\hat w \leftarrow w$; $\hat p \leftarrow p$; $v_{\text{best}} \leftarrow v$
    end if
  end for
  Return $\hat w$, $\hat p$, $v_{\text{best}}$

Note that in the ordinal case, we optimize over $\tau$ along with $w$. For our experiments, we set $p_{\text{lower}} = -3.5$ and $p_{\text{upper}} = 3.5$, with a grid resolution of $\epsilon = 0.1$. Since the objective is not convex, we use several tricks to ensure quick convergence (a code sketch of the projection and clipping steps appears at the end of this subsection):

- While we use Algorithm 1 from [5] for projection onto the simplex, it can be time consuming. Thus, we project the gradient $\nabla_w \ell$ itself onto the tangent space of the unit simplex and use it for gradient descent, invoking the simplex projection algorithm only when some weights become too small or negative.
- To prevent the algorithm from taking excessively large steps, we use the learning rate to clip the norm of the gradient: if $g_t$ is the gradient and $\lambda$ is the learning rate, we use the update $w_{t+1} = w_t - \min\{\lambda, \|g_t\|_2\}\,\frac{g_t}{\|g_t\|_2}$.
- If the optimal value hasn't improved in a certain number of iterations, the algorithm may be oscillating above the minimum; we therefore halve the learning rate to encourage better convergence.
- If the learning rate becomes too small, the steps taken are too small to change the loss significantly, and we terminate the algorithm. We also terminate if the range of the past few losses is too small.
- We run gradient descent in parallel from $d + 1$ starting points: points close to the vertices of the simplex (corresponding to almost one-hot vectors) and the centroid of the simplex. This is done because convergence was observed to be slow for certain weights. At each step, $v_{\text{best}}$ is updated according to the point giving the minimum loss.

We ran the experiments on an NVIDIA RTX A5000 GPU. With the above settings, the algorithm takes about 30 minutes to check all 71 values of $p$.
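The following is a minimal sketch (our Python/NumPy illustration, not the authors' implementation) of the two gradient tricks above: the standard sorting-based Euclidean projection onto the probability simplex, in the spirit of the algorithm cited from [5], and the learning-rate-based norm clipping:

```python
import numpy as np

def project_to_simplex(v):
    # Euclidean projection of v onto the probability simplex,
    # via the standard sorting-based algorithm.
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * idx > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def clipped_step(w, grad, lr):
    # Project the gradient onto the tangent space of the simplex
    # (components summing to zero), then clip its norm to the learning rate.
    g = grad - grad.mean()
    norm = np.linalg.norm(g)
    if norm > 0:
        g = min(lr, norm) * g / norm
    w = w - g
    # Fall back to the full projection only if the iterate left the simplex.
    if (w < 0).any() or abs(w.sum() - 1.0) > 1e-9:
        w = project_to_simplex(w)
    return w
```

Projecting the gradient onto the zero-sum tangent space keeps the iterate on the affine hull of the simplex, so the full projection is only needed when a coordinate actually becomes negative.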
C Semi-synthetic Experiments: Further Information

Here we provide additional results and details from our semi-synthetic experiments. Figure 3 shows cardinal case results with varying sample sizes and noise levels. Figure 4 provides ordinal case results across different sample sizes and values of $\tau$. Figure 5 displays ordinal case results for $p = 1.62$ across varying values of $\tau$.

Figure 3 (panels: (a) train loss and (b) noiseless test loss, each against the number of samples, 50-800): More results for the cardinal case with number of samples. Different lines show results for different values of added noise (0.0, 0.001, 0.01, 0.1, 0.167). Solid lines correspond to values for learnt parameters, whereas dotted lines correspond to values for real parameters.

D Additional Plots

Figure 6 shows a pair of utility vectors $(u, v)$ such that, with $w = \mathbf{1}_d/d$, $\log M(u; w, p) - \log M(v; w, p)$ is non-convex in $p$. Upon slightly changing the value of $v$ to $v'$, there can be a significant change in the region $\{p : \log M(u; w, p) - \log M(v; w, p) > 0\}$.

E Simulations

We conduct additional simulations on cardinal and ordinal data with logistic noise. For each $d$ and $n$, we construct a dataset in a specified range $[u_{\min}, u_{\max}]^d = [1, 1000]^d$. Each individual $i$ is assumed to have a scaled and translated beta distribution over $[u_{\min}, u_{\max}]$, with the parameters $(\alpha_i, \beta_i)$ differing across individuals. Utilities for each action are drawn independently for each individual to construct a utility vector. The underlying weight vector is sampled uniformly from $\Delta^{d-1}$.

To learn $p$ (and $w$ if needed), we assume $p$ lies in a fixed range, in this case $[-10, 10]$. We begin with a random sampling stage, in which $N_{\text{random}}$ instances of $p$ (and $w$) are sampled uniformly. At the end of this stage, we select the parameter set with the lowest training loss and then perform gradient descent for $N_{\text{grad}}$ steps. We observe that this two-stage method yields good results across the values of $d$ we consider; a code sketch of the procedure appears below. Each setting is run three times to obtain error bounds on the empirical results.

Figure 4 (panels: (a) train loss, (b) learnt $p$, (c) test accuracy, and (d) noiseless test loss, each against the number of samples, 500-8000): More results for the ordinal case with number of samples. Different lines show results for different values of $\tau$ (0.1, 1.0, 10.0, 40.0, 100.0). Solid lines correspond to values for learnt parameters, whereas dotted lines correspond to values for real parameters.
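Below is a minimal sketch of the two-stage procedure described above for the cardinal case (our illustration, assuming NumPy and the hypothetical `power_mean` and `project_to_simplex` helpers from the earlier sketches; the finite-difference gradient and all constants are our assumptions, not the authors' implementation):

```python
import numpy as np

def l2_loss(U, y, w, p):
    preds = np.array([power_mean(u, w, p) for u in U])
    return float(np.mean((preds - y) ** 2))

def two_stage_fit(U, y, rng, n_random=500, n_grad=200, lr=0.05, p_range=(-10.0, 10.0)):
    d = U.shape[1]
    # Stage 1: uniform random sampling of (p, w); keep the best draw.
    best_loss, best_w, best_p = np.inf, None, None
    for _ in range(n_random):
        p = rng.uniform(*p_range)
        w = rng.dirichlet(np.ones(d))   # uniform on the simplex
        loss = l2_loss(U, y, w, p)
        if loss < best_loss:
            best_loss, best_w, best_p = loss, w, p
    # Stage 2: projected gradient descent from the best random draw,
    # with finite-difference gradients to keep the sketch self-contained.
    theta, eps = np.concatenate([best_w, [best_p]]), 1e-5
    for _ in range(n_grad):
        f0, g = l2_loss(U, y, theta[:d], theta[d]), np.zeros_like(theta)
        for j in range(len(theta)):
            t = theta.copy()
            t[j] += eps
            g[j] = (l2_loss(U, y, t[:d], t[d]) - f0) / eps
        theta -= lr * g
        theta[:d] = project_to_simplex(theta[:d])
    return theta[:d], float(theta[d])

# Example usage with beta-distributed utilities on [1, 1000], as in Appendix E.
rng = np.random.default_rng(0)
d, n = 5, 200
alphas, betas = rng.uniform(1, 5, d), rng.uniform(1, 5, d)
U = 1.0 + 999.0 * rng.beta(alphas, betas, size=(n, d))
w_true = rng.dirichlet(np.ones(d))
y = np.array([power_mean(u, w_true, 2.0) for u in U])
w_hat, p_hat = two_stage_fit(U, y, rng)
```

For the ordinal experiments with logistic noise, the same skeleton applies with the logistic loss and an additional temperature coordinate $\tau$.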
In the unknown weights case, sampling takes place in $d$ dimensions. As $d$ increases, we encounter the curse of dimensionality: $N_{\text{random}}$ would need to grow exponentially with $d$ to maintain the same sampling density, which is impractical for larger dimensions. As a compromise, we increase $N_{\text{random}}$ linearly with $d$.

E.1 Cardinal Values

For cardinal values, we add Gaussian noise to each $y_i$ with standard deviation $(u_{(d)} - u_{(1)})/10$ and clamp the values to $[u_{(1)}, u_{(d)}]$. Experiments are conducted for both known and unknown weights with $p = 2$. Figure 7a (known weights) and Figure 7b (unknown weights) show the estimated test loss on noiseless test data generated using the true parameters. We observe relatively little change in the test loss for known weights as $n$ increases. However, for higher $d$, there is a greater decrease in test loss with increasing $n$ when the weights are also being learned. The estimated test loss also increases with $d$, with a stronger trend for unknown weights.

E.2 Logistic Noise

For logistic noise, we generate pairs of utility vectors with $p = 0.9$ and a $w$ obtained through random sampling, then mislabel each instance according to Equation (1) with $\tau = 10$. Since we also need to learn $\tau$, we set $\tau_{\max} = 50$ and sample it uniformly along with $p$ (and $w$). Figure 7c (known weights) and Figure 7d (unknown weights) show the accuracy of the learned parameters on noiseless test data.

Figure 5 (panels: (a) test loss, (b) KL divergence between the true and learnt weight vectors, (c) noiseless test accuracy, (d) train loss, (e) learnt $p$, (f) test accuracy, and (g) noiseless test loss, each against the number of samples, 500-8000): More results for the ordinal case with $p = 1.62$. Different lines show results for different values of $\tau$ (0.1, 1.0, 10.0, 40.0, 100.0). Solid lines correspond to values for learnt parameters, whereas dotted lines correspond to values for real parameters.

Figure 6 (difference in log power means against $p$, for pairs $(u, v)$ and $(u, v')$): An example showing the non-convexity of $\log M(u; w, p) - \log M(v; w, p)$. The function has five roots for $(u, v)$, but is translated downwards for $(u, v')$ and has only three roots in that case. If the correct label is 1 for both pairs, then $p$ should be greater than 6; however, gradient-based optimization can stop between 3 and 4, which is a local optimum and does not give correct labels to both points.

Figure 7 (panels: (a) estimated noiseless test loss, cardinal social welfare values, known weights; (b) estimated noiseless test loss, cardinal social welfare values, unknown weights; (c) noiseless test accuracy, pairwise comparisons with logistic noise, known weights; (d) noiseless test accuracy, pairwise comparisons with logistic noise, unknown weights; curves for $d = 5, 10, 25, 50$): Results for synthetic data on cardinal and ordinal logistic tasks.

Across the different settings, the proportion of correctly labeled samples in the training dataset has a mean of 71.4%, with a maximum of 86.5%. For known weights, accuracy increases with $n$, and mean accuracy remains high ($> 93\%$) across $d$. Differences between the curves for various values of $d$ are minimal, with all approaching near-perfect accuracy as $n$ becomes large.
This suggests that the error bounds may be independent of $d$ in the known weights case. For unknown weights, a clear trend of decreasing performance with increasing $d$ is observed, as expected given the $O(\sqrt{d \log d})$ dependence of the logistic-loss error bounds. Nevertheless, all settings achieve high accuracy as $n$ increases. Up to moderately high $d$, the logistic noise model successfully finds highly accurate parameters despite significant mislabeling in the training data.

Figure 8 (test accuracy on noiseless data against $\eta = \sqrt{n / (d \log n \log d)}$): Verification of the $O(d \log d)$ risk bound for the ordinal case with logistic noise and unknown weights.

In Figure 8, we re-plot the test accuracy $\alpha$ on noiseless data against $\eta = \sqrt{n / (d \log n \log d)}$, a re-scaled version of Figure 7d. Theoretically, $\alpha$ and $\eta$ are related as $1 - \alpha = O(1/\eta)$. The alignment of all the curves in Figure 8, compared to the original curves in Figure 7d, provides evidence that our risk and sample complexity bounds indeed scale as $d \log n \log d$ for the ordinal case with logistic noise and unknown weights.
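The re-scaling behind Figure 8 is straightforward to reproduce. Here is a sketch (assuming NumPy and matplotlib; the accuracy curves are synthetic stand-ins generated to obey the theoretical relation, since the measured values are not reproduced in text):

```python
import numpy as np
import matplotlib.pyplot as plt

ns = np.array([500, 1000, 2000, 4000, 8000])
for d in (5, 10, 25, 50):
    eta = np.sqrt(ns / (d * np.log(ns) * np.log(d)))
    acc = np.clip(1.0 - 1.0 / eta, 0.5, 1.0)  # synthetic stand-in: 1 - acc = 1/eta
    plt.plot(eta, acc, marker="o", label=f"d = {d}")

plt.xlabel(r"$\eta = \sqrt{n / (d \log n \log d)}$")
plt.ylabel("Test accuracy (noiseless)")
plt.legend()
plt.show()  # with real measurements, curves for all d should align
```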