# hypothesis_testing_for_generalized_thurstone_models__5da4df85.pdf

Hypothesis Testing for Generalized Thurstone Models

Anuran Makur * 1 2 Japneet Singh * 2

In this work, we develop a hypothesis testing framework to determine whether pairwise comparison data is generated by an underlying generalized Thurstone model TF for a given choice function F. While prior work has predominantly focused on parameter estimation and uncertainty quantification for such models, we address the fundamental problem of minimax hypothesis testing for TF models. We formulate this testing problem by introducing a notion of separation distance between general pairwise comparison models and the class of TF models. We then derive upper and lower bounds on the critical threshold for testing that depend on the topology of the observation graph. For the special case of complete observation graphs, this threshold scales as Θ((nk) 1/2), where n is the number of agents and k is the number of comparisons per pair. Furthermore, we propose a hypothesis test based on our separation distance, construct confidence intervals, establish time-uniform bounds on the probabilities of type I and II errors using reverse martingale techniques, and derive minimax lower bounds using information-theoretic methods. Finally, we validate our results through experiments on synthetic and real-world datasets.

1. Introduction

Learning rankings from data is a fundamental problem underlying numerous applications, including recommendation systems (Jannach et al., 2016), sports tournaments (Bradley & Terry, 1952; Cattelan et al., 2013), fine-tuning large language model (LLMs) (Ouyang et al., 2022), and social

The author ordering is alphabetical. 1Department of Computer Science, Purdue University, West Lafayette, IN, USA 2Elmore Family School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, USA. Correspondence to: Anuran Makur <amakur@purdue.edu>, Japneet Singh <sing1041@purdue.edu>.

Proceedings of the 42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

choice theory (Luce, 1959; Yellott, Jr., 1977). The class of generalized Thurstone models (GTMs) (Thurstone, 1927; Nelder & Wedderburn, 1972; Mc Cullagh & Nelder, 1989), which fall under the broader framework of random utility models, is a widely adopted framework for ranking agents, items, or choices based on given preference data. GTMs include many other models as special cases, most notably the Bradley-Terry-Luce (BTL) model (Bradley & Terry, 1952; Luce, 1959; Mc Fadden, 1973), which has been widely studied. Given n agents [n] = {1, . . . , n}, GTMs can be construed as likelihood models for pairwise comparisons between pairs of agents. In particular, a GTM TF assumes that each agent i is endowed with an unknown utility parameter wi R and the probability that agent i is preferred over agent j (e.g., i beats j in a game) is given by F(wi wj), where F represents a known choice function which is a cumulative distribution function (CDF).

While GTMs have been utilized in many contexts, e.g., (Thurstone, 1927; Elo, 1986), they are parametric models where n utility parameters characterize the model. Indeed, the assumption that pairwise comparison data is governed by a small number of parameters forms the basis of most results on GTMs (Cattelan et al., 2013; Vojnovic & Yun, 2016; Shah et al., 2016; Jadbabaie et al., 2020; 2024). However, such parametric models can sometimes be too restrictive, failing to capture intricacies in real-world applications (Davidson & Marschak, 1974; Mc Laughlin & Luce, 1965; Tversky, 1972). Notably, GTMs struggle to accommodate context-dependent effects, such as the home-advantage effect observed in sports tournaments (Clarke & Norman, 1995; Morley & Thomas, 2005), where teams may perform differently when playing at home versus away. Furthermore, GTMs assume transitive relationships, which may not hold in real-world datasets. To accurately capture the complex and diverse behaviors observed in real-world data, non-parametric models, e.g., (Chatterjee, 2015; Shah et al., 2016), have been studied as an alternative. This conversation raises an important question: Given pairwise comparison data, can we determine whether it is comes from a specific GTM? If it does, then we can rely on the vast GTM literature for learning and interpretation, and if it does not, then we can resort to using other parametric models, such as the Mallows model (Mallows, 1957), or non-parametric models (Seshadri et al., 2020; Mania et al., 2018; Bong et al., 2020).

Hypothesis Testing for Generalized Thurstone Models

Despite extensive research in the area, there is no systematic answer to the above question in the literature, i.e., there is no rigorously analyzed hypothesis test to determine whether given pairwise comparison data conforms to an underlying GTM model. To address this, we study the composite hypothesis testing problem of whether data obeys a GTM TF for a given choice function F (which is a special kind of CDF):

H0 : Z TF for some choice of w W,

H1 : Z pairwise comparison model that is not TF , (1)

where Z denotes the pairwise comparison data, H0 and H1 are the null and alternative hypotheses, respectively, and W denotes the parameter space for the parameters w.

1.1. Main Contributions

We analyze the composite hypothesis testing problem outlined in (1). Our main contributions include the following:

1. We frame the hypothesis testing problem in a minimax sense (Section 2) by developing a rigorous notion of separation distance to the class of all GTMs that admits tractable analysis (Section 3, Theorem 3.1). 2. We derive upper and lower bounds on the critical threshold for our test (Section 3, Theorem 3.2 and Proposition 3.5). These bounds exhibit a dependence on the graph induced by the pairwise comparison data (see Table 1) and are tight for complete graphs. 3. We use the separation distance to propose a hypothesis test and establish various theoretical guarantees for our test. Specifically, we prove time-uniform type I and type II error probability upper bounds for our test (Section 3, Theorems 3.4 and 3.7), and also provide a minimax lower bound. 4. Additionally, we obtain auxiliary results like ℓ2-error bounds on parameter estimation for general pairwise comparison models (Theorem 3.3) and time-uniform confidence intervals under the null hypothesis (Proposition 3.8). 5. Finally, we validate our theoretical findings through synthetic and real-world experiments, proposing a datadriven approach to determine the test threshold and using the test to determine different choice functions fit to the data (Section 4).

1.2. Related Literature

The class of GTMs has a rich history in the analysis of preference data. Initially proposed by Thurstone (Thurstone, 1927), these models are widely used in various fields, ranging from psychology (Thurstone, 1931), economics (Marschak, 1974), and more recent applications like aligning LLMs with human preferences (Ouyang et al., 2022).

Table 1: Bounds in this work on critical threshold εc for various induced observation graphs, where n represents the number of agents and k is the number of comparisons per pair of agents.

Graph Type Upper Bound Lower Bound

Complete graph O 1

d-Regular graph O 1

Single cycle O 1

Toroidal grid O 1

Early foundational works, e.g., (Thurstone, 1927; Luce, 1959; Yellott, Jr., 1977), explored different cumulative distribution functions F for modeling choice probabilities, including Gaussian (Thurstone, 1927), logistic (Bradley & Terry, 1952), and Laplace (Dawkins, 1969). These models and their extensions underlie popular rating systems, such as Elo in chess (Zermelo, 1929; Elo, 1986) and True Skill in video games (Herbrich et al., 2006). Several recent works have actively explored estimation techniques for Thurstone models. For instance, (Vojnovic & Yun, 2016) estimated parameters of Thurstone models when the preference data is derived from general subsets of agents (not specifically pairs), and (Shah et al., 2016) focused on parameter estimation for GTMs and the effect of graph topology on the estimation accuracy.

Furthermore, a significant portion of the literature has focused on parameter (and skill distribution) estimation in the special case of the BTL model, e.g., (Simons & Yao, 1999; Negahban et al., 2017; 2012; Chen et al., 2019; Jadbabaie et al., 2020; 2024), where two prominent algorithms are spectral ranking (Negahban et al., 2017; 2012) and maximum likelihood estimation (Zermelo, 1929; Hunter, 2004). Another related line of work is on uncertainty quantification for estimated parameters (Gao et al., 2023; Han et al., 2024; Fan et al., 2024; Liu et al., 2023). For example, (Gao et al., 2023) established the asymptotic normality of estimated parameters in the BTL model for both spectral ranking and maximum likelihood estimation, and (Han et al., 2024) generalized the asymptotic normality results to a broader class of models, such as GTMs and Mallows model. In a different vein, (Chierichetti et al., 2022) provided sharp sample complexity bounds for A/B testing needed to distinguish similar alternatives.

Despite the extensive work on parameter estimation, relatively few studies have rigorously investigated hypothesis

Hypothesis Testing for Generalized Thurstone Models

testing for such parametric models. Notably, (Rastogi et al., 2022) developed two-sample tests for preference data, while (Busa-Fekete et al., 2021) studied hypothesis testing for the Mallows model. (Seshadri & Ugander, 2019) studied lower bounds for testing the independence of irrelevant alternatives (IIA) assumption (i.e., BTL and Plackett-Luce models (Luce, 1959; Plackett, 1975)), and (Makur & Singh, 2023; 2024) developed hypothesis tests for BTL models based on spectral methods. In contrast to these works, we develop hypothesis testing for GTMs using a maximum likelihood framework, complementing the work in (Makur & Singh, 2023; 2024).

2. Formal Model and Setup

We begin by introducing a general pairwise comparison model that provides a flexible framework encompassing a broad range of established probabilistic models, including the BTL model (Bradley & Terry, 1952; Luce, 1959; Mc Fadden, 1973), the Thurstone model (Thurstone, 1927), and non-parametric models (Chatterjee, 2015; Shah et al., 2016). In this framework, we consider n N\{1} agents (or items or choices) [n] engaged in pairwise comparisons. For agents i, j [n] with i = j, let pij (0, 1) denote the probability that i is preferred over j in an i vs. j pairwise comparison. This model inherently captures the asymmetric nature of pairwise comparisons, as the outcome of an i vs. j comparison may differ from that of a j vs. i comparison. This reflects real-world phenomena like home-advantage that are commonly observed in sports (Clarke & Norman, 1995; Morley & Thomas, 2005).

To model the fact that not all pairwise comparisons may be observed, we assume that we are given an induced observation graph G = ([n], E), where an edge (i, j) E (with i = j) exists if and only if comparisons of the form i vs. j are observed. Let E {0, 1}n n be the adjacency matrix of G, with Eij = 1 if (i, j) E and 0 otherwise. Furthermore, we assume that the edge set E is symmetric (i.e., G is undirected), implying that if i vs. j comparisons are observed, then j vs. i comparisons are observed as well. Additionally, we assume that G is connected and is fixed a priori (see Proposition 2.4), independent of the outcomes of the observed pairwise comparisons.

We also let D Rn n be the diagonal degree matrix with Dii = Pn j=1 Eij for i [n], and L D E be the graph Laplacian matrix. L can be expressed as L = XTX, where X R(|E|/2) n is the matrix formed by collecting row vectors xij = ei ej for (i, j) E and j > i, with ei being the ith standard basis vector in Rn. For the Laplacian L, we define the semi-norm with respect to L as x L =

x TLx for all x Rn (which is a semi-norm since L has a zero eigenvalue).

2.1. Comparison Models

Pairwise comparison model: Given the above preamble, we now formally present general pairwise comparison models (which encompass BTL models (Bradley & Terry, 1952; Luce, 1959; Mc Fadden, 1973), Thurstone models (Thurstone, 1927), non-parametric models (Chatterjee, 2015; Shah et al., 2016), etc., as mentioned above).

Definition 2.1 (Pairwise Comparison Model). Given an observation graph G over the agents [n], we refer to the collection of probability parameters {pij : (i, j) E} as a pairwise comparison model.

Furthermore, we can represent a pairwise comparison model by a pairwise comparison matrix P [0, 1]n n with

( pij, (i, j) E, 0, otherwise. (2)

We remark that our ensuing analysis can be easily specialized to a symmetric setting where i vs. j and j vs. i comparisons are equivalent. In this case, E is automatically symmetric as assumed. On the other hand, the symmetry assumption on E is needed in asymmetric settings because GTMs inherently treat i vs. j and j vs. i comparisons as equivalent, which is not true in general models.

GTM model: Next, we describe a GTM for a choice function F : R [0, 1], which is a special kind of CDF (to be explained in the sequel).

Definition 2.2 (Generalized Thurstone Model). Given an observation graph G, a pairwise comparison model is said to be a generalized Thurstone model (GTM) TF with choice function F : R [0, 1] if there exists a weight (or utility) vector w W such that:

(i, j) E, pij = F(wi wj),

where W Rn is a specified convex parameter space (usually Rn or a compact hypercube in Rn).

The GTM (Nelder & Wedderburn, 1972; Mc Cullagh & Nelder, 1989) posits that every agent i has a latent utility wi, and uncertainty in the comparison process is modeled by independent and identically distributed (i.i.d.) noise random variables X1, . . . , Xn with absolutely continuous CDF G : R [0, 1]. The discriminant variables (w1 + X1, . . . , wn + Xn) formed by combining utilities with the noise random variables are then compared to determine the outcomes of pairwise comparisons. Hence, the probability of preferring agent i over j is given by

P(i preferred over j) = P(wi + Xi > wj + Xj)

G(y + wi wj)G (y) dy = F(wi wj). (3)

Hypothesis Testing for Generalized Thurstone Models

As noted earlier, GTMs also encompass a wide range of known parametric models as special cases, e.g., Thurstone models with F(t) = R t 1

2πe x2/2 dx (complementary Gaussian CDF) (Thurstone, 1927), BTL models with F(t) = 1/(1 + e t) (sigmoid function, which stems from Gumbel CDFs) (Bradley & Terry, 1952; Luce, 1959; Yellott, Jr., 1977), Dawkins models (Dawkins, 1969), etc. We can also define a pairwise probability matrix F(w) [0, 1]n n

for a GTM TF with weight vector w via

( F(wi wj), (i, j) E, 0, otherwise. (4)

We next describe the data generation process for GTMs and general pairwise comparison models alike. For any pair (i, j) E, define the outcome of the mth i vs. j pairwise comparison between them as the Bernoulli random variable

( 1, i preferred over j (with probability pij), 0, j preferred over i (with probability 1 pij), (5) for m [kij], where kij denotes the number of observed i vs. j comparisons. The given pairwise comparison data is then a collection of these independent Bernoulli variables Z {Zm ij : (i, j) E, m [ki,j]}. For convenience, we

also let Zij Pkij m=1 Zm ij and ˆpij Zij/kij.

Parameter estimation for GTM: To present our testing formulation in the sequel, we explain how the parameters of a TF model are estimated given pairwise comparison data Z (Shah et al., 2016). First, we define the weighted negative log-likelihood function l : W [0, 1]|E| R+ {+ } as

l(w; {ˆpij : (i,j) E}) X

(i,j) E ˆpij log(F(wi wj))

+ (1 ˆpij) log(1 F(wi wj)). (6)

Note that this function represents a weighted variant of the typical log-likelihood function used in parameter estimation (Shah et al., 2016; Vojnovic & Yun, 2016). The weights of the TF model are estimated by minimizing l:

ˆw arg min w Wb l(w; {ˆpij : (i, j) E}), (7)

where the constraint set Wb {w W : w b, w T1 = 0} for some (universal) constant b, 1 Rn denotes an all-ones vector, and the constraint w T1=0 allows for identifiability of the weights.

2.2. Assumptions on Comparison Models

To facilitate the analysis of the hypothesis testing problem in (1), we introduce a simplifying assumption on the class of general pairwise comparison models. We assume that the pairwise probabilities pij are bounded away from 0 and 1.

Assumption 2.3 (Dynamic Range). There exists a constant δ > 0 such that for any pairwise comparison model under consideration, pij [δ, 1 δ] for all (i, j) E.

Note that under the null hypothesis, the Assumption 2.3 is satisfied by all TF models with weights bounded by F 1(1 δ)/2. Subsequently, we assume that the constant b satisfies b F 1(1 δ)/2. For any given pairwise comparison model {pij : (i, j) E}, define w Wb be the weights of a TF model that best approximates this pairwise comparison model in the maximum likelihood sense:

w arg min w Wb l(w; {pij : (i, j) E}). (8)

Finally, we also assume in the sequel that the given choice function F exhibits strong log-concavity and has a bounded derivative on [ 2b, 2b], i.e., there exist constants α, β > 0 such that:

x [ 2b, 2b], d2

dx2 log(F(x)) α and F (x) β. (9) Several popular GTMs, including the BTL and Thurstone (Case V) models, satisfy both the above assumptions. The following proposition highlights that w always exists and is unique for a strongly log-concave function F on Wb.

Proposition 2.4 (Existence and Uniqueness of Maximum Likelihood). Suppose the observation graph G is connected, the choice function F : R [0, 1] satisfies (9), and Assumption 2.3 holds. Then, there exists a unique optimal solution w Wb satisfying (8).

The proof is provided in Appendix A.2. It follows from Proposition 2.4 and Gibbs inequality that when the pairwise comparison model is indeed a TF model with weight vector w, then we have w = w.

2.3. Minimax Formulation

Given any fixed graph G, separation level ϵ > 0, choice function F, and constants δ > 0 and b F 1(1 δ)/2, define the sets M0 and M1(ϵ) of TF and pairwise comparison models:

M0 {P : Assumption 2.3 holds and

w Wb such that P = F(w)}, (10)

M1(ϵ) P : Assumption 2.3 holds and

inf w Wb 1 n P F(w) F ϵ , (11)

where F denotes Frobenius norm. Now, we formalize the hypothesis testing problem in (1) as:

H0 : Z P M0,

H1 : Z P M1(ϵ). (12)

Hypothesis Testing for Generalized Thurstone Models

We will discuss the separation distance infw Wb P F(w) F later. For now, note that we only test on the set of observed comparisons E as it is not possible to determine whether the comparisons on Ec would conform to a TF model or some other pairwise comparison model. Next, for any fixed graph G, choice function F, and constants ϵ, δ, b, we define the minimax risk as

R(G, ϵ) inf ϕ

sup P M0 PH0(ϕ(Z) = 1) | {z } Z P under H0

sup P M1(ϵ) PH1(ϕ(Z) = 0)

| {z } Z P under H1

where the infimum is taken over all randomized decision rules ϕ(Z) {0, 1} (with 0 corresponding to H0 and 1 to H1), and PH0 and PH1 denote the probability measures under hypotheses H0 and H1, respectively. Intuitively, this risk minimizes the sum of the worst-case type I and type II error probabilities. Finally, we define the critical threshold of the hypothesis testing problem in (12) as the smallest value of ϵ for which the minimax risk is bounded by 1

2 (cf. (Rastogi et al., 2022)):

εc inf ϵ > 0 : R(G, ϵ) 1

Note that the constant 1

2 here is arbitrary and can be replaced by any constant in (0, 1).

3. Main Results

In this section, we present the main results of the paper. We first show that our notion of separation distance can be simplified for analysis, then proceed to bound the critical threshold and minimax risk, and finally, establish type I and II error probability bounds in the sequential setting.

3.1. Separation Distance and Test Statistic

Recall that to formalize (12), we defined the separation distance of a pairwise comparison model P to the class of TF models as infw Wb P F(w) F (for fixed F). To make this separation distance more amenable to theoretical analysis, we approximate it in the next theorem with the simpler quantity P F(w ) F, where w is given in (8). Theorem 3.1 (Separation Distance to TF Models). Let P be a pairwise comparison matrix satisfying Assumption 2.3. Then, there exists a constant c1 > 0 (that does not depend on n) such that the separation distance between P and the class of TF models satisfies

c1 P F(w ) F inf w Wb P F(w) F P F(w ) F,

where w is given by (8).

The proof is provided in Appendix A.3. The upper bound is immediate, and the lower bound utilizes the informationtheoretic bounds between f-divergences.

Test statistic: We now introduce our test statistic based on the approximation derived in Theorem 3.1. First, we partition the observed comparison data Z into two (roughly) equal parts Z1 = {Zm ij : (i, j) E, m [ kij/2 ]} and Z2 = Z \ Z1. The first half of the dataset Z1 is used to estimate the parameters ˆw as shown in (7). Then, we use Z2 to calculate the test statistic T via

k ij(k ij 1) + F( ˆwi ˆwj)2

2F( ˆwi ˆwj)Zij

1k ij>1, (15)

where k ij = kij kij/2 , Zij = P

m> kij/2 Zm ij is computed as before but using only the samples in Z2, and 1A denotes the indicator function on A. For an intuitive explanation of the definition of this statistic, consider the quantity

k ij(k ij 1) +F(w i w j )2 2F(w i w j )Zij

(16) obtained by substituting w in place ˆw in (15). Then, the expected value of T is P F(w ) 2 F. This is because the expected value of the first term is p2 ij, and the last term is 2F(w i w j )pij. Hence, T is constructed by plugging in ˆw in place of w in an unbiased estimator of P F(w ) 2 F. Our proposed hypothesis test thresholds T to determine the unknown hypothesis; H1 is selected if T exceeds a certain threshold. There is some resemblance between the test statistic T and those in (Makur & Singh, 2023; 2024; Rastogi et al., 2022) since all these statistics are estimators for squared Frobenius distances of pairwise comparison models. However, the techniques used to analyze them are very different. For example, our theoretical analysis crucially relies on sample splitting, while the other two do not. We will discuss analytical expressions for the threshold next and a data-driven approach to determining the threshold in Section 4.

3.2. Upper Bound on Critical Threshold

In this section, we make the simplifying assumption that kij = 2k (with k N) for all (i, j) E. The ensuing theorem, proved in Appendix B, establishes an upper bound on the critical threshold of the hypothesis testing problem defined in (12).

Theorem 3.2 (Upper Bound on Critical Threshold). Consider the hypothesis testing problem in (12), and assume that Assumption 2.3 holds and k 2. Then, there exists a

Hypothesis Testing for Generalized Thurstone Models

constant c2 > 0 such that the critical threshold defined in (14) is upper bounded by

In our analysis, we select H1 if T > γ n

k and H0 otherwise, where γ is an appropriate constant independent of n, k (see (37)). The analysis relies on establishing non-trivial error bounds (in L semi-norm) for parameter estimation of TF models when the data is generated by a general pairwise comparison model, which is not necessarily a GTM (i.e., deriving error bounds under a potential model mismatch). The error bounds allow us to prove bounds on the mean and variance of the test statistic T under both hypotheses H0 and H1. Then, using Chebyshev s inequality, we can bound the probabilities of error of our test under each of the hypotheses, which induces an upper bound on the critical threshold.

We also note that in the special case where TF is a BTL model, our upper bound on εc recovers the bound in (Makur & Singh, 2023; 2024) for complete graphs. But our likelihood-based proof is quite different to the spectral ideas in (Makur & Singh, 2023; 2024). Finally, we present the key error bounds for parameter estimation when data is generated by a general pairwise comparison model needed to prove Theorem 3.2.

Theorem 3.3 (Error Bounds for Parameter Estimation). Consider any pairwise comparison model satisfying Assumption 2.3 with w given by (8) and ˆw constructed according to (7) from data generated by the model. Then, for some constant c3 > 0, the following tail bound holds on the estimation error of w :

t 1, P ˆw w 2 L c3nβ2

α2k F( 2b)2 t e t,

where α is defined in (9). Moreover, for any p 1, there exists a p-dependent constant c(p) > 0 such that the expected pth moment of the error is bounded by

E[ ˆw w p L] c(p)nβ2

α2k F( 2b)2

The proof is provided in Appendix A.4. In the special case where the pairwise comparison model is a GTM, our bounds recover the bounds derived in (Shah et al., 2016, Theorem 3) up to constants. However, our result is much more general because it holds for any pairwise comparison model; this requires a careful formulation and development of the proof techniques.

We remark that these error bounds can be readily converted into ℓ2-error bounds using the relation ˆw w 2 L

λ2(L) ˆw w 2 2, where λ2(L) is the second smallest eigenvalue of the Laplacian L. The connectedness of the graph ensures that λ2(L) > 0, and the value of λ2(L) is known for various classes of graphs (cf. (Chung, 1997)).

Moreover, it is worth emphasizing that our error bounds in L semi-norm (in Theorem 3.3) remain valid even when the observation graph is disconnected. Furthermore, since the proof of Theorem 3.2 relies only on these bounds, the results of Theorem 3.3 hold regardless of graph connectivity. When the graph G is disconnected, solutions ˆw and w of (7) and (8) may not be unique, but the error ˆw w L is well-defined as the non-unique component lies in the null space of L. Thus, our upper bounds on critical thresholds still hold. This ensures that our testing framework for TF models remains applicable even in settings where parameter estimation is infeasible due to a disconnected observation graph, highlighting a fundamental distinction between testing and parameter estimation of TF models. Moreover, we also note that our results lead to bounds on critical thresholds even when the separation distance is defined in other norms; see Appendix G for more details.

3.3. Information-Theoretic Lower Bounds

We now establish information-theoretic lower bounds on the minimax risk and critical threshold for the hypothesis testing problem in (12). For simplicity and analytical tractability, assume that kij = k N for all (i, j) E, and assume that the observation graph G is super-Eulerian (Catlin, 1992), i.e., it has an Eulerian spanning sub-graph G = ([n], E) so that every vertex of G has even degree. Then, G has a cycle decomposition C by Veblen s theorem (Biggs et al., 1976; Seshadri & Ugander, 2019), where C is a collection of simple cycles σ that partitions the undirected edges of G. The ensuing theorem, proved in Appendix C, presents our minimax risk lower bound. Theorem 3.4 (Minimax Lower Bound). Consider the hypothesis testing problem in (12) and assume that the observation graph G is super-Eulerian with spanning Eulerian sub-graph G. Then, there exists a constant c4 > 0 such that for any ϵ > 0, the minimax risk in (13) is lower bounded by

R(G, ϵ) 1 1

where |σ| denotes the length of a cycle σ C, and C is the cycle decomposition of G.

Our argument utilizes the Ingster-Suslina method (Ingster, 1994; Ingster & Suslina, 2003), which is similar to Le Cam s method, but provides a lower bound by considering a cleverly chosen point and a mixture on the parameter space instead of just two points. Our specific construction is inspired by the technique introduced in (Seshadri & Ugander,

Hypothesis Testing for Generalized Thurstone Models

2019), which establishes a lower bound for testing the IIA assumption (i.e., BTL models) for Eulerian graph structures. We extend their approach in three ways. First, we generalize their method to accommodate any GTM rather than just the BTL model. Second, we use a different technique based on Theorem 3.1 to lower bound separation distance from the class of TF models. Moreover, our work quantifies separation using Frobenius norm instead of sums of total variation (TV) distances. Third, our argument holds for a broader class of graphs, namely, super-Eulerian graphs. Note that the question of algorithmically constructing Eulerian subgraphs of graphs has been widely studied (Haghparast & Kiani, 2019).

The following proposition simplifies Theorem 3.4 to obtain lower bounds on the critical threshold for several classes of graphs.

Proposition 3.5 (Lower Bounds on Critical Threshold). Under the assumptions of Theorem 3.4, the following lower bounds hold for the critical threshold defined in (14):

1. If G is a complete graph with odd n vertices, then ε2 c = Ω(1/nk).

2. If G is a d-regular graph with constant d 2, then ε2 c = Ω(1/n2k).

3. If G is a single cycle graph with n vertices, then ε2 c = Ω(1/n2k).

4. If G is a two-dimensional n n toroidal grid on n vertices formed by the Cartesian product of two cycles of length n, then ε2 c = Ω(1/n7/4k).

The proof of Proposition 3.5 is provided in Appendix C.1. It involves calculating the number of simple cycles and the individual cycle lengths in the cycle decompositions C. The lower bounds on εc are then obtained from Theorem 3.4. We remark that our minimax upper and lower bounds on εc match for the complete graph case, demonstrating the minimax optimality of the threshold s scaling (up to constant factors). Moreover, they also match with respect to k for other classes of graphs as well. It is worth mentioning that in the special case of BTL models with single cycle graphs, our lower bound on εc improves the high-level scaling behavior in (Seshadri & Ugander, 2019) from Ω(1/

n3k) to Ω(1/

n2k) (when εc is quantified in terms of Frobenius norm). Lastly, we remark that for single cycle graphs, the gap between the upper and lower bounds in terms of n intuitively holds because our lower bounds become larger when there are more cycles in C, which is only 1 in this case.

3.4. Upper Bounds on Probabilities of Type I and II Errors

To complement the minimax risk lower bound in Theorem 3.4, we establish upper bounds on the extremal type

I and II error probabilities. We will do this in the sequential setting, where data is observed incrementally a common practical scenario which subsumes the standard fixed sample-size setting, cf. (Manole & Ramdas, 2023; Howard et al., 2020). In the sequential testing framework, at each time step, we observe a single i vs. j comparison for every (i, j) E. (The subsequent analysis can be extended to a general setting where we observe only one comparison for some pair (i, j) E or even a variable number of comparisons at every time step.) At time k1 + k with k1, k N, we define T k1,k to be the value of the test statistic T in (15), where comparisons from k1 time-steps have been used to build the dataset Z1 to estimate parameters using ˆw, and k time-steps have been used to build the dataset Z2 to calculate the statistic T. Note that Z1 and Z2 no longer need to be similar in size. Then, we can decide based on thresholding T k1,k (see Theorem 3.7) whether to collect more data or stop and reject H0 while controlling the probabilities of error. If the testing process ends without rejecting H0, then we can accept H0. A key observation underlying our analysis is the following reverse martingale property (see, e.g., (Howard et al., 2020; Manole & Ramdas, 2023)).

Proposition 3.6 (Reverse Martingale). Fix any k1 N, and let Fk = N

(i,j) E σ(Pk1+k m=k1+1 Zm ij , Zk1+k+1 ij ,

Zk1+k+2 ij , . . . ) be a non-increasing sequence of σ-algebras, where N denotes the product σ-algebra. Then, the sequence of statistics {T k1,k : k 2} is a reverse martingale with respect to the reverse filtration {Fk :k 2}, i.e., for k 2, T k1,k is Fk-measurable and E T k1,k|Fk+1 =T k1,k+1.

The proof is presented in Appendix D.1. This observation allows us to develop time-uniform bounds in terms of k on the probabilities of type I and II errors, i.e., they hold for all k larger than a constant. The next theorem, proved in Appendix D.3, presents our bounds on the probabilities of type I and II errors.

Theorem 3.7 (Type I and Type II Error Probability Bounds). Under the sequential setting discussed above, the following bounds hold on the extremal type I and type II error probabilities. There exist constants c5, c6, c7, c8 such that for all t 1, ν (0, 1/e), k1 N, and ϵ c5

t/ nk1, we have

sup P M0 PH0

k 2, T k1,k c6t n

sup P M1(ϵ) PH1

k 2, T k1,k D c5t 1 2 n 1 2 /k

1 2 ℓk,ν k 4D +c8t 1 2 n 1 2 /k

Hypothesis Testing for Generalized Thurstone Models

where D P F(w ) F and ℓk,ν log(3.5 log2(k)2/ν).

We now make several remarks. Firstly, our error probability bounds encode the scalings of the thresholds to accept or reject H0 (see (37)). Secondly, our bounds hold regardless of how the decision-maker assigns data collected at different time-steps to Z1 and Z2. Moreover, they provide insights on how to split the data based on the topology of the observation graph, e.g., the bounds suggest an equal split of the data for complete graphs, whereas for a single cycle, achieving better type I error control requires a larger value of k1. To illustrate this and help parse Theorem 3.7, we present corollaries of Theorem 3.7 for the complete and single cycle graph cases in Appendix E.

Thirdly, our bounds clearly hold in the non-sequential fixed sample-size setting, as we can just fix a particular value of k. Hence, adding the two extremal probabilities of error yields upper bounds on the minimax risk. Notably, the proof of Theorem 3.7 requires us to develop a time-uniform version of the well-known Hanson-Wright inequality (Rudelson & Vershynin, 2013) specialized for our setting (see Lemma D.2 in Appendix D.2). Additionally, as an intermediate step in the proof, we also obtain time-uniform confidence intervals under the null hypothesis H0, as demonstrated in the following proposition. Proposition 3.8 (Confidence Interval for T k1,k). Suppose ˆw is estimated as in (7) from the comparisons over k1 timesteps. Then, there exists a constant c7 > 0 such that for all ν (0, 1/e) and k1 N,

k 2, T k1,k F( ˆw) F(w ) 2 F + c7

+ 4 F( ˆw) F(w ) F

Proposition 3.8 is established in Appendix D.2. We remark that the distribution of F( ˆw) F(w ) F above can be approximated either by leveraging the asymptotic normality of ˆw w (Gao et al., 2023; Han et al., 2024), or by utilizing bootstrapping techniques; this gives (1 2ν) timeuniform confidence intervals. Additionally, the constant c7 here is also the constant in our specialized version of Hanson-Wright inequality (noted above) and can be approximated via simulations for our setting. An empirical investigation into estimating the constant c7 and the subsequent confidence intervals can be found in Appendix F.3.

4. Experiments

In this section, we develop a data-driven approach to select the threshold for our test T and conduct simulations to validate our theoretical results on synthetic and realworld datasets. We include several additional experiments

in Appendix F to validate our threshold estimation procedure, compare our performance with the spectral method of (Makur & Singh, 2023; 2024) for the BTL model, and provide additional experiments on real-world datasets.

Estimating the threshold: Given a pairwise comparison dataset Z {Zm ij : (i, j) E, m [kij]}, we employ an empirical-quantile-based approach to determine the critical threshold for our hypothesis testing problem. We generate multiple TF models with random skill scores w Rn drawn independently and uniformly in [ b, b] and translated to satisfy w T1 = 0, and simulate kij i vs. j comparisons by sampling binomial random variables { Zij Bin(kij, F(wi wj)}(i,j) E. We then compute the test statistic T for each simulated dataset, repeating the process a sufficient number of times to build a distribution of test statistics. Finally, we extract the 95th percentile value from this distribution as our empirical threshold.

In our first experiment, we investigate the behavior of the thresholds for T based on this empirical-quantile-based approach for various values of n and k and for different graph topologies and TF models. We considered values of n ranging from 15 to 55 with intervals of 10, and set kij = k for all (i, j) E with k {12, 20}, graph topologies including complete graphs, n n toroidal grids, and sparse graphs generated from Erd os-R enyi G(n, p) models with parameter p = 2 log2(n)/n, and TF models such as standard Thurstone (Case V) and BTL models. For each choice of parameters, we generated 400 models by randomly sampling weights with b = F 1(0.98)/2 and generated synthetic comparison data. The scaled test statistic k T/n was computed for every parameter choice, and the 95th percentile value of this scaled T was identified as the threshold γ. Figure 1 plots these 95th percentile values with respect to n for various parameter choices. Notably, the value γ remains roughly constant with n, k and model TF for complete graphs. For all these cases, the values stabilize approximately to a certain constant, illustrating that the thresholds obtained via the empirical-quantile-based approach follow the same high-level scaling as the theoretical threshold in (37).

In our next experiment, we apply our test to the LMSYS chatbot leaderboard (Chiang et al., 2024), a widely used benchmark for evaluating the performance of LLMs. The dataset contains a collection of pairwise comparisons between various LLMs based on their response to prompts, which are then used to obtain Elo ratings. We retain the directional nature of comparisons, where an i vs. j comparison indicates model i as the first response and j as the second during the evaluation. We rank the LLMs based on their frequency of appearance in the dataset and perform the test repeatedly on the top-n LLMs in this ordering, with n ranging from 5 to 21 with gaps of 2, for both Thurstone and

Hypothesis Testing for Generalized Thurstone Models

20 30 40 50 60 n

95th Percentile Value

Thurstone, 2D Grid, k=12 Thurstone, 2D Grid, k=20 BTL, 2D Grid, k=12 BTL, 2D Grid, k=20 Thurstone, Complete, k=12 Thurstone, Complete, k=20

BTL, Complete, k=12 BTL, Complete, k=20 Thurstone, G(n,p), k=12 Thurstone, G(n,p), k=20 BTL, G(n,p), k=12 BTL, G(n,p), k=20

Figure 1: Estimated scaled threshold for various values of n, k, graph topologies, and TF models.

5 7 9 11 13 15 17 19 21 23 n

Normalized Values

kavg T/n for BTL

kavg T/n for Thurstone

Threshold for BTL Threshold for Thurstone

Figure 2: Scaled test statistics and the estimated thresholds evaluated on the LMSYS dataset.

BTL models. For each n, we plot the values in Figure 2 of the (scaled) test statistic kavg T/n and the obtained (scaled) thresholds using the quantile approach (with the same parameters as above), where kavg is the average of kij over all (i, j) E. By randomizing over the partitioning of the dataset Z into Z1 and Z2 and computing T each time, we essentially obtain a distribution of T and plot these values in Figure 2 as a scatter plot. The figure highlights that both BTL and Thurstone models perform well in modeling for smaller values of n with only 10% of samples above the threshold, but exhibit significant deviations for larger values of n as around 60% of samples are above the threshold for n = 21. The 60% of samples above the threshold can be interpreted as a bootstrapped estimate of the test s power indicating 40% chance that the model lies within 95% statistical deviations from BTL. The figure also shows that deviation from Thurstone increases as n increases, a top-9 batch size provides a statistically accurate fit to the Thurstone model, and the deviations are significant for n 21. The results suggest that when using a multi-tier Elo rating

system (Brams & Ismail, 2024), a group size of approximately 9 leads to accurate modeling for the top-n models, ranked based on their total available data.

5. Conclusion

In this work, we developed a rigorous testing framework to determine whether pairwise comparison data is generated by an underlying generalized Thurstone model with a given choice function F. We derived both upper and lower bounds on the critical threshold of our testing problem, which depended on the topology of the observation graph. These bounds were shown to be tight for certain graph classes, such as complete graphs. In addition, we proposed a hypothesis test based on our notion of separation distance and established theoretical guarantees for this test, including time-uniform bounds on type I and II error probabilities as well as a minimax lower bound on the risk of the testing problem. Alongside this, auxiliary results such as error bounds for parameter estimation and confidence intervals under the null hypothesis were derived. To validate our findings, we conducted experiments on both synthetic and real-world datasets and introduced a data-driven approach for determining the test threshold.

This study opens up several avenues for future research. For instance, extending the hypothesis testing framework to handle general multi-way comparisons rather than pairwise comparisons is one such direction. Another direction is developing active testing techniques within the framework of generalized Thurstone models that optimize test performance. Finally, extending the testing and estimation frameworks for such models to dependent data is also an important avenue as real-world data often exhibits correlations that could affect inference results.

Acknowledgements

This work was supported by the National Science Foundation (NSF) CAREER Award under Grant CCF-2337808.

Impact Statement

This work is primarily theoretical and does not have immediate societal consequences. However, if practitioners apply our approaches to make decisions in real-world settings, there could be both positive and negative implications. On the one hand, our methods can help determine accurate models for comparison data, such as whether data conforms to a TF model, or help detect certain kinds of biases in data. On the other hand, our methods concern ranking models and using any ranking mechanism without due care in the realworld may unintentionally exacerbate existing inequalities, such as unequal access to resources and opportunities.

Hypothesis Testing for Generalized Thurstone Models

Biggs, N. L., Lloyd, E. K., and Wilson, R. J. Graph Theory 1736-1936. Oxford University Press, Oxford, UK, 1976.

Bong, H., Li, W., Shrotriya, S., and Rinaldo, A. Nonparametric estimation in the dynamic Bradley-Terry model. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 3317 3326, Palermo, Italy, August 26-28 2020.

Bradley, R. A. and Terry, M. E. Rank analysis of incomplete block designs. I. The method of paired comparisons. Biometrika, 39(3/4):324 345, December 1952.

Brams, S. J. and Ismail, M. S. Multi-tier tournaments: Matching and scoring players. ar Xiv:2407.13845 [econ.TH], July 2024.

Busa-Fekete, R., Fotakis, D., Sz or enyi, B., and Zampetakis, M. Identity testing for Mallows model. In Advances in Neural Information Processing Systems 34 (Neur IPS), pp. 23179 23190, December 6 14 2021.

Catlin, P. A. Supereulerian graphs: A survey. Journal of Graph Theory, 16(2):177 196, June 1992.

Cattelan, M., Varin, C., and Firth, D. Dynamic Bradley Terry modelling of sports tournaments. Journal of the Royal Statistical Society, Series C (Applied Statistics), 62 (1):135 150, 2013.

Chatterjee, S. Matrix estimation by universal singular value thresholding. The Annals of Statistics, 43(1):177 214, February 2015.

Chen, Y., Fan, J., Ma, C., and Wang, K. Spectral method and regularized MLE are both optimal for top-K ranking. The Annals of Statistics, 47(4):2204 2235, 2019.

Chiang, W.-L., Zheng, L., Sheng, Y., Angelopoulos, A. N., Li, T., Li, D., Zhang, H., Zhu, B., Jordan, M., Gonzalez, J. E., and Stoica, I. Chatbot arena: An open platform for evaluating LLMs by human preference. ar Xiv:2403.04132 [cs.AI], March 2024.

Chierichetti, F., Kumar, R., and Tomkins, A. On the number of trials needed to distinguish similar alternatives. Proceedings of the National Academy of Sciences (PNAS), 119(31), July 2022.

Chung, F. R. K. Spectral Graph Theory, volume 92 of Conference Board of the Mathematical Sciences (CBMS) Regional Conference Series in Mathematics. American Mathematical Society, Providence, RI, USA, 1997.

Clarke, S. R. and Norman, J. M. Home ground advantage of individual clubs in english soccer. Journal of the Royal Statistical Society, Series D (The Statistician), 44(4):509 521, 1995.

Davidson, D. and Marschak, J. Experimental tests of a stochastic decision theory (1959). In Economic Information, Decision, and Prediction, volume 7-1 of Theory and Decision Library, pp. 133 171, Dordrecht, Netherlands, 1974. Springer.

Dawkins, R. A threshold model of choice behaviour. Animal Behaviour, 17(1):120 133, February 1969.

Durrett, R. Probability: Theory and Examples. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, New York, NY, USA, fifth edition, 2019.

Elo, A. E. The Rating of Chessplayers, Past and Present. Arco Publishing, Inc., New York, NY, USA, second edition, 1986.

Fan, J., Hou, J., and Yu, M. Uncertainty quantification of MLE for entity ranking with covariates. Journal of Machine Learning Research, 25(358):1 83, August 2024.

Feder, T. and Subi, C. Packing edge-disjoint triangles in given graphs. Electronic Colloquium on Computational Complexity (ECCC), (13):1 12, February 2012.

Gao, C., Shen, Y., and Zhang, A. Y. Uncertainty quantification in the Bradley-Terry-Luce model. Information and Inference: A Journal of the IMA, 12(2):1073 1140, June 2023.

Haghparast, N. and Kiani, D. Spanning Eulerian subgraphs of large size. Graphs and Combinatorics, 35(1):201 206, January 2019.

Hahn, G. J. and Meeker, W. Q. Statistical Intervals: A Guide for Practitioners. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc., New York, NY, USA, 1991.

Han, R., Tang, W., and Xu, Y. Statistical inference for pairwise comparison models. ar Xiv:2401.08463 [math.ST], January 2024.

Herbrich, R., Minka, T., and Graepel, T. True Skill TM: A Bayesian skill rating system. In Advances in Neural Information Processing Systems 19 (Neur IPS), pp. 1 8, Vancouver, BC, Canada, December 4-9 2006.

Howard, S. R., Ramdas, A., Mc Auliffe, J. D., and Sekhon, J. S. Time-uniform Chernoff bounds via nonnegative supermartingales. Probability Surveys, 17:257 317, 2020.

Hunter, D. R. MM algorithms for generalized Bradley Terry models. The Annals of Statistics, 32(1):384 406, February 2004.

Hypothesis Testing for Generalized Thurstone Models

Ingster, Y. I. Minimax detection of a signal in ℓp metrics. Journal of Mathematical Sciences, 68(4):503 515, February 1994.

Ingster, Y. I. and Suslina, I. A. Nonparametric Goodnessof-Fit Testing Under Gaussian Models, volume 169 of Lecture Notes in Statistics. Springer, New York, NY, USA, 2003.

Jadbabaie, A., Makur, A., and Shah, D. Estimation of skill distribution from a tournament. In Advances in Neural Information Processing Systems 33 (Neur IPS), pp. 8418 8429, Vancouver, BC, Canada, December 6-12 2020.

Jadbabaie, A., Makur, A., and Shah, D. Estimation of skill distributions. IEEE Transactions on Information Theory, 70(9):6447 6480, September 2024.

Jannach, D., Resnick, P., Tuzhilin, A., and Zanker, M. Recommender systems beyond matrix completion. Communications of the ACM, 59(11):94 102, November 2016.

Kirkman, T. P. On a problem in combinations. The Cambridge and Dublin Mathematical Journal, 2:191 204, 1847.

Lauga, N. NBA games data. Kaggle, 2023. URL https://www.kaggle.com/datasets/ nathanlauga/nba-games/data. Accessed: January 1, 2025.

Liu, Y., Fang, E. X., and Lu, J. Lagrangian inference for ranking problems. Operations Research, 71(1):202 223, January 2023.

Luce, R. D. Individual Choice Behavior: A Theoretical Analysis. John Wiley & Sons, Inc., New York, NY, USA, 1959.

Makur, A. Information Contraction and Decomposition. Sc.D. thesis in Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA, May 2019.

Makur, A. and Singh, J. Testing for the Bradley-Terry Luce model. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), pp. 1390 1395, Taipei, Taiwan, June 25-30 2023.

Makur, A. and Singh, J. Minimax hypothesis testing for the Bradley-Terry-Luce model. ar Xiv:2410.08360 [cs.LG], October 2024.

Makur, A. and Zheng, L. Comparison of contraction coefficients for f-divergences. Problems of Information Transmission, 56(2):103 156, April 2020.

Mallows, C. L. Non-null ranking models. I. Biometrika, 44 (1/2):114 130, June 1957.

Mania, H., Ramdas, A., Wainwright, M. J., Jordan, M. I., and Recht, B. On kernel methods for covariates that are rankings. Electronic Journal of Statistics, 12:2537 2577, 2018.

Manole, T. and Ramdas, A. Martingale methods for sequential estimation of convex functionals and divergences. IEEE Transactions on Information Theory, 69(7):4641 4658, July 2023.

Marschak, J. Binary-choice constraints on random utility indicators (1960). In Economic Information, Decision, and Prediction, volume 7-1 of Theory and Decision Library, pp. 218 239, Dordrecht, Netherlands, 1974. Springer.

Mc Cullagh, P. and Nelder, J. A. Generalized Linear Models, volume 37 of Monographs on Statistics and Applied Probability. Chapman and Hall, New York, NY, USA, second edition, 1989.

Mc Fadden, D. Conditional logit analysis of qualitative choice behavior. In Zarembka, P. (ed.), Frontiers in Econometrics, Economic Theory and Mathematical Economics, pp. 105 142, New York, NY, USA, 1973. Academic Press.

Mc Laughlin, D. H. and Luce, R. D. Stochastic transitivity and cancellation of preferences between bitter-sweet solutions. Psychonomic Science, 2(3):89 90, January 1965.

Morley, B. and Thomas, D. An investigation of home advantage and other factors affecting outcomes in english one-day cricket matches. Journal of Sports Sciences, 23 (3):261 268, March 2005.

Negahban, S., Oh, S., and Shah, D. Iterative ranking from pair-wise comparisons. In Advances in Neural Information Processing Systems 25 (Neur IPS), pp. 2474 2482, Lake Tahoe, NV, USA, December 3-8 2012.

Negahban, S., Oh, S., and Shah, D. Rank centrality: Ranking from pairwise comparisons. Operations Research, 65(1): 266 287, January 2017.

Nelder, J. A. and Wedderburn, R. W. M. Generalized linear models. Journal of the Royal Statistical Society, Series A (General), 135(3):370 384, 1972.

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., and Lowe, R. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems 35 (Neur IPS), pp. 27730 27744, New Orleans, LA, USA, November 28-December 9 2022.

Hypothesis Testing for Generalized Thurstone Models

Plackett, R. L. The analysis of permutations. Journal of the Royal Statistical Society, Series C (Applied Statistics), 24 (2):193 202, 1975.

Rastogi, C., Balakrishnan, S., Shah, N. B., and Singh, A. Two-sample testing on ranked preference data and the role of modeling assumptions. Journal of Machine Learning Research, 23(225):1 48, September 2022.

Rudelson, M. and Vershynin, R. Hanson-Wright inequality and sub-Gaussian concentration. Electronic Communications in Probability, 18:1 9, 2013.

Seshadri, A. and Ugander, J. Fundamental limits of testing the independence of irrelevant alternatives in discrete choice. In Proceedings of the 20th ACM Conference on Economics and Computation (EC), pp. 65 66, Phoenix, AZ, USA, June 24-28 2019.

Seshadri, A., Ragain, S., and Ugander, J. Learning rich rankings. In Advances in Neural Information Processing Systems 33 (Neur IPS), pp. 9435 9446, Vancouver, BC, Canada, December 6 12 2020.

Shah, N. B., Balakrishnan, S., Bradley, J., Parekh, A., Ramchandran, K., and Wainwright, M. J. Estimation from pairwise comparisons: Sharp minimax bounds with topology dependence. Journal of Machine Learning Research, 17(58):1 47, 2016.

Simons, G. and Yao, Y.-C. Asymptotics when the number of parameters tends to infinity in the Bradley-Terry model for paired comparisons. The Annals of Statistics, 27(3): 1041 1060, June 1999.

Thurstone, L. L. A law of comparative judgment. Psychological Review, 34(4):273 286, 1927.

Thurstone, L. L. The indifference function. The Journal of Social Psychology, 2(2):139 167, 1931.

Tversky, A. Elimination by aspects: A theory of choice. Psychological Review, 79(4):281 299, July 1972.

Vojnovic, M. and Yun, S. Parameter estimation for generalized Thurstone choice models. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pp. 498 506, New York, NY, USA, June 19-24 2016.

Yellott, Jr., J. I. The relationship between Luce s choice axiom, Thurstone s theory of comparative judgment, and the double exponential distribution. Journal of Mathematical Psychology, 15(2):109 144, April 1977.

Zermelo, E. Die berechnung der turnier-ergebnisse als ein maximumproblem der wahrscheinlichkeitsrechnung. Mathematische Zeitschrift, 29(1):436 460, December 1929.

Zhao, S., Zhou, E., Sabharwal, A., and Ermon, S. Adaptive concentration inequalities for sequential decision problems. In Advances in Neural Information Processing Systems 29 (Neur IPS), pp. 1351 1359, Barcelona, Spain, December 5-10 2016.

Hypothesis Testing for Generalized Thurstone Models

A. Proofs of Proposition 2.4 and Theorems 3.1 and 3.3

A.1. Additional Notation and Preliminaries

We begin by introducing some additional notations and discussing some necessary preliminaries that will be used throughout our proofs. To simplify our notation, we use ˆl(w) to denote l(w; {ˆpij : (i, j) E}) where ˆpij are computed based on the partitioned dataset Z1. Similarly, we use l (w) to denote l(w; {pij : (i, j) E}) where pij are the actual underlying pairwise comparison probabilities. When w = w or w = ˆw, we simplify the notation by using F and ˆF to denote the matrices F(w ) and F( ˆw) (cf. (4)) respectively, for brevity. We say that a random variable X is µ2-sub-Gaussian, if it satisfies the condition, log(E[exp(s X)]) µ2s2/2 for all s R.

Notably, ˆw is computed as in (7) even though the data may not conform to an underlying TF model. Throughout the appendices, we denote various constants using overlapping labels, such as c, c1, c2, . . . to simplify our notation and facilitate readability. Moreover, we define E+ to denote the set {(i, j) E : j > i}.

A.1.1. PRELIMINARIES

Recall that, as defined in (8), w represents the weights of a TF model that best approximates the pairwise comparison model {pij, (i, j) E}. Now, we show that any pairwise comparison model can be converted into its skew-symmetric

counterpart n pij+1 pji

2 , (i, j) E o such that both of them share the same optimal weights.

Lemma A.1 (Skew-symmetrized Model). For a symmetric edge set E, the pairwise model {pij : (i, j) E} and and its

skew-symmetric counterpart n pij+1 pji

2 : (i, j) E o has the have the same optimal TF weights w as defined in (8).

Proof: Note that the weighted negative log-likelihood objective can be written as

l (w) = arg min w W:w T1=0 X

(i,j) E pij log(F(wi wj)) + (1 pij) log(1 F(wi wj))

ζ1= arg min w W:w T1=0 X

(i,j) E pij log(1 F(wj wi)) + (1 pij) log(F(wj wi))

ζ2= arg min w W:w T1=0 X

pij + 1 pji

log(F(wi wj)) + 1 pij + pji

log(F(wj wi))

where ζ1 follows since F( x) = 1 F(x) and ζ2 follows by adding the first two equations and dividing by two.

Therefore, for any pairwise comparison model {pij : (i, j) E}, we can define its skew-symmetrized counterpart {qij : (i, j) E} where

(i, j) E, qij pij + 1 pji

We call these transformed probabilities qij as skew-symmetrized probabilities because we have qij + qji = 1, and thereby this transformation effectively removes any distinctions between i vs. j and j vs. i comparisons. Also, note that for any pairwise comparison model satisfying Assumption 2.3, its skew-symmetrized model also satisfies it. In a similar manner, we can define ˆqij = ˆpij+1 ˆpji

2 as the skew-symmetrized version of the empirical probabilities. With this notation in place, we are ready to state the proof of Proposition 2.4 below.

A.2. Proof of Proposition 2.4

Uniqueness: The uniqueness of w follows directly from the strong log-concavity of F( ). This is because if v , w Wb are any two non-unique solutions of (8) such that l (v ) = l (w ), then by strong log-concavity of F and the fact that qij > 0 for (i, j) E along with connectedness of graph, for any θ (0, 1), we have

θl (v ) + (1 θ)l (w ) = 2θ X

(i,j) E qij log(F v i v j ) 2(1 θ) X

(i,j) E qij log(F(w i w j ))

Hypothesis Testing for Generalized Thurstone Models

(i,j) E qij log F θ v i v j + (1 θ) w i w j

= l (θv + (1 θ)w ).

This gives a contradiction since θv + (1 θ)w achieves a higher likelihood (or a lower objective value). The existence of w under Assumption 2.3 and finite b follows from the extreme value theorem since a continuous function is being optimized over a compact set. Notably, the existence of w also holds for disconnected graphs. As a bonus, we provide a proof of existence even when the parameter b = , but for a connected graph G.

Existence: Now, we will utilize the connectedness of graph G and Assumption 2.3 to show the existence of w . Define a sequence {w(m) Rn : m N {0}} as

w(m) = argmin w1=0: w m l (w).

Clearly, w(m) exists as the optimization of a convex function l ( ) is being performed over a compact set. Define the following sets as the components of w that potentially diverge to :

S+ = i [n] : lim sup m

i = + , S = n i [n] : lim inf m w(m)

We will show that S+ = S = . Notably, if S+ = , then we consider the partition of [n] as S+ Sc +. Clearly, 1 Sc + = . Since the observation graph G is connected, for some i S+ there exists j Sc + such that qji > 0 (by Assumption 2.3). This implies that qji log(F(w(m) j w(m) i )) + as m + . Hence, we can find a constant A > 0 such that on the set {wi wj A}, we have

qji log(F(wi wj)) > l (w(0)).

Equivalently, for any w with wi wj > A, we have

l (w) qji log(F(A)) > l (w(0)) l (w(m)),

where the first inequality follows since each term in l ( ) is non-negative. Therefore, we must have w(m) i w(m) j + A for all k N. Since i S+, it follows that j S+ by definition, which contradicts the assumption that j Sc +. Hence, we conclude that S+ = . A similar argument shows that S = . The fact that S+ = S = implies that the sequence {w(m) : m N {0}} admits a convergent subsequence, which proves the existence of w .

A.3. Proof of Theorem 3.1

The upper bound is trivial to prove

inf w Wb P F(w) F P F(w ) F = P F F,

where F is the pairwise probability matrix associated with the optimal weights. Now to prove the lower bound, observe that

(i,j) E (pij F(wi wj))2

inf w Wb F( 2b)(1 F( 2b)) X

(pij F(wi wj))2

F(wi wj)(1 F(wi wj))

ζ1= c inf w Wb

(i,j) E χ2(Bernoulli(pij) Bernoulli(F(wi wj)))

ζ2 c inf w Wb

(i,j) E DKL(Bernoulli(pij) Bernoulli(F(wi wj)))

= c inf w Wb

(i,j) E pij log pij F(wi wj)

+ (1 pij) log 1 pij 1 F(wi wj)

Hypothesis Testing for Generalized Thurstone Models

(i,j) E pij log pij F(w i w j )

+ (1 pij) log 1 pij 1 F(w i w j )

(i,j) E Bernoulli(pij) Bernoulli(F(w i w j )) 2 TV

(i,j) E (pij F(w i w j ))2,

where, in ζ1 we set c = F( 2b)(1 F( 2b)) and χ2( || ) denotes the χ2-divergence between two Bernoulli random variables and in ζ2 we utilize the fact that χ2(R||Q) DKL(R||Q) for two distributions R and Q and where DKL( || ) denotes the Kullback-Leibler (KL) divergence between two distributions, ζ3 follows since w are the optimal weights maximizing (8) for the TF model. Finally, ζ4 follows by Pinsker s inequality DKL(R||P) 2 R P 2 TV and thereby completing the proof.

A.4. Proof of Theorem 3.3

We begin by recalling the definition of ˆw arg minw Wb ˆl(w) in terms of the symmetrized probabilities qij defined in (17) as ˆl(w) = 2 X

(i,j) E+ ˆqij log F(wi wj) + (1 ˆqij) log(1 F(wi wj)).

Observe that since ˆw is an optimal solution and w is a feasible point for the problem in (7), therefore we have ˆl( ˆw) ˆl(w ). Moreover, since w is the optimal solution of a convex function l (w), therefore we have the optimality condition l (w )T(w w ) 0 for all w Wb. Now, by subtracting the quantity ˆl(w )T( ˆw w ) from both sides of ˆl( ˆw) ˆl(w ) gives

ˆl( ˆw) ˆl(w ) ˆl(w )T( ˆw w ) ˆl(w )T( ˆw w ) (18)

ζ1 ( ˆl(w ) l (w ))T( ˆw w )

ζ2 ˆl(w ) l (w ) L ˆw w L, (19)

where ζ1 follows since l (w )T(w w ) 0 for all w W and ζ2 follows from (Shah et al., 2016, Lemma 16) where L is the semi-norm induced by the Laplacian matrix L of graph G and L is the Moore-Penrose pseudoinverse of L. Now observe that by the chain rule, the Hessian of ˆl is given by

2ˆl(w) = 2 X

dt2 log(F(t))|t=w Txij + (1 ˆqij) d2

dt2 log(1 F(t))|t=w Txij

Since by our assumption that F(t) is α-strongly log-concave on the set [ 2b, 2b], this implies d2

dt2 log(F(t)) α. Moreover, since F( t) = 1 F(t), we also have d2

dt2 log(1 F(t)) α for all t [ 2b, 2b]. Therefore, for any v Rn

with v T1 = 0, we have v T 2ˆℓ(w )v 2α Xv 2 2 = 2α v 2 L.

Thus, by definition of strong-convexity, the left side of (18) can be lower bounded by α ˆw w 2 L. Therefore, utilizing the bound in (19), we obtain the following inequality

α ˆw w 2 L ˆl(w ) l (w ) L ˆw w L.

Canceling ˆw w L and squaring both sides leads to the following error bound on ˆw w L as

α2 ˆl(w ) l (w ) 2 L . (20)

Now, it remains to bound the term ˆl(w ) l (w ) L . Note that we can express the respective quantities as:

ˆl(w ) = 2 X

ˆqij F (w Txij)

F(w Txij) (1 ˆqij) F (w Txij) 1 F(w Txij)

Hypothesis Testing for Generalized Thurstone Models

(ˆqij F(w i w j ))F (w i w j ) F(w i w j )(1 F(w i w j )) xij, (21)

l (w ) = 2 X

(qij F(w i w j ))F (w i w j ) F(w i w j )(1 F(w i w j )) xij. (22)

Therefore, subtracting the two equations gives

ˆl(w ) l (w ) = 2 X

(ˆqij qij)F (w i w j ) F(w i w j )(1 F(w i w j ))xij = 2XTv, (23)

where v R|E|/2 is a vector formed by entries vij for (i, j) E+. and quantities vij are defined as

vij (ˆqij qij) F (w i w j ) F(w i w j )(1 F(w i w j )).

Note that the entries of the vector v are independent and have a mean of zero. Furthermore, we also have:

sup x [ 2b,2b]

F (x) F(x)(1 F(x)) β F( 2b)(1 F( 2b)) β.

Additionally, for any (i, j) E+, an application of the Hoeffding s inequality on ˆqij qij yields the following tail bound

t > 0, P(|ˆqij qij| > t) = P 1

m=1 (Zm ij pij) +

m=1 (Zm ji pji) > t

2 exp( 2kt2).

Consequently, v is a vector whose each entry is independent with zero mean and β2

4k -sub-gaussian. Now, observe that we can express ˆl(w ) l (w ) 2 L in quadratic form as

ˆl(w ) l (w ) 2 L = 4v TXL XTv. (24)

Now, combining (20) and (24) we can upper-bound E[ ˆw w 2 L] as

E[ ˆw w 2 L] 1

α2 E[ ˆl(w ) l (w ) 2 L ]

α2 E[v TXL XTv]

kα2 tr(XL XT) = (n 1) β2

where tr denotes the trace operator and we have tr(XL XT) = tr(L XTX) = tr(L L) = n 1. Hence, by an application of Hanson-Wright inequality (Rudelson & Vershynin, 2013) combined with usage of (20) and (24) as above, we have the following concentration bounds on ˆw w 2 L:

t > 0, P ˆw w 2 L n β2

kα2 > t 2 exp

β4 XL XT 2 F , tkα2

= 2 exp c min t2k2α4

β4(n 1) , tkα2

Hence, by a simple calculation, we can conclude that for some constant c, we have

for all t 1, P ˆw w 2 L > tcn β2

Hypothesis Testing for Generalized Thurstone Models

Bounding the pth moment: Let A denote the quantity: A = q

cn β2/kα2. Now the bound on the pth moment is obtained by integration and utilizing the tail bound in (26) as

E[ ˆw w p L] = p Z

0 tp 1P( ˆw w L > t) dt

0 tp 1P( ˆw w L > t) dt + p Z

A tp 1P( ˆw w L > t) dt

0 tp 1 dt + p Z

1 (At)p 1P( ˆw w L > t A)A dt

Ap + p Ap Z

Substituting the value of A in the above expression completes the proof.

B. Proof of Theorem 3.2

We begin by recalling the test statistic T from (15) as

k ij(k ij 1) + F( ˆwi ˆwj)2 2Zij

k ij F( ˆwi ˆwj).

where ˆw is calculated based on the data in Z1 and Zij are calculated based on the data in Z2. The expected value of T conditioned on the weights ˆw or equivalently conditioned on the data Z1 is given by

E[T | Z1] = X

" Zij(Zij 1)

k ij(k ij 1) |Z1

+ F( ˆwi ˆwj)2 2E

F( ˆwi ˆwj)

(i,j) E p2 ij + F( ˆwi ˆwj)2 2pij F( ˆwi ˆwj)

(i,j) E (pij F( ˆwi ˆwj))2, (27)

where in ζ we have utilized the fact that E h Zij(Zij 1)

k ij(k ij 1) |Z1 i = p2 ij. Hence, the expected value of T is given by

E[T] = E[E[T|Z1]] = X

(i,j) E E[(pij F( ˆwi ˆwj))2]

(i,j) E (pij F(w i w j ))2 + E[(F(w i w j ) F( ˆwi ˆwj))2]

+ 2E[(pij F(w i w j ))(F(w i w j ) F( ˆwi ˆwj))]

P F 2 F + E[ F ˆF 2 F] + 2 P F FE[ F ˆF F], (28)

where F, ˆF Rn n are matrices defined in Appendix A.1. In order to find bounds on estimation error (such as terms like E[ F ˆF p F]), we will utilize our simplifying assumption that kij = 2k for all (i, j) E. Now the ensuing lemma provides the bounds on the pth moments E[ F ˆF p F].

Lemma B.1 (pth Moment Bound). For matrices F and ˆF defined as in Appendix A.1 and p 1, there exists a constant cp, possibly dependent on p, α, β, b, such that the following bound holds on the pth moment of the Frobenius norm E F ˆF p F :

E h F ˆF p F i cp n

Moreover, there exists a constant c such that we have the following tail bound:

t 1, P F ˆF 2 F tcn

Hypothesis Testing for Generalized Thurstone Models

The proof is provided in Appendix B.1. Thus, utilizing Lemma B.1 and (28), we have obtain the following bound for some constant c1 and c2:

E[T] P F 2 F + c2 n k + 2c1

Let EH0[ ] and EH1[ ] denote the expectation operators under hypotheses H0 and H1, respectively. In essence, we have established the following bounds on EH0[T]: |EH0[T]| c2 n k , (29)

In a similar manner, we can obtain complementary lower bounds on EH1[T] (cf. (28)). Consequently, we have the following lower bound on EH1[T]:

EH1[T] P F 2 F 2c1

k P F F. (30)

Bounding variance: Now we will find bounds on var(T) under the two hypotheses. For this, we will make use of the law of total variance by conditioning T with respect to Z1 as

var(T) = E[var(T | Z1)] + var(E[T | Z1]). (31)

First, we will examine the term E[var(T|Z1)]. Note that conditioned on Z1, the term F( ˆwi ˆwj)2 is constant and does not contribute to var(T|Z1). Moreover, Zij for (i, j) E are mutually independent, and hence, we can analytically find the expression for var(T|Z1) as

var(T | Z1) ζ1= X

(i,j) E var

k ij(k ij 1)

+ 4F( ˆwi ˆwj)2var

4F( ˆwi ˆwj)

E Z2 ij(Zij 1)

(k ij)2(k ij 1) E[Zij(Zij 1)]

k ij(k ij 1) E[Zij]

2p2 ij + 4(k ij 2)p3 ij + (6 4k ij)p4 ij k ij(k ij 1) + 4F( ˆwi ˆwj)2pij(1 pij)

4F( ˆwi ˆwj)2(p2 ij p3 ij) k ij ,

where ζ1 follows from the variance of sum technique and ζ2 follows from the expressions for the first four moments of Binomial random variables and some basic algebra. Now, in order to bound E[var(T | Z1)], we will substitute all k ij = k for all (i, j) E and simplify the above expression as:

E[var(T | Z1)] = X

2p2 ij 4p3 ij + 2p4 ij k(k 1)

+ pij(1 pij) 4p2 ij k + 4E[F 2( ˆwi ˆwj)]

k 8E[F( ˆwi ˆwj)]pij

2p2 ij(1 pij)2

k(k 1) + 4pij(1 pij)

k (pij E[F( ˆwi ˆwj)])2

ndmax 8k(k 1) + 1

k P E[ˆF] 2 F

ndmax 8k(k 1) + 1

k ( P F F + F E[ˆF] F)2

ndmax 8k(k 1) + 1

k ( P F F + E[ F ˆF F])2

ndmax 8k(k 1) + 1

Hypothesis Testing for Generalized Thurstone Models

where dmax = maxi [n] P

j [n]\i Eij is the maximum degree of the graph, and the last inequality follows from Lemma B.1. Now we will bound the second term of (31), i.e., var(E[T|Z1]). Recall that by (27), we have E[T|Z1] = P

(i,j) E(pij F( ˆwi ˆwj))2. Therefore, we upper bound var(E[T | Z1]) as

var(E[T | Z1]) = var X

(i,j) E (pij F( ˆwi ˆwj))2 = var( P ˆF 2 F)

= E[ P ˆF 4 F] E[ P ˆF 2 F]2

E[( P F F + F ˆF F)4] E[( P F F F ˆF F)2]2,

where the last inequality follows from the triangle inequality in the first term and the reverse triangle inequality on the second term. Under hypothesis H0, the above expression simplifies trivially as

var H0(E[T | Z1]) c4

where var Hl( ) denotes the variance under hypothesis l for l {0, 1}. Now, we turn our attention to bounding var H1(E[T | Z1]). This bound can be established through a relatively mechanical process described as follows

var H1(E[T|Z1]) P F 4 F + 4 P F 3 FE[ F ˆF F] + 6 P F 2 FE[ F ˆF 2 F]

+ 4 P F FE[ F ˆF 3 F] + E[ F ˆF 4 F]

( P F 2 F + E[ F ˆF 2 F] 2 P F FE[ F ˆF F])2

= 8 P F 3 FE[ F ˆF F] + 4 P F 2 F(E[ F ˆF 2 F] E[ F ˆF F]2)

+ 4 P F F(E[ F ˆF 3 F] + E[ F ˆF 2 F]E[ F ˆF F])

+ E[ F ˆF 4 F] E[ F ˆF 2 F]2

8c1 P F 3 F

k + 4c2 P F 2 F n k

4(c3 + c2c1) P F F n

Thus, by combining Equations (32) and (33) and (34) we obtain the following bounds on var H0(T) and var H1(T) based on (31)

var H0(E[T]) ndmax 8k(k 1) + c2 1 n k2 + c4

var H1(E[T]) ndmax 8k(k 1) + 1

2 + 8c1 P F 3 F

+ 4c2 P F 2 F n k + 4 c3 P F F

We define the decision rule as follows: select hypothesis H1 if the test statistic T exceeds the threshold:

Select H1 if : T > γ n

k + c2 n k , (37)

where γ is a suitably chosen constant (selected below). Consequently, we can employ the one-sided Chebyshev s inequality

Hypothesis Testing for Generalized Thurstone Models

to bound the probability of error under hypothesis H0, yielding:

PH0 T > γ n

= PH0 T EH0[T] > γ n

k + c2 n k EH0[T]

PH0(T EH0[T] > γ n

var H0(T) var H0(T) + γ2( n

4k2 + c2 1 n k2 + c4( n

4k2 + c2 1 n k2 + c4( n

k )2 + γ2( n

4n + c2 1 1 n + c4 dmax

4n + c2 1 1 n + c4 + γ2 1

where the last bound holds by the fact that dmax n, and for an appropriate constant γ max{4 c4, 4, 4c1/ n}.

Now, we will find an upper bound on the probability of error under hypothesis H1. Observe that an error is made under H1 if the value of the test statistic T γ n

T EH1[T] γ n

k + c2 n k EH1[T]

ζ1 P T EH1[T] γ n

k + c2 n k + 2c1

k P F F P F 2 F

ζ2 var H1(T) var H1(T) + (D2 )2

where ζ1 follows from (30), ζ2 follows by one-sided Chebyshev inequality and defining D = P F F and = γ n

k + 2c1 p n

k D. The step ζ3 holds if 4var H1(T) (D2 )2 or equivalently if D2 2 p

var H1(T) + . Using the sub-additivity of operator, the following condition necessitates that for this to be true:

4 + 2 c2D rn

k + c2 n k + 2c1

Substituting D = a0 p n

k in the above expression, for some a0, we obtain:

a2 0 dmax n + 2(a0 + c1)

2c1a3/2 0 + 4 c2a0 + 4 p

2 c3a0 + 2c4 + γ + c2 + 2a0c1.

Now, we can conclude that the above expression is true for some large enough constant a0 independent of n and k, thus establishing that for D = a0 p n

k and a0 large enough our decision rule achieves a type I and type II sum error of at most 1/2. Utilizing Theorem 3.1 we obtain that infw Wb P F(w) F = Θ( P F F). Combining this fact along with the definition of critical threshold (cf. (14)), we have the following bound on εc:

This completes the proof.

Notably, from the above result, we have that εc 0 as n or k goes to infinity. Therefore, for any fixed n and ϵ > 0, our decision rule is guaranteed to achieve a non-trivial minimax risk (strictly less than 1) for any pairwise comparison model P in the class M0 or M1(ϵ) as long as the number of observed samples for each pair (i.e. k) are sufficiently large.

Hypothesis Testing for Generalized Thurstone Models

Moreover, there do exist pairwise comparison models {pij : (i, j) E} whose (normalized) separation is constant with n. Consequently, for such models, we can argue that for any fixed k and n large enough, our decision rule will achieve a non-trivial minimax risk. One such example of a pairwise comparison model represented by its pairwise comparison matrix (on a complete graph) is

2 + η (11T I), for any η 0, 1

It is easy to verify that for this comparison model, we must have infw Wb 1 n P F(w) F η. This is because any matrix F(w) must satisfy the constraint (F)ij + (F)ji = 1 for every i = j, which immediately leads to the lower bound of η on the separation distance.

B.1. Proof of Lemma B.1

Observe that by definition of F and ˆF, we have X

(i,j) E (F(w i w j ) F( ˆwi ˆwj))2 β2 X

(i,j) E (|(w i w j ) ( ˆwi ˆwj)|)2

4β2 ˆw w 2 L. (38)

Taking the power p/2 on both sides and then taking the expectation, we obtain

E[ F ˆF p F] 2pβp E[ ˆw w p L] cp n

where the last inequality follows by plugging in the bounds on pth moment from Theorem 3.3 and absorbing the constants α, β in cp. The tail bound follows directly from (38).

C. Proof of Theorem 3.4

Without loss of generality, we assume that the graph G is Eulerian. If not, we can reduce the problem to an Eulerian graph by considering the largest Eulerian spanning sub-graph G of G so that every vertex of G has even degree, which exists by our assumption that G is super-Eulerian.

Under the null hypothesis, we assume that the pairwise comparison model P corresponds to equal utilities for all items, i.e., pij = 1

2, (i, j) E and let P0 denote the probability measure corresponding to this pairwise comparison model P. Under the alternative hypothesis, we will set our pairwise comparison model R to be a perturbed version of P (sharing the same observation graph G). Specifically, every perturbation will have the following property:

(i, j) E, rij = 1

2 + ηbij, where η 0, 1

2 δ , bij { 1, 1}, bij + bji = 0, i [n],

j:(i,j) E bij = 0, i [n], (39)

where bij represents the signs of the perturbation by parameter η. Note that we set bij = 0 for (i, j) / E. Let any such sequence of perturbations bij is represented by a matrix B { 1, 0, 1}n n.

As we delve deeper into the perturbation structure, we will carefully select a subset of perturbations B satisfying the constraints in (39), as well as additional constraints to be specified later

B bij { 1, 1} for (i, j) E : bij + bji = 0, X

j:(i,j) E bij = 0, i [n] . (40)

Based on this selection, under the alternative hypothesis, let the pairwise comparison model R be generated from a mixture distribution such that R = P + ηB and B Unif(B), i.e., R is generated by adding the perturbation sequence selected uniformly at random from B. Let PB denote the measure corresponding to the overall mixture distribution.

As we examine the perturbation structure, we make our first observation: for any perturbation B satisfying (39), the corresponding pairwise comparison model R belongs to the class M1(ϵ), for some ϵ as a function of η. Specifically, we will show that the perturbation B guarantees a minimum separation distance of ϵ from the class of M0.

Hypothesis Testing for Generalized Thurstone Models

Bounding separation: Our first observation is that any such perturbation R = P + ηB has a sufficiently large (and more importantly, tractable) separation distance. In order to lower bound this separation distance, we will utilize Theorem 3.1. But first, we need to find the optimal TF weights w (as in (8)). This is addressed in the following lemma.

Lemma C.1 (Optimal Weights for Perturbed Matrix). For the TF model and for any B B defined as in (40), the perturbed pairwise comparison matrix P + ηB has a unique optimal TF weights given by w = 0 (all zeros vector).

The proof is provided in Appendix C.2. We now utilize Theorem 3.1 to obtain the following lower bound on separation distance as

inf w Wb P + ηB F(w) F c1 P + ηB F(0) F

where c1 is the lower bound constant in Theorem 3.1 and the last equality follows sinde (P)ij = (F(0))ij = 1/2 for all (i, j) E. Therefore, we have nϵ c1η p

|E| by definition of M1(ϵ).

Having established a lower bound on the separation distance for each of the perturbations, our next step is to carefully select a special subset B of perturbations that allows us to approximate the degrees of freedom in the structure of our perturbation set, while also taking into account the constraints imposed by the graph topology.

To this end, we leverage the assumption that our observation graph G is Eulerian, meaning every node has an even degree. This property enables us to decompose G into a collection of edge-disjoint cycles, denoted by C. In addition, we introduce a comparison incidence graph GI, which represents the comparison structure as an undirected bipartite graph. This graph has n item nodes on one side and |E|/2 nodes on the other side, each representing a pairwise comparison (i, j) E for j > i. The edges in GI connect items to their respective comparison nodes. Since every node in G has an even degree, this ensures that the incidence graph GI is Eulerian, and therefore GI also has a cycle decomposition denoted by CI. Notably, each cycle in GI is of even length and we can establish a one-to-one correspondence between the cycles in C and CI. Now, we orient the edges in the undirected comparison incidence graph GI based on the values bij in the perturbation B. Specifically, we will orient the edges in GI as follows: if bij = 1, the edge will point from the item node to the comparison node for pair (i, j), and if bij = 1, the edge will have the opposite direction. The constraints in (40) ensure that each node in GI has equal in-degree and out-degree.

To specify the construction of B, we consider any fixed cycle decomposition CI (since it may not be unique). Let the number of cycles in the cycle decomposition be denoted by |CI|. Let σi CI represents the ith cycle in CI and |σi| denotes the length of this ith cycle. Observe that we can independently orient the edges of any cycle σi CI in either clockwise or counterclockwise direction, yielding 2|CI| distinct Eulerian orientations for GI. We then construct the structured collection of perturbations B by associating each perturbation with one of the 2|CI| distinct Eulerian orientations of the cycle decomposition CI. This establishes a one-to-one correspondence between valid perturbations in B and distinct Eulerian orientations of CI. Thus, in summary we define B correspoding to decomposition CI as

B bij { 1, 1} : bij + bji = 0, (i, j) E, X

j:(i,j) E bij = 0, i [n],

|bl bl+1| = 2, l σi, σi CI

where, l is used for indexing sequential edges of the cycle σi.

Bounding risk: Now, we will utilize the Ingster-Suslina method to compute lower bound on R(G, ϵ) (cf. (13)). The standard testing inequality by Le Cam (Ingster & Suslina, 2003) states that

R(G, ϵ) 1 P0 PB TV 1 p

χ2(PB||P0). (41)

We calculate the chi-squared divergence χ2(P0||PB) by expressing it as an expectation with respect to two independent pairwise models corresponding to permutations B and B drawn independently and uniformly at random from B as

χ2(PB||P0) = EB,B Unif(B)

Hypothesis Testing for Generalized Thurstone Models

We will now leverage the tensorization property of 1+χ2(P||Q), which enables us to decompose the chi-squared divergence between product distributions into a product of individual divergences. Specifically, for distributions P1, Q1, . . . , Pn, Qn, we have

i=1 (1 + χ2(Pi||Qi)).

Consequently, the chi-squared divergence χ2(PB||P0) simplifies as

1 + χ2(PB||P0) =

EB,B Unif(B)

2 + ηbij)m( 1

2 ηbij)k m k m ( 1

2 + ηb ij)m( 1

2 ηb ij)k m k m 1

= EB,B Unif(B)

2 + ηbij)m( 1

2 ηbij)k m( 1

2 + ηb ij)m( 1

2 ηb ij)k m 1

We direct our attention to the (i, j)th term of the product in (42), for (i, j) E and denote it as h(bij, b ij)

h(bij, b ij) =

2 + ηbij)m( 1

2 ηbij)k m( 1

2 + ηb ij)m( 1

2 ηb ij)k m 1

Now since we have bij, b ij { 1, 1}, therefore whenever bij and b ij agree, by (43) we have h(1, 1) = h( 1, 1). And moreover, we can calculate h(1, 1) as

h(1, 1) = 2k k X

4 + η2 + η m 1

4 + η2 η k m

= 1 + 4η2 k k X

2 + η 1 2 + 2η2

2 η 1 2 + 2η2

= (1 + 4η2)k.

Additionally, by (43) we also have h(1, 1) = h( 1, 1) and it simplifies to

h(1, 1) = 2k k X

2 η 2k = (1 4η2)k.

For any two perturbations B, B Unif(B), let random variables A1 denotes the number of agreements between B, B

respectively, i.e., number of (i, j) E+ where bij = b ij in randomly drawn permutation B and B . And similarly let A2 denotes the number of disagreements between B, B i.e., number of (i, j) E+ where bij = b ij. Consequently, we obtain

1 + χ2(PB||P0) = EB,B Unif(B)

h(1, 1)2A1h(1, 1)2A2

= EB,B Unif(B)

(1 + 4η2)2k A1(1 4η2)2k A2

EB,B Unif(B)

exp(8η2k(A1 A2)) . (44)

Hypothesis Testing for Generalized Thurstone Models

In addition, we define vectors a, a { 1, 1}|CI| to represent the orientations of the |CI| cycles in GI induced by B. The subsequent calculation will now be used to complete the proof:

χ2(PB||P0) + 1 ζ1 1 22|CI| X

B,B exp 8η2k(A1 A2) ζ2= 1 22|CI| X

σi CI |σi|aia i

σi CI exp 8η2k|σi|aia i #

σi CI E exp 8η2k|σi|aia i

2 exp 8η2k|σi| + 1

2 exp 8η2k|σi|

exp 32η4k2(|σi|)2 = exp

σi CI |σi|2 !

where ζ1 follows from (44) and the fact that there are 2|CI| perturbations which are sampled uniformly from B, ζ2 follows from definition of ai and the fact that number of agreements/disagreements can be represented in terms of the agreements/disagreements of the cycle orientations ai, a i. ζ3 follows from the fact that the orientations of the cycles are independent of one another. Finally, utilizing the fact that c1η p

|E| nϵ and by combining the resulting bound along with (41) and the fact that cycle lengths in GI are twice the size in G completes the proof.

C.1. Proof of Proposition 3.5

Part 1: For a complete graph, the comparison incidence graph GI has n item nodes and n(n 1)

2 comparison nodes. When n is odd, all nodes have an even degree equal to n 1; therefore, G is Eulerian. Notably, n can take the forms n = 6m + 1, n = 6m + 3, and n = 6m + 5, where m N. As established by (Kirkman, 1847), for n = 6m + 1 and n = 6m + 3, G can always be decomposed into cycles of length 3. Meanwhile, for n = 6m + 5, G can be decomposed into a cycle of length 4 and remaining cycles of length 3 (Feder & Subi, 2012). Therefore, we have |E|2 = O(n4) and P

σ C |σ|2 = O(n2), giving ε2 c = Ω(1/nk)

Part 2: Consider a d-regular graph with constant even degree d. The associated comparison incidence graph has n item nodes and nd/2 comparison nodes. By applying (Seshadri & Ugander, 2019, Lemma 9), we can decompose the edge set of the comparison incidence graph into cycles of size at most 2 log2(n) , with at most min{2n + nd, 4n} = 4n edges remaining. Since the graph is Eulerian, removing cycles does not affect this property. Therefore, the remaining 4n edges can be further decomposed into cycles of length at most 2n (since cycles can have a maximum length of 2n, and this reflects a worst-case scenario). This yields P

σ C |σ|2 = O(n2), which in turn implies ε2 c = Ω(1/n2k).

Part 3: For graphs comprising a single cycle, it is easy to verify that the number of cycles is 1 and the cycle has a length of n.

Part 4: For a toroidal grid of size n n, we can generate a cycle decompostion of G consisting of n horizontal edges and n vertical edges. Clearly, each of these edges has a length of n. Therefore, P σi C |σi|2 = 2n n. And since it is a toroidal grid we have |E| = 2n. Plugging in these values we obtain ε2 c = Ω(1/n7/4k).

C.2. Proof of Lemma C.1

To find the optimal weights w , our objective is to solve the following optimization problem with parameter b :

l (w) = min w Wb X

(i,j) E rij log(F(wi wj)) + (1 rij) log(1 F(wi wj)).

Our initial observation is that, due to the skew-symmetrization of the model rij + rji = 1, we can express the gradient of l (w) as:

( l (w))i = 2 X

j:(i,j) E (rij F(wi wj)) F (wi wj) F(wi wj)(1 F(wi wj)).

Hypothesis Testing for Generalized Thurstone Models

Furthermore, the gradient is zero at w = 0. To see this, note that for all i [n], we have:

( l (w))i|w=0 = 2 X

j:(i,j) E (1

2 + ηbij F(0)) F (0) F(0)(1 F(0)) = 8ηF (0) X

j:(i,j) E bij = 0,

where the last step is followed by our construction of the perturbation sequence in (39). Considering that the gradient is zero at w = 0 and the optimization objective is convex over Wb (in fact strongly convex over Wb), coupled with the uniqueness of the optimal TF weights as indicated by Proposition 2.4, we conclude that w = 0 is indeed the optimal and unique solution for any b 0.

D. Proofs of Time-Uniform Bounds on Probabilities of Type I and II Errors

In this appendix, we will establish bounds on type I and II error probabilities as described in Theorem 3.7. First, we will introduce essential notation to facilitate our analysis and present our proof of Proposition 3.6. Then, we will establish a few auxiliary lemmata such as which are needed to derive the bounds on type I and type II errors. Finally, combining these results, we will proof of Theorem 3.7 in Appendix D.3 and a few corollaries based on these results in Appendix E.

Additional notation: We introduce Y l ij for l N and (i, j) E to denote the observed comparisons that are used for estimating Zij = Pk l=1 Zk1+l ij , i.e., we let Y l ij = Zk1+l ij for l [k]. Also, define the statistic Y k ij Pk m=1 Y m ij . Moreover, define 1n as an all-ones vector of length n and In as the identity matrix of size n n.

D.1. Proof of Proposition 3.6

We will focus on the following sequence { T k ij, k N \ {1}} defined as

T k ij Y k ij Y k ij 1

k(k 1) + b2 ij 2bij Y k ij k .

Note that with bij = F( ˆwi ˆwj), the term T k ij reduces to the (i, j)th term of the test statistic T k1,k (based on the notation defined above) and we will now show that it is indeed a reverse martingale. To do this, we will demonstrate that both the

terms Y k ij( Y k ij 1) k(k 1) and Y k ij k are indeed reverse martingales. First, we focus on the former term. Observe that we can write the product Y k ij Y k ij 1 as

Y k ij Y k ij 1

k(k 1) = (Pk m=1 Y m ij )2 Pk m=1 Y m ij k(k 1) = 1 k(k 1)

l,m=1:l =m Y l ij Y m ij ,

where the last equality follows because Pk m=1 Y m ij Y m ij 1 = 0 as Y m ij {0, 1}. Also, observe that E Y m ij Y l ij | Fk+1 = E Y m ij Y r ij | Fk+1 for l = m = r and where Fk is the canonical reverse filtration defined as the sigma algebra generated

(i,j) E σ Y k ij k , Y k+1 ij , Y k+2 ij , . . . . This is because for any set A Fk+1 and l, m, r [k] and l = m = r, we

have E Y m ij Y l ij1A = E Y m ij Y r ij1A and E Y m ij 1A = E Y l ij1A .

Utilizing the above relations, we can show that Y k ij( Y k ij 1) k(k 1) is indeed a reverse-martingale as:

" Y k ij Y k ij 1

k(k 1) | Fk+1

"Pk l,m=1:l =m Y l ij Y m ij k(k 1) | Fk+1

Pk+1 l,m=1:l =m E Y l ij Y m ij | Fk+1

"Pk+1 l,m=1:l =m Y l ij Y m ij k(k + 1) | Fk+1

" Y k+1 ij Y k+1 ij 1

k(k + 1) |Fk+1

= Y k+1 ij Y k+1 ij 1

Hypothesis Testing for Generalized Thurstone Models

where the last equality follows since Y k+1 ij is measurable with respect to Fk+1. Similarly, we can also show that Y k ij k is also a reverse martingale as:

" Y k ij k |Fk+1

Pk m=1 E Y m ij | Fk+1

Pk m=1 E Y m ij | Fk+1

" Y k+1 ij k + 1|Fk+1

Finally, the proposition follows by the linearity of conditional expectation and substituting bij = F( ˆwi ˆwj) as:

E T k1,k | Fk+1 = E

(i,j) E T k ij | Fk+1

" Y k ij Y k ij 1

k(k 1) | Fk+1

+ b2 ij 2bij E

" Y k ij | Fk+1

Y k+1 ij Y k+1 ij 1

k(k + 1) + b2 ij 2bij Y k+1 ij k + 1 = X

(i,j) E T k+1 ij = T k1,k+1.

This completes the proof.

D.2. Intermediate Lemmata

In order to derive the proof of Theorem 3.7, we will first prove the following intermediate lemma that gives bounds on type I and type II errors where the threshold is a function of the estimated weights ˆw. Additionally, the lemma relies on a variant of the Hanson-Wright inequality that is time-uniform (see Lemma D.2) and specialized to our specific setting.

Lemma D.1 (Conditional Bounds on Type I and II Error Probabilities). For any α (0, 1] and for ˆw estimated from the first k1 pairwise comparison for each pair in E, there exist a constant c such that for ν (0, 1/e), the following bounds hold under hypothesis H0 and H1, respectively:

k 2, T k1,k F ˆF 2 F + c

|E| k ℓk,ν + 4 F ˆF F

k 2, T k1,k P F 2 F F ˆF 2 F 2 F ˆF F P F F

|E| k ℓk,ν 4 P F F + F ˆF F

where ℓk,ν = log 3.5 log2(k)2/ν .

Proof: Part 1: We will first prove the bound under hypothesis H0. Based on the additional notation defined at the beginning of the Appendix, we can simplify the (i, j)th term T k1,k ij of the test statistic T k1,k as:

T k1,k ij = Y k ij( Y k ij 1) k(k 1) + F( ˆwi ˆwj)2 2F( ˆwi ˆwj) Y k ij k

Pk l,m=1:l =m Y m ij Y l ij k(k 1) + F( ˆwi ˆwj)2 2F( ˆwi ˆwj)m l=1Y m ij k

ζ1= (yk ij)TA(k)yk ij + 1T k A(k)1k F(w i w j )2 2F(w i w j )1T k A(k)yk ij+

(F( ˆwi ˆwj) F(w i w j ))2 2(F( ˆwi ˆwj) F(w i w j )) Y k ij k F(w i w j )

Hypothesis Testing for Generalized Thurstone Models

ζ2= (yk ij F(w i w j )1k)TA(k)(yk ij F(w i w j )1k) | {z }

+(F( ˆwi ˆwj) F(w i w j ))2

2(F( ˆwi ˆwj) F(w i w j )) Y k ij k F(w i w j )

where in ζ1 we have A(k) 1k1T k Ik k(k 1) and yk ij Rk is a vector such that yk ij = [Y 1 ij, . . . , Y k ij] and in ζ2 the term I1,k ij follows by observing that

yk ij F(w i w j )1k TA(k) yk ij F(w i w j )1k =

(yk ij)TA(k)yk ij + 1T k A(k)1k F(w i w j )2 2pij1T k A(k)yk ij. (45)

Now, we will upper bound the quadratic variation term P

(i,j) E I1,k ij . For this we will utilize Lemma D.2 to obtain the tail bounds for any ν (0, 1/e) to obtain (for some constant c):

(i,j) E I1,k ij > c

It is straightforward to show that P

(i,j) E I2,k ij is (4 F ˆF 2 F/k)-sub-gaussian. Therefore, by utilizing a time-uniform version of Hoeffding inequality (Manole & Ramdas, 2023, Corollary 8) for the user-defined h(k) = (πk)2/6 (also used for the stitching argument in the proof of Lemma D.2), we obtain for any ν (0, 1)

(i,j) E I2,k ij > 2 F ˆF F

2log(h(log2(k))) + log(2/ν)

Combining the two tail bounds and a simple calculation completes the proof for type I error.

Part 2: Observe that under hypothesis H1 we have

T k ij (pij F(w i w j ))2 = Y k ij( Y k ij 1) k(k 1) p2 ij + F( ˆwi ˆwj)2 F(w i w j )2

+ 2 F(w i w j )pij F( ˆwi ˆwj) Y k ij k

ζ1= (yk ij)TA(k)yk ij 1T k A(k)1kp2 ij + F( ˆwi ˆwj)2 F(w i w j )2

+ 2pij(F(w i w j ) F( ˆwi ˆwj)) + 2 pij Y k ij k

F( ˆwi ˆwj), (46)

where in ζ1 we have Ak 1k1T k Ik k(k 1) and yk ij Rk is avector yk ij = [y ij, . . . ., yk ij] as before. Now, observe that the term

(yk ij)TA(k)yk ij 1T k A(k)1kp2 ij = yk ij pij1k TA(k) yk ij pij1k

+ 2pij yk ij pij1k TA(k)1k

= yk ij pij1k TA(k) yk ij pij1k + 2pij

Y k ij k pij

Hypothesis Testing for Generalized Thurstone Models

Substituting the above bound in (46), we obtain

T k1,k ij (pij F(w i w j ))2 = yk ij pij1k TA(k) yk ij pij1k

+ 2(pij F( ˆwi ˆwj))

Y k ij k pij

I4,k ij + (F( ˆwi ˆwj) F(w i w j ))(F( ˆwi ˆwj) + F(w i w j ) 2pij) | {z }

Now the term P

(i,j) E I3,k ij is bounded by utilizing Lemma D.2 to obtain the tail bounds for some constant c

(i,j) E I3,k ij < c

And the term P

(i,j) E I4,k ij is bounded by utilizing adaptive Hoeffding s inequality (Manole & Ramdas, 2023, Corollary 8) to obtain the tail bounds

(i,j) E I4,k ij < 4 P ˆF F

An application of the triangle inequality to the above equation yields

(i,j) E I4,k ij < 4( P F F + F ˆF F)

Finally, the term P

(i,j) E I5,k ij is bounded using Cauchy-Schwarz inequality as:

(i,j) E I5,k ij = X

(i,j) E (F( ˆwi ˆwj) F(w i w j ))2

(i,j) E 2(F( ˆwi ˆwj) F(w i w j ))(F(w i w j ) pij)

F ˆF 2 F 2 F ˆF F F P F.

Combining the three bounds for I3,k ij , I4,k ij , I5,k ij completes the proof for type II error.

Lemma D.2 (Time-Uniform Quadratic Bound). For any element v in a finite set V, consider a sequence of independent random variables x(1) v , x(2) v , x(3) v , . . . such that x(l) v Bernoulli(pv) for l N and some pv (0, 1). Define the random vector x(k) v = x(1) v , x(2) v , . . . , x(k) v Rk and let {A(k) Rk k : k N} be a sequence of matrices such that

A(k) = (1k1T k Ik)/(k(k 1)). Then, there exists a constant c > 0 such that for all ν (0, 1/e), we have

v V ( x(k) v pv1k)TA(k)( x(k) v pv1k) > c

|V| k log 3.5 log2 2(k)/ν !

Proof: Denote s(k) v = ( x(k) v pv1k)TA(k)( x(k) v pv1k). Recall that we had shown that {s(k) v : k N \ {1}} forms a reverse-martingale with respect to canonical reverse filtration {σ Pk m=1 x(m) v , x(k+1) v , x(k+2) v , . . . : k N \ {1}}, i.e., for k 2, we have

E ( x(k) v pv1k)TA(k)( x(k) v pv1k)|Fk+1 = ( x(k+1) v pv1k+1)TA(k+1)( x(k+1) v pv1k+1).

Hypothesis Testing for Generalized Thurstone Models

This fact follows from an expansion of the corresponding terms (similar to (45)), and then the proof follows as in Appendix D.1. Now, for any v V, we define a richer class of filtration known as exchangeable filtration { Fv k : k N\{1}} (Durrett, 2019), which denotes the σ-algebra generated by all real-valued Borel-measurable functions of x(1) v , x(2) v , x(3) v , . . . which are permutation-symmetric in the first k arguments. It follows directly that s(k) v is also a reverse-martingale with respect to Fv k . Therefore, by (Manole & Ramdas, 2023, Theorem 4), we have that the mapping x exp(λx) for λ (0, ), when applied to s(k) v , yields a reverse-submartingale with respect to filtration Fv k. Define the product σ-algebra Fk = N v V Fv k. Now, for any k0 2 and k0 N, we have that

v V s(k) v u

= P k k0 : eλ P

v V s(k) v eλu

E h exp λ P v V s(k0) v i

where the last step follows from Ville s inequality for nonnegative reverse submartingales (Manole & Ramdas, 2023, Theorem 2). Also note that s(k) v 1 for all k with probability 1, therefore E[eλs(k) v ] always exists. Now note that

v V ( x(k) v pv1k)TA(k)( x(k) v pv1k) = X

i,j [k]:i =j

(x(i) v pv)(x(j) v pv) k(k 1)

= ( x(k) V )TA(k) V ( x(k) V ),

where ( x(k) V )T = [( x(k) v1 1kpv1)T, . . . , ( x(k) v|V| 1kpv|V|)T] is a vector in Rk|V| formed by concatenating the vectors x(k) vi for all vi V and i [|V|]. And A(k) V R|V|k |V|k formed by stacking the matrices A(k) v as a diagonal block structure. Now by (Rudelson & Vershynin, 2013, Theorem 1.1), we have that there exists constants c, c such that we have

v V s(k0) v i exp( cλ2 A(k0) V 2 F) for λ c / A(k0) V 2,

where 2 denotes the spectral norm. Therefore, we have that

v V s(k) v u

exp( λu + cλ2 A(k0) V 2 F) for λ c / A(k0) V 2.

Now, by optimizing over λ and for some constant c, we can conclude that

v V s(k) v u

A(k0) V 2 F , u

By the block structure of A(k0) V , we have the following

A(k0) V 2 F = |V| k0(k0 1), and A(k0) V 2 = max v V A(k0) v 2 = 1

Substituting, the above values in (47) we obtain that for all ν (0, 1/e), we have the following result for some constant c:

v V s(k) v c

|V| log(1/ν)

The rest of the proof follows by a stitching argument (Zhao et al., 2016) and is provided below for completeness. For any

Hypothesis Testing for Generalized Thurstone Models

ν (0, 1/e), define a function h(k) = (πk)2

6 . Now, observe that

v V s(k) v c

|V| k log h(log2 k)

v V s(k) v c

|V| 2l log h(l)

ν h(l) = ν.

Finally, the statement in the lemma follows by a simple calculation.

D.3. Proof of Theorem 3.7

Fix t 1, and recall that from Lemma B.1, for some constant c0 we have that with probability at most e t, F ˆF F p

c0tn/k1. This implies that with probability at least 1 e t, we have

F ˆF 2 F + c

|E| k ℓk,ν + 4 F ˆF F

k1 + c|E| 1 2 k ℓk,ν + 4

Therefore, by using a basic union-bound argument to the bound under null hypothesis in Lemma D.1, we have

k 2, T k1,k c0tn

k1 + c|E| 1 2 k ℓk,ν + 4

The bound on type II error follows by a similar argument. First, using basic algebra, and using D to denote P F F, we obtain from the bound under H1 in Lemma D.1 that:

k 2, T k1,k (D F ˆF F)2 c|E| 1 2 k ℓk,ν 4(D + F ˆF F)

Now utilizing the fact that the D p

c0tn/k1 and the same union bound technique as above, we obtain that

k 2, T k1,k D r

Since the above bounds for any pairwise comparison matrix P satisfying Assumption 2.3 and since nϵ = Θ(D) by Theorem 3.1, hence we can taking supremum with respect to pairwise comparison P in class M0, M1( p

c0t/(nk1)) on type I and type II error bounds respectively, and thus completing the proof of the theorem.

E. Simplified Expressions for Type I and II Error Probability Bounds

We obtain the following corollaries by plugging in the parameter values in Theorem 3.7 for a complete graph and a single cycle on n nodes. Corollary E.1 (Type I and II Error Probability Bounds for Complete Graph). For the setting in Theorem 3.7 assume that we have a complete graph on n nodes, then there exists (different) constants c1, c2, c3, c4, c5 > 0 such that for ϵ c3/ nk1, such that for all ν (0, 1/e) and t 1, we have

k 2, T k1,k c1tn

k1 + c2nℓk,ν

k 2, T k1,k D c4

Hypothesis Testing for Generalized Thurstone Models

Corollary E.2 (Type I and II Error Probability Bounds for Single Cycle Graph). For the setting in Theorem 3.7, assume that we have a single cycle graph on n nodes, then there exists (different) constants c1, c2, c3, c4, c5 > 0 such that for ϵ c3/ nk1, such that for all ν (0, 1/e) and t 1, we have

k 2, T k1,k c1tn

k1 + c2 nℓk,ν

k 2, T k1,k D c4

F. Experimental Details and Additional Experiments

In this appendix, we begin by providing additional details of the experiments corresponding to Section 4 in Appendix F.1. Furthermore, we perform several additional experiments in Appendix F.2 to support and extend our main findings. These include: (i) performance comparisons with the spectral method of (Makur & Singh, 2023) for the BTL model (see Appendix F.2.1), (ii) analysis of type I and II errors for different thresholding strategies (see Appendix F.2.2) and empirical validation of our threshold estimation approach under different settings (e.g., clustered versus randomly sampled skill scores), and (iii) experiments on real-world datasets (see Appendix F.2.3). Finally, we illustrate a method of computing confidence intervals in Appendix F.3 based on the expression in Proposition 3.8.

F.1. Additional Details for Experiments

Below we provide additional details on the experimental setup and methodology for the experiments in Section 4.

Error bars and estimation of ˆw: To estimate the 95th quantile of the test statistics T, we used 400 samples. The error bars were based on the two-sided distribution-free conservative estimates presented in (Hahn & Meeker, 1991). Specifically, for the 95th quantile, the upper 96% confidence interval was computed as the 97th quantile of the computed tests T, and the lower confidence interval was computed as the 92.5% quantile. Moreover, in all of our experiments, the estimation of ˆw was performed using a standard gradient descent algorithm with a learning rate of 0.01 and for a maximum of 3000 iterations, or until the norm of the gradient was less than 10 5.

Testing on LMSYS dataset: In our experiment on the LMSYS dataset, we used a maximum of 200 samples per pair and discarded pairs with fewer than 30 observed comparisons to reduce the imbalance in the data across pairs. The observation graph was a complete graph except for the larger values of n, which had a few edges missing.

Computational overhead of empirical quantile approach: While computing the threshold using the empirical quantile approach can be expensive due to repeated simulations, we remark that the procedure is highly parallelizable and is fast even with a na ıve implementation (without any parallelization); indeed, each of our experiments can be done within 5 minutes on a normal CPU for n as large as 60. Additionally, to speed up computing ˆw on the simulated dataset, one can initialize the iterates at the optimal values that generated the simulated dataset and then apply a few iterations of gradient descent. Moreover, when all kij are equal to k (or roughly the same), our simulation results in Figure 2 suggest that 0.75n/k and 0.4n/k are good approximations of the thresholds for complete graphs and 2D grids, respectively. On the other hand, Appendix F.3 provides a faster asymptotic alternative based on Proposition 3.8.

F.2. Additional Experiments

F.2.1. COMPARISONS TO BASELINE METHODS

We now conduct experiments to compare the performance of our test with the test proposed in (Makur & Singh, 2023) by fixing the choice function F to be the logistic function (i.e., BTL model). We assume that the observation graph is complete and define the pairwise comparison matrix P under the null and alternative hypotheses as the one used to prove the lower bound in (39) and (40). We set η = 0.06 and kij = 10 for all i = j. We evaluate the empirical type I and II error probabilities, averaged over 400 trials, for three testing methods: our proposed likelihood-based statistic with sample splitting (Max. Likelihood), the same statistic without any sample splitting (Max. Likelihood2), and the spectral approach from (Makur & Singh, 2023) (Spectral). The threshold for all three methods is determined using the empirical quantile approach as detailed in Section 4. Our results in Figure 3 indicate that the proposed method performs comparably to the spectral approach, while Max. Likelihood2 achieves even lower type I and II sum error rates. Various parameters, such as the number of simulated datasets and construction of error bars, are the same as those used in Section 4.

Hypothesis Testing for Generalized Thurstone Models

F.2.2. COMPARISON OF EMPIRICAL AND ANALYTICAL THRESHOLDS

To evaluate the reliability of our empirical-quantile-based threshold, we compare the empirical-quantile-based threshold with the optimal theoretical threshold in a well-crafted setting. Specifically, we consider the same setting as in Appendix F.2.1 and set the theoretical threshold as η2n(n 1)/2 across a range of values of n. Specifically, we vary η [0.08, 0.16] and n {15, 25, 35, 45}, while fixing kij = 10 for all i = j. For each configuration, we compute type I and II error probabilities using the empirical-quantile-based threshold (cf. Section 4) and the theoretical thresholds averaged over 400 trials. As shown in Figure 4, the empirical threshold achieves type I error rates that closely track the nominal 0.05 level and achieves nearly identical type I and II sum-error rates as that of the analytical threshold. This suggests that the empirical threshold, despite being derived from random samples of skill scores, provides performance comparable to that of the theoretically optimal threshold.

15 25 35 45 Number of items (n)

Empirical Type I Error

Max. Likelihood Max. Likelihood2 Spectral

n=45 =0.16 (n, ) Combinations

Empirical Type I Error

Empirical Quantile Threshold Theoretical Threshold

15 25 35 45 Number of items (n)

Empirical Type II Error

Max. Likelihood Max. Likelihood2 Spectral

=0.16 (n, ) Combinations

Empirical Type II Error

Empirical Quantile Threshold Theoretical Threshold

15 25 35 45 Number of items (n)

Sum of Empirical Type I and II Error

Max. Likelihood Max. Likelihood2 Spectral

Figure 3: Performance comparison of spectral method in (Makur & Singh, 2023; 2024) and our proposed method with (Max. Likelihood) and without partitioning (Max. Likelihood2) of dataset.

n=45 =0.16 (n, ) Combinations

Sum of Empirical Type I and II Error

Empirical Quantile Threshold Theoretical Threshold

Figure 4: Empirical type I and II error rates along with their sum for the Empirical Quantile method compared to a wellcrafted threshold.

Hypothesis Testing for Generalized Thurstone Models

10 20 30 40 50 60 70 n

95th Percentile Value

BTL, 2D Grid, random BTL, Complete, random BTL, 2D Grid, clustered BTL, Complete, clustered

Figure 5: Threshold computation using random versus clustered skill scores.

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Number of Years (t)

Threshold for BTL Threshold for Thurstone Test Statistic-BTL Test Statistic-Thurstone

Figure 6: Scaled test statistic n T for BTL and Thurstone models computed on NBA dataset on cumulative data of t recent years and thresholds computed using empirical-quantilebased approach.

To address potential coverage concerns arising from the use of uniformly sampled skill scores in our empirical quantile thresholding approach, we conduct an additional experiment to test the sensitivity of the threshold. In our main setup, scores are sampled uniformly from the interval [ b, b], which may not fully reflect more structured or clustered configurations. To evaluate the impact of this assumption, we construct a clustered distribution by assigning half of the skill scores uniformly at random in the range [ 0.7, 0.4] and setting the remaining half to be their exact negatives, thereby introducing a symmetric bimodal structure. As shown in Figure 5, the empirical thresholds derived from the uniform and clustered settings are nearly identical.

F.2.3. ADDITIONAL EXPERIMENTS ON REAL-WORLD DATASET

In addition to our LMSYS dataset analysis, we apply our testing procedure to historical NBA match outcomes using the publicly available dataset from Kaggle (Lauga, 2023). We focus on games played until the 2022 season and restrict attention to the 12 teams with the highest number of games played since 2002. Each data point corresponds to a comparison between a home and an away team, and we evaluate our hypothesis test on cumulative game data aggregated over a rolling window of t recent years. As shown in Figure 6, for small values of t (approximately the first 10 seasons), the data is consistent with the BTL and Thurstone models. However, as t increases, the hypothesis is increasingly rejected, suggesting that a single latent skill score per team fails to adequately represent team strength over long horizons of time.

F.3. Confidence Intervals under Null Hypothesis

In this subsection, we will discuss a method to approximately calculate the confidence intervals under the null hypothesis, with a focus on the BTL model. While our discussion is specific to the BTL model, it can be easily generalized to other Thurstone models. Specifically, our goal is to approximately calculate the constants in Proposition 3.8 and as well as approximate the distribution of ˆF F F. For the former, we will estimate the constants by conducting some simulation, while for the latter, we will be utilizing Gaussian approximation based on the asymptotic normality of ˆw w , which was proved for the BTL model (Gao et al., 2023, Proposition 4.1). Similar results have been established for the general Thurstone models as in (Han et al., 2024).

Estimating constant c7 in Proposition 3.8: To estimate the constant, we plot several trajectories of the normalized stochastic process

v V( x(k) v pv1k)TA(k)( x(k) v pv1k) with x(k) v generated as in Lemma D.2 and the pv selected uniformly at random from (0, 1). The values of |V| varied from 10 to 100 with gaps of 10. Figure 7 plots the various trajectories of the (normalized) stochastic process as a function of k and also plots the 95th quantile for the stochastic process for all k [100]. The figure suggests that c7 0.45 is a good approximation to the value of c7 (in the fixed sample setting).

Estimating the quantile of F ˆF F: In order to estimate F ˆF F, we will utilize the asymptotic normality of vector = ˆw w . Let ˆw be computed as in (7), then under mild regularity conditions, it was shown in (Gao et al., 2023,

Hypothesis Testing for Generalized Thurstone Models

0 20 40 60 80 100 k

Quadratic deviation

95th Percentile

Figure 7: Plot of various trajectories of stochastic process in Lemma D.2.

0.2 0.3 0.4 0.5 0.6 0.7 0.8

Empirical Density

Bootstrapped F F F Gaussian approximation

Figure 8: Histogram of F ˆF 2 F based on bootstrapping versus asymptotic approximation.

20 30 40 50 60 n

Estimated Threshold Value

Complete, k=12 Complete, k=20 Toroidal Grid, k=12 Toroidal Grid, k=20

Figure 9: Estimated threshold based on Proposition 3.8.

Proposition 4.1) that (ρ1( ˆw)( ˆw1 w 1), . . . , ρn( ˆw)( ˆwn ˆw n)) d N(0, Ik),

where ρi( ˆw) = q

k P j:(i,j) E F ( ˆwi ˆwj) and d denotes the convergence in distribution. We will utilize this asymptotic

normality result to approximate the distribution of ˆF F F using the delta method as:

F ˆF 2 F = X

F w i w j F( ˆwi ˆwj) 2

(i,j) E F ( ˆwi ˆwj)2 (w i ˆwi) w j ˆwj 2

(i,j) F ( ˆwi ˆwj)2( i j)2 = 2 TLF ( ˆw) ,

where we define LF (w) to be the following matrix

(LF (w))ij =

F (wi wj)2 if (i, j) E P

j:(i,j) E F (wi wj)2 if i = j 0 otherwise.

Since is asymptotically normal (as k ), therefore we approximate the distribution of F ˆF 2 F with distribution

Hypothesis Testing for Generalized Thurstone Models

of 2 T( ˆw)LF ( ˆw) ( ˆw) where i( ˆw) N 0, 1 k P

j:(i,j) E F ( ˆ wi ˆ wj) . In Figure 8, we plot the empirical distribution

of F( ˆw) F(w ) F calculated by randomizing over the choice of partitioning of Z into Z1 and Z2. We also plot its asymptotic approximation, i.e., the empirical distribution of 2 T( ˆw)LF ( ˆw) ( ˆw). Clearly, as can be seen in Figure 8, our asymptotic approximation does indeed well approximate the empirical distribution even for a small number of samples. Finally, based on the the 95th percentile of the empirical distribution of 2 T( ˆw)LF ( ˆw) ( ˆw), we compute the expression for c7 = 0.45 and plot the estimated confidence intervals in Figure 9. It is worth noting that the estimated threshold values for complete graphs converge towards the theoretical value of 0.8, while those for toroidal grids approach 0.4. These findings are consistent with the threshold values computed via the empirical quantile method presented in Figure 1 for the respective graph topologies. Together, these observations suggest that while the exact distribution of T is difficult to characterize, its tail can be asymptotically approximated using a quadratic function of Gaussian random variables via Proposition 3.8.

G. Testing for TF Models in TV and Spectral Norms

In Section 2, we framed the hypothesis testing problem using the Frobenius norm due to its analytical tractability, particularly in the context of maximum-likelihood-type estimators. There is also significant precedent for employing quadratic distance measures in classical statistical testing for instance, the χ2-test essentially utilizes a squared weighted ℓ2-distance. However, in some applications, alternative notions of separation may be more desirable. One such example is the TV distance, which has appealing properties, such as Le Cam s relation and the data processing inequality (cf. (Makur, 2019; Makur & Zheng, 2020)).

Recall that in (11), we defined the separation distance from the class of TF models using the Frobenius norm. Alternatively, for any pairwise comparison matrix P [0, 1]n n, we can define the TV separation distance from the model class TF as

TV(P, TF ) inf w Wb 1 |E|

(i,j) E |pij F(wi wj)|.

Using TV(P, TF ) ϵ in (11) leads to an equivalent hypothesis testing problem, but with the separation measured in TV distance. Utilizing standard norm-equivalence inequalities, such as x 1 n x 2 for x Rn, we can relate this to the Frobenius norm as TV(P, TF ) inf w Wb 1 p

|E| P F(w) F.

Therefore, if the TV separation satisfies TV(P, TF ) ϵ, it follows that

inf w Wb 1 n P F(w) F ϵ

Recall from the minimax formulation in (12) that our test distinguishes between the null and alternative hypotheses with minimax risk at most 1/2, provided the Frobenius separation exceeds c/

nk (see Theorem 3.2). Utilizing the inequality (48), this implies that our test retains minimax risk at most 1/2 whenever

which implies that the TV separation distance TV(P, TF ) c p

n/(|E|k) is sufficient for testing (yielding an upper bound on critical threshold for TV distance). Notably, for complete graphs, this reduces to the critical threshold of O(1/

nk), which also matches the lower bound established by (Seshadri & Ugander, 2019) for the same notion of TV separation distance.

Finally, similar arguments allow our upper bounds to be extended to the spectral norm, using the inequality A 2 A F for any matrix A. Consequently, our test also guarantees a small minimax risk under spectral norm separation. However, obtaining corresponding lower bounds under the spectral norm remains an open problem.