# Active Embedding Search via Noisy Paired Comparisons

Gregory H. Canal 1, Andrew K. Massimino 1, Mark A. Davenport 1, Christopher J. Rozell 1

1 School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, Georgia, United States. Correspondence to: Gregory Canal. Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

Suppose that we wish to estimate a user's preference vector w from paired comparisons of the form "does user w prefer item p or item q?", where both the user and items are embedded in a low-dimensional Euclidean space with distances that reflect user and item similarities. Such observations arise in numerous settings, including psychometrics and psychology experiments, search tasks, advertising, and recommender systems. In such tasks, queries can be extremely costly and subject to varying levels of response noise; thus, we aim to actively choose pairs that are most informative given the results of previous comparisons. We provide new theoretical insights into the benefits and challenges of greedy information maximization in this setting, and develop two novel strategies that maximize lower bounds on information gain and are simpler to analyze and compute, respectively. We use simulated responses from a real-world dataset to validate our strategies through their similar performance to greedy information maximization, and their superior preference estimation over state-of-the-art selection methods as well as random queries.

1. Introduction

We consider the task of user preference learning, where we have a set of items (e.g., movies, music, or food) embedded in a Euclidean space and aim to represent the preferences of a user as a continuous point in the same space (rather than simply a rank ordering over the items), so that their preference point is close to items the user likes and far from items the user dislikes. To estimate this point, we consider a system using the method of paired comparisons, where during a sequence of interactions a user chooses which of two given items they prefer (David, 1963). For instance, to characterize a person's taste in food, we might ask them which one of two dishes they would rather eat, for a number of different pairs of dishes. The recovered preference point can be used in various tasks, for instance in the recommendation of nearby items, personalized product creation, or clustering of users with similar preferences. We refer to the entire process of querying via paired comparisons and continuous preference point estimation as pairwise search, and note that this is distinct from the problem of searching for a single discrete item in a fixed dataset.

A key goal of ours is to actively choose the items in each query and demonstrate the advantage over non-adaptive selection. More specifically, given N items, there are $O(N^2)$ possible paired comparisons. Querying all such pairs is not only prohibitively expensive for large datasets, but also unnecessary, since not all queries are informative: some queries are rendered obvious by the accumulation of evidence about the user's preference point, while others are rendered ambiguous by noise in the comparison process.
Given these considerations, the main contribution of this work is the design and analysis of two new query selection algorithms for pairwise search that select the most informative pairs by directly modeling redundancy and noise in user responses. While previous active algorithms have been designed for related paired comparison models, none directly account for probabilistic user behavior as we do here. To the best of our knowledge, our work is the first attempt to search a low-dimensional embedding for a continuous point via paired comparisons while directly modeling noisy responses.

Our approach builds upon the popular technique in active learning and Bayesian experimental design of greedily maximizing information gain (Settles, 2012; Lindley, 1956; MacKay, 1992). In our setting, this corresponds to selecting pairs that maximize the mutual information between the user's response and the unknown location of their preference point. We provide new theoretical and computational insights into the relationships between information gain maximization and estimation error minimization in pairwise search, and present a lower bound on the estimation error achievable by any query strategy. Due to the known difficulty of analyzing greedy information gain maximization (Chen et al., 2015) and the high computational cost of estimating mutual information for each pair in a pool, we propose two strategies that each maximize new lower bounds on information gain and are simpler to analyze and compute, respectively. We present upper and lower bounds on the performance of our first strategy, which then motivates the use of our second, computationally cheaper strategy. We then demonstrate through simulations using a real-world dataset how both strategies perform comparably to information maximization while outperforming state-of-the-art techniques and randomly selected queries.

2. Background

2.1. Observation Model

Our goal in this paper is to estimate a user's preference point (denoted as vector w) with respect to a given low-dimensional embedding of items constructed such that distances between items are consistent with item similarities: similar items are close together and dissimilar items are far apart. While many items (e.g., images) exist in their raw form in a high-dimensional space (e.g., pixel space), this low-dimensional representation of items and user preferences offers the advantage of simple Euclidean relationships that directly capture notions of preference and similarity, as well as mitigating the effects of the curse of dimensionality in estimating preferences. Specifically, we suppose user preferences can be captured via an ideal point model in which each item and user is represented using a common set of parameters in $\mathbb{R}^d$, and that a user's overall preference for a particular item decreases with the distance between that item and the user's ideal point w (Coombs, 1950). This means that any item placed exactly at the user's point would be considered ideal and would be the most preferred over all other items. Although this model can be applied to the situation where a particular item is sought, in general we do not assume the user point w to be co-located with any item.
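As a small illustration of the ideal point model (our own sketch, not from the paper; the item coordinates and user point below are made up), preferences can be read off by ranking items according to their Euclidean distance to the user's ideal point w:

```python
import numpy as np

# Hypothetical 2-D embedding of four items and a user's ideal point w.
items = np.array([[0.2, 0.9],
                  [0.5, 0.1],
                  [0.9, 0.8],
                  [0.1, 0.3]])
w = np.array([0.4, 0.2])

# Under the ideal point model, preference decreases with distance to w.
distances = np.linalg.norm(items - w, axis=1)
ranking = np.argsort(distances)  # indices from most- to least-preferred item
print("preference ranking of items:", ranking)
```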
The embedding of the items can be constructed through a training set of triplet comparisons (paired comparisons regarding the similarity of two items to a third reference item) using one of several standard non-metric embedding techniques such as the Crowd Kernel Learning (Tamuz et al., 2011) or Stochastic Triplet Embedding methods (Van Der Maaten & Weinberger, 2012). In this study, we assume that such an embedding is given, presumably acquired through a large set of crowdsourced training triplet comparisons. We do not consider this training set to be part of the learning cost in measuring a pairwise search algorithm's efficiency, since our focus here is on efficiently choosing paired comparisons to search an existing embedding.

In this work, we assume a noisy ideal point model in which the probability of a user located at w choosing item p over item q in a paired comparison is modeled as

$$P(p \succ q) = f\!\left(k_{pq}\left(\|w - q\|_2^2 - \|w - p\|_2^2\right)\right), \qquad (1)$$

where $p \succ q$ denotes "item p is preferred to item q," $f(x) = 1/(1 + e^{-x})$ is the logistic function, and $k_{pq} \in [0, \infty)$ is the pair's noise constant, which represents roughly the signal-to-noise ratio of a particular query and may depend on the values of p and q. This type of logistic noise model is common in the psychometrics literature and bears similarity to the Bradley–Terry model (Bradley & Terry, 1952). Note that (1) can also be written as $P(p \succ q) = f(k_{pq}(a^T w - b))$, where $a = 2(p - q)$ and $b = \|p\|_2^2 - \|q\|_2^2$ encode the normal vector and threshold of a hyperplane bisecting items p and q. After a number of such queries, the response model in (1) for each query can be multiplied to form a posterior belief about the location of w, as depicted in Fig. 1.

Figure 1. Paired comparisons between items can be thought of as a set of noisy hyperplane queries. In the high-fidelity case, this uniquely identifies a convex region of $\mathbb{R}^d$. In general, we have a posterior distribution which only approximates the shape of the ideal cell around the true user point, depicted with an "x".

Note that we allow the noise constant $k_{pq}$ to differ for each item pair to allow for differing user behavior depending on the geometry of the items being compared. When $k_{pq} \to \infty$, this supposes a user's selection is made with complete certainty and cannot be erroneous. Conversely, $k_{pq} = 0$ corresponds to choosing items randomly with probability 1/2. Varying $k_{pq}$ allows for differing reliability when items are far apart versus when they are close together. Some concrete examples for setting this parameter are:

constant: $k^{(1)}_{pq} = k_0$, (K1)
normalized: $k^{(2)}_{pq} = k_0 \|a\|^{-1} = \tfrac{1}{2} k_0 \|p - q\|^{-1}$, (K2)
decaying: $k^{(3)}_{pq} = k_0 \exp(-\|a\|) = k_0 \exp(-2\|p - q\|)$. (K3)
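As a concrete illustration of the observation model in (1) and the noise-constant choices (K1)–(K3), the sketch below (our own illustrative code, not the authors' released implementation; the item coordinates and the value of k0 are arbitrary) evaluates the response probability using the equivalent hyperplane form $f(k_{pq}(a^T w - b))$ and draws a simulated noisy response:

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def noise_constant(p, q, k0, scheme="constant"):
    # Noise-constant schemes (K1)-(K3); a = 2(p - q) is the hyperplane normal.
    a_norm = np.linalg.norm(2.0 * (p - q))
    if scheme == "constant":      # (K1): k0
        return k0
    if scheme == "normalized":    # (K2): k0 / ||a||
        return k0 / a_norm
    if scheme == "decaying":      # (K3): k0 * exp(-||a||)
        return k0 * np.exp(-a_norm)
    raise ValueError(scheme)

def prob_prefers_p(w, p, q, k0, scheme="constant"):
    # P(p > q) = f(k_pq (a^T w - b)) with a = 2(p - q), b = ||p||^2 - ||q||^2,
    # which equals f(k_pq (||w - q||^2 - ||w - p||^2)) as in (1).
    a = 2.0 * (p - q)
    b = p @ p - q @ q
    k_pq = noise_constant(p, q, k0, scheme)
    return logistic(k_pq * (a @ w - b))

rng = np.random.default_rng(0)
w, p, q = rng.normal(size=(3, 4))          # user point and two items in R^4
prob = prob_prefers_p(w, p, q, k0=5.0, scheme="normalized")
response = rng.random() < prob             # simulated noisy user response
print(f"P(p > q) = {prob:.3f}, simulated response: {'p' if response else 'q'}")
```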
Paired comparisons have also been used for learning user preferences: (Qian et al., 2015) models user preferences as a vector, but preferences are modeled as linear weightings of item features rather than by relative distances between the user and items in an embedding, resulting in a significantly different model (e.g., monotonic) of preference. (Brochu et al., 2008) considers the task of actively estimating the maximizer of an unknown preference function over items, while (Houlsby et al., 2012) and (Chu & Ghahramani, 2005) actively approximate the preference function itself, the former study notably using information gain as a metric for selecting queries. Yet, these approaches are not directly comparable to our methods since they do not consider a setting where user points are assigned within an existing low-dimensional item embedding. (Tamuz et al., 2011) does consider the same item embedding structure as our setting and actively chooses paired comparisons that maximize information gain for search, but only seeks discrete items within a fixed dataset rather than estimating a continuous preference vector as we do here. Furthermore we provide novel insights into selecting pairs via information gain maximization, and mainly treat information gain for pairwise search as a baseline in this work since our primary focus is instead on the development, analysis, and evaluation of alternative strategies inspired by this approach. The most directly relevant prior work to our setting consists of the theory and algorithms developed in (Massimino & Davenport, 2018) and (Jamieson & Nowak, 2011). In (Massimino & Davenport, 2018), item pairs are selected in stages to approximate a Gaussian point cloud that surrounds the current user point estimate and dyadically shrinks in size with each new stage. In (Jamieson & Nowak, 2011), previous query responses define a convex polytope in d dimensions (as in Fig. 1), and their algorithm only selects queries whose bisecting hyperplanes intersect this feasible region. While this algorithm in its original form only produces a rank ordering over the embedding items, for the sake of a baseline comparison we extend it to produce a preference point estimate from the feasible region. Neither of these studies fundamentally models or handles noise in their active selection algorithms; slack variables are used in the user point estimation of (Massimino & Davenport, 2018) to allow for contradicting query responses, but the presence of noise is not considered when selecting queries. In an attempt to filter non-persistent noise (the type encountered in our work), (Jamieson & Nowak, 2011) simply repeat each query multiple times and take a majority vote as the user re- sponse, but the items in the query pair are still selected using the same method as in the noiseless setting. Nevertheless, these methods provide an important baseline. 3. Query Selection We now proceed to describe the pair selection problem in detail along with various theoretical and computational considerations. We show that the goal of selecting pairwise queries to minimize estimation error leads naturally to the strategy of information maximization and subsequently to the development of our two novel selection strategies. 3.1. Minimizing Estimation Error Let W Rd (d 2) denote a random vector encoding the user s preference point, assumed for the sake of analysis to be drawn from a uniform distribution over the hypercube [ 1 2]d denoted by the prior density of p0(w). 
Unless noted otherwise, we denote random variables with uppercase letters and specific realizations with lowercase letters. Let $Y_i \in \{0, 1\}$ denote the binary response to the $i$th paired comparison involving items $p_i$ and $q_i$, with $Y_i = 0$ indicating a preference for $q_i$ and $Y_i = 1$ a preference for $p_i$. After $i$ queries, we have the vector of responses $Y^i = \{Y_1, Y_2, \ldots, Y_i\}$. We assume that each response $Y_i$ is conditionally independent of previous responses $Y^{i-1}$ when conditioned on the preference $W$. Applying this assumption in conjunction with a recursive application of Bayes' rule, after $i$ queries we have the posterior density

$$p_i(w) \equiv p(w \mid Y^i) = \frac{p_0(w) \prod_{j=1}^{i} p(Y_j \mid w)}{\prod_{j=1}^{i} p(Y_j \mid Y^{j-1})}, \qquad (2)$$

where $p(Y_i \mid w)$ is given by the model in (1). This logistic likelihood belongs to the class of log-concave (LCC) distributions, whose probability density functions $f(w)$ satisfy $f(\alpha w_1 + (1 - \alpha) w_2) \geq f(w_1)^{\alpha} f(w_2)^{1 - \alpha}$ for any $w_1, w_2 \in \mathbb{R}^d$ and $0 \leq \alpha \leq 1$. Since $p_0(w)$ is log-concave and products of log-concave functions are also log-concave (Saumard & Wellner, 2014), the posterior density given in (2) is log-concave.

Suppose that after $i$ queries, the posterior $p_i(w)$ is used to produce a Bayesian user point estimate $\widehat{W}_i$. We denote the mean squared error for this estimate by $\mathrm{MSE}_i = \mathbb{E}_{W \mid Y^i}[\|W - \widehat{W}_i\|_2^2]$, which provides a direct measure of our estimation error and is a quantity we wish to minimize by adaptively selecting queries based on previous responses. One approach might be to greedily select an item pair such that $\mathrm{MSE}_{i+1}$ is minimized in expectation after the user responds. However, this would require both updating the posterior distribution and estimating $\mathrm{MSE}_{i+1}$ for each possible response over all item pairs. This would be very computationally expensive since under our model there is no closed-form solution for $\mathrm{MSE}_{i+1}$, and so each such evaluation requires a lookahead batch of Monte Carlo samples from the posterior. Specifically, if $S$ posterior samples are generated for each $\mathrm{MSE}_{i+1}$ evaluation over a candidate pool of $M$ pairs at a computational cost of $C$ per sample generation, and $\mathrm{MSE}_{i+1}$ is estimated with $O(dS)$ operations per pair, this strategy requires $O((C + d)SM)$ computations to select each query. This is undesirable for adaptive querying settings, where typically datasets are large (resulting in a large number of candidate pairwise queries) and queries need to be selected in or close to real time.

Instead, consider the covariance matrix of the user point posterior after $i$ queries, denoted as $\Sigma_{W \mid Y^i} = \mathbb{E}[(W - \mathbb{E}[W \mid Y^i])(W - \mathbb{E}[W \mid Y^i])^T \mid Y^i]$. For the minimum mean squared error (MMSE) estimator, given by the posterior mean $\widehat{W}_i = \mathbb{E}[W \mid Y^i]$, we have

$$\mathrm{MSE}_i = \operatorname{Tr}\!\left(\Sigma_{W \mid Y^i}\right) \geq d\,\left|\Sigma_{W \mid Y^i}\right|^{1/d},$$

where the last inequality follows from the arithmetic–geometric mean inequality (AM–GM) (Boyd & Vandenberghe, 2004). This implies that a necessary condition for a low MSE is for the posterior volume, defined here as the determinant of the posterior covariance matrix, to also be low. Unfortunately, actively selecting queries that greedily minimize posterior volume is too computationally expensive to be useful in practice, since this also requires a set of lookahead posterior samples for each candidate pair and possible response, resulting in a computational complexity of $O(((C + d^2)S + d^3)M)$ to select each query from the combined cost per pair of generating samples ($O(CS)$), estimating $\Sigma_{W \mid Y^i}$ ($O(d^2 S)$), and calculating $|\Sigma_{W \mid Y^i}|$ ($O(d^3)$).
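The selection strategies developed below all rely on a batch of samples drawn from the posterior in (2). As a minimal sketch of that ingredient (our own illustrative code under hypothetical names, not the authors' implementation), the unnormalized log-posterior implied by (1) and (2) can be written as follows and handed to any MCMC sampler; because it is log-concave, sampling is well behaved:

```python
import numpy as np

def log_logistic(x):
    # Numerically stable log f(x), with f the logistic function.
    return -np.logaddexp(0.0, -x)

def unnormalized_log_posterior(w, A, b, k, y):
    """Log of p_0(w) * prod_j p(y_j | w), up to an additive constant.

    A : (n, d) array of hyperplane normals a_j = 2(p_j - q_j)
    b : (n,) array of thresholds b_j = ||p_j||^2 - ||q_j||^2
    k : (n,) array of noise constants k_j
    y : (n,) array of responses, 1 if p_j was preferred, 0 if q_j was preferred
    """
    if np.any(np.abs(w) > 0.5):          # uniform prior on [-1/2, 1/2]^d
        return -np.inf
    margins = k * (A @ w - b)            # logit of P(y_j = 1 | w)
    signed = np.where(y == 1, margins, -margins)
    return np.sum(log_logistic(signed))  # sum of Bernoulli log-likelihoods

# Tiny usage example with made-up queries and responses.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))
b = rng.normal(size=5)
k = np.full(5, 4.0)
y = rng.integers(0, 2, size=5)
print(unnormalized_log_posterior(np.zeros(3), A, b, k, y))
```

In the paper's own experiments (Section 4), posterior samples are generated with the Stan probabilistic programming language rather than a hand-rolled sampler; the sketch above only shows what the target density looks like.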
3.2. Information Theoretic Framework

By utilizing statistical tools from information theory, we can select queries that approximately minimize posterior volume (and hence tend to encourage low MSE) at a more reasonable computational cost. Furthermore, an information theoretic approach provides convenient analytical tools which we use to provide performance guarantees for the query selection methods we present. Towards this end, we define the posterior entropy as the differential entropy of the posterior distribution after $i$ queries:

$$h_i(W) \equiv h(W \mid y^i) = -\int_w p_i(w) \log_2(p_i(w))\, dw. \qquad (3)$$

As we show in the following lemma, the posterior entropy of LCC distributions is both upper and lower bounded by a monotonically increasing function of posterior volume, implying that low posterior entropy is both necessary and sufficient for low posterior volume, and hence a necessary condition for low MSE. The proofs of this lemma and subsequent results are provided in the supplementary material.

Lemma 3.1. For an LCC posterior distribution $p(w \mid Y^i)$ in $d \geq 2$ dimensions, where $c_d = e^2 d^2 / 4$,

$$\frac{d}{2} \log_2 \frac{2\,|\Sigma_{W \mid Y^i}|^{1/d}}{e^2 c_d} \;\leq\; h_i(W) \;\leq\; \frac{d}{2} \log_2\!\left(2 \pi e\, |\Sigma_{W \mid Y^i}|^{1/d}\right).$$

This relationship between MSE, posterior volume, and posterior entropy suggests a strategy of selecting queries that minimize the posterior entropy after each query. Since the actual user response is unknown at the time of query selection, we seek to minimize the expected posterior entropy after a response is made, i.e., $\mathbb{E}_{Y_{i+1}}[h_{i+1}(W) \mid y^i]$. Using a standard result from information theory, we have $\mathbb{E}_{Y_i}[h_i(W) \mid y^{i-1}] = h_{i-1}(W) - I(W; Y_i \mid y^{i-1})$, where $I(W; Y_i \mid y^{i-1})$ is the mutual information between the location of the unknown user point and the user response, conditioned on previous responses (Cover & Thomas, 2012). Examining this identity, we observe that selecting queries that minimize the expected posterior entropy is equivalent to selecting queries that maximize the mutual information between the user point and the user response, referred to here as the information gain.

In this setting, it is generally difficult to obtain sharp performance bounds for query selection via information gain maximization. Instead, we use information theoretic tools along with Lemma 3.1 to provide a lower bound on MSE for any estimator and query selection scheme, in a manner similar to (Prasad, 2010) and (Cover & Thomas, 2012):

Theorem 3.2. For any user point estimate $\widehat{W}_i$ after $i$ queries, the MSE (averaged over user points and query responses) for any selection strategy is bounded by

$$\mathbb{E}_{W, Y^i}\!\left[\|W - \widehat{W}_i\|_2^2\right] \;\geq\; \frac{d}{2 \pi e}\, 2^{-\frac{2i}{d}}.$$

This result implies that the best rate of decrease in MSE one can hope for is exponential in the number of queries and slows down in a manner inversely proportional to the dimension, indicating quicker possible preference convergence in settings with lower dimensional embeddings.

To estimate the information gain of a query, we can use the symmetry of mutual information to write

$$I(W; Y_i \mid y^{i-1}) = H(Y_i \mid y^{i-1}) - H(Y_i \mid W, y^{i-1}) \qquad (4)$$
$$H(Y_i \mid y^{i-1}) = -\sum_{Y_i \in \{0,1\}} p(Y_i \mid y^{i-1}) \log_2 p(Y_i \mid y^{i-1}) \qquad (5)$$
$$H(Y_i \mid w, y^{i-1}) = -\sum_{Y_i \in \{0,1\}} p(Y_i \mid w) \log_2 p(Y_i \mid w) \qquad (6)$$
$$H(Y_i \mid W, y^{i-1}) = \mathbb{E}_{W \mid y^{i-1}}\!\left[H(Y_i \mid W, y^{i-1})\right]. \qquad (7)$$

Unlike the greedy MSE and posterior volume minimization strategies, information gain estimation only requires a single batch of posterior samples at each round of query selection, which is used to estimate the discrete entropy quantities in (4)–(7).
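The decomposition (4)–(7) can be estimated directly from a batch of posterior samples. The sketch below (our own illustration with hypothetical variable names, not the authors' released code) estimates a candidate pair's information gain from S samples of W: the outer entropy $H(Y_i \mid y^{i-1})$ uses the sample-averaged response probability, and $H(Y_i \mid W, y^{i-1})$ averages the binary entropy of the response probability at each sample.

```python
import numpy as np

def binary_entropy(p):
    # h_b(p) in bits, with p clipped away from 0 and 1 for numerical safety.
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def info_gain(samples, p_item, q_item, k_pq):
    """Estimate I(W; Y) for the pair (p, q) from posterior samples of W, per (4)-(7)."""
    a = 2.0 * (p_item - q_item)
    b = p_item @ p_item - q_item @ q_item
    # P(Y = 1 | w) for each posterior sample w, under the logistic model (1).
    probs = 1.0 / (1.0 + np.exp(-k_pq * (samples @ a - b)))
    H_y = binary_entropy(probs.mean())           # (5): p(Y | y^{i-1}) ~= sample mean
    H_y_given_w = binary_entropy(probs).mean()   # (6)-(7): averaged over samples
    return H_y - H_y_given_w                     # (4)

# Example: pick the most informative pair from a small candidate pool.
rng = np.random.default_rng(1)
samples = rng.uniform(-0.5, 0.5, size=(2000, 3))   # stand-in for posterior samples
items = rng.uniform(-0.5, 0.5, size=(6, 3))
pool = [(i, j) for i in range(6) for j in range(i + 1, 6)]
best = max(pool, key=lambda ij: info_gain(samples, items[ij[0]], items[ij[1]], k_pq=10.0))
print("most informative pair:", best)
```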
Equation (4) can be estimated in $O(dS)$ operations per pair, resulting in a computational cost of $O(dSM)$ for selecting each query, which, although more computationally feasible than the methods proposed so far, is still likely prohibitive for highly accurate information gain estimates over a large pool of candidate pairs. Because of these analytical and computational challenges, we develop two strategies that mimic the action of maximizing information gain while being more analytically and computationally tractable, respectively. In the next section we present our first strategy, which we analyze to obtain more refined upper and lower bounds on the number of queries needed to shrink the posterior to a desired volume. We then introduce a second strategy which benefits from reduced computational complexity while still remaining theoretically coupled to maximizing information gain.

3.3. Strategy 1: Equiprobable, Max-variance

In developing an approximation to information gain maximization, consider the scenario where arbitrary pairs of items can be generated (unconstrained to a given dataset), resulting in a bisecting hyperplane parameterized by $(a_i, b_i)$. In practice, such queries might correspond to the generation of synthetic items via tools such as generative adversarial networks (Goodfellow et al., 2014). With this freedom, we could consider an equiprobable query strategy, where $b_i$ is selected so that each item in the query will be chosen by the user with probability $\tfrac{1}{2}$. This strategy is motivated by the fact that the information gain of query $i$ is upper bounded by $H(Y_i \mid y^{i-1})$, which is maximized if and only if the response probability is equiprobable (Cover & Thomas, 2012).

To motivate the selection of query hyperplane directions, we define a query's projected variance, denoted $\sigma_i^2$, as the variance of the posterior marginal in the direction of the query's hyperplane, i.e., $\sigma_i^2 = a_i^T \Sigma_{W \mid y^{i-1}} a_i$. This corresponds to a measure of how far away the user point is from the hyperplane query, in expectation over the posterior distribution. With this notation, we have the following lower bound on information gain for equiprobable queries.

Proposition 3.3. For any equiprobable query scheme with noise constant $k_i$ and projected variance $\sigma_i^2$, for any choice of constant $0 \leq c \leq 1$ we have

$$I(W; Y_i \mid y^{i-1}) \;\geq\; \big(1 - h_b\!\left(f(c\, k_i \sigma_i)\right)\big)(1 - c) \;=:\; L_{c,k_i}(\sigma_i),$$

where $h_b(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)$.

This lower bound is monotonically increasing with $k_i \sigma_i$ and achieves a maximum information gain of 1 bit as $k_i \to \infty$ and/or $\sigma_i \to \infty$ (with an appropriate choice of $c$). This suggests choosing $a_i$ to maximize the projected variance, in addition to selecting $b_i$ according to the equiprobable strategy. Together, we refer to the selection of equiprobable queries in the direction of largest projected variance as the equiprobable-max-variance scheme, or EPMV for short. Our primary result concerns the expected number of comparisons (or query complexity) sufficient to reduce the posterior volume below a specified threshold set a priori, using EPMV.

Theorem 3.4. For the EPMV query scheme with each selected query satisfying $k_i \|a_i\| \geq k_{\min}$ for some constant $k_{\min} > 0$, consider the stopping time $T_\varepsilon = \min\{i : |\Sigma_{W \mid y^i}|^{1/d} < \varepsilon\}$ for stopping threshold $\varepsilon > 0$. For $\tau_1 = \frac{d}{2} \log_2\!\left(\frac{1}{2 \pi e \varepsilon}\right)$ and $\tau_2 = \frac{d}{2} \log_2\!\left(\frac{e^2 c_d}{2 \varepsilon}\right)$, we have

$$\tau_1 \;\leq\; \mathbb{E}[T_\varepsilon] \;\leq\; \tau_2 + \frac{\tau_2 + 1 - l(\tau_2)}{l(\tau_2)},$$

where $l(x) = L_{c,k_{\min}}(2^{-x})$ for any constant $0 \leq c \leq 1$, as defined in Proposition 3.3. Furthermore, the lower bound holds for any query selection scheme.
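As a rough illustration of the EPMV idea for unconstrained (synthetic) hyperplane queries, the sketch below (our own simplification, not the authors' code) chooses the hyperplane direction as the top eigenvector of the posterior sample covariance, i.e., the direction of largest projected variance, and then sets the threshold b at the empirical median of $a^T W$ over the posterior samples. The median makes the noiseless response exactly equiprobable and, under the noisy model, only approximately so; a more faithful implementation would instead solve for b such that the sample average of $f(k(a^T W - b))$ equals 1/2.

```python
import numpy as np

def epmv_hyperplane(samples):
    """Return (a, b) for an (approximately) equiprobable, max-variance query.

    samples : (S, d) batch of posterior samples of W
    """
    # Direction of largest posterior variance: top eigenvector of the covariance.
    cov = np.cov(samples, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    a = eigvecs[:, -1]                    # unit-norm direction; eigvals[-1] = sigma^2

    # Threshold that splits the posterior samples in half along direction a.
    b = np.median(samples @ a)
    return a, b

rng = np.random.default_rng(2)
samples = rng.multivariate_normal(mean=[0.1, -0.2],
                                  cov=[[0.04, 0.01], [0.01, 0.09]], size=5000)
a, b = epmv_hyperplane(samples)
print("direction:", a, " threshold:", b)
print("fraction of posterior mass with a^T w > b:", np.mean(samples @ a > b))  # ~0.5
```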
Theorem 3.4 follows from a martingale stopping-time analysis of the entropy at each query. Our next theorem presents a looser upper bound, but one that is more easily interpretable.

Theorem 3.5. The EPMV scheme, under the same assumptions as in Theorem 3.4, satisfies

$$\mathbb{E}[T_\varepsilon] = O\!\left(d \left(\log\frac{1}{\varepsilon} + \frac{1}{\varepsilon k_{\min}^2}\right)\right).$$

Furthermore, for any query scheme, $\mathbb{E}[T_\varepsilon] = \Omega\!\left(d \log\frac{1}{\varepsilon}\right)$.

This result has a favorable dependence on the dimension $d$, and the upper bound can be interpreted as a blend between two rates, one of which matches that of the generic lower bound. The second term in the upper bound provides some evidence that our ability to recover $w$ worsens as $k_{\min}$ decreases. This is intuitively unsurprising, since small $k_{\min}$ corresponds to the case where queries are very noisy. We hypothesize that the absence of such a penalty term in the lower bound is an artifact of our analysis, since increasing noise levels (i.e., decreasing $k_{\min}$) should limit the performance achievable by any querying strategy. On the other hand, for asymptotically large $k_i$, we have the following corollary:

Corollary 3.1. In the noiseless setting ($k_{\min} \to \infty$), EPMV has optimal expected stopping time complexity for posterior volume stopping.

Proof. When $k_{\min} \to \infty$, from Theorem 3.5, $\mathbb{E}[T_\varepsilon] = O\!\left(d \log\frac{1}{\varepsilon}\right)$; for any scheme, $\mathbb{E}[T_\varepsilon] = \Omega\!\left(d \log\frac{1}{\varepsilon}\right)$.

Taken together, these results suggest that EPMV is optimal with respect to posterior volume minimization up to a penalty term which decreases to zero for large noise constants. While low posterior volume is only a necessary condition for low MSE, this result could be strengthened to an upper bound on MSE by bounding the condition number of the posterior covariance matrix, which is left to future work. Yet, as we empirically demonstrate in Section 4, in practice our methods are very successful in reducing MSE.

While EPMV was derived under the assumption of arbitrary hyperplane queries, depending on the application we may have to select a pair from a fixed pool of items in a given dataset. For this purpose we propose a metric for any candidate pair that, when maximized over all pairs in a pool, approximates the behavior of EPMV. For a pair with items $p$ and $q$ in item pool $\mathcal{P}$, let $a_{pq} = 2(p - q)$ and $b_{pq} = \|p\|^2 - \|q\|^2$ denote the weights and threshold parameterizing the bisecting hyperplane. We choose a pair that maximizes the utility function (for some $\lambda > 0$)

$$\eta_1(p, q; \lambda) = k_{pq} \sqrt{a_{pq}^T \Sigma_{W \mid Y^{i-1}} a_{pq}} - \lambda \left|\hat{p}_1 - \tfrac{1}{2}\right|, \qquad \hat{p}_1 = P(Y_i = 1 \mid Y^{i-1}) = \mathbb{E}_{W \mid Y^{i-1}}\!\left[f\!\left(k_{pq}\left(a_{pq}^T W - b_{pq}\right)\right)\right]. \qquad (8)$$

This has the effect of selecting queries which are close to equiprobable and align with the direction of largest variance, weighted by $k_{pq}$ to prefer higher-fidelity queries. While $\Sigma_{W \mid Y^{i-1}}$ can be estimated once from a batch of posterior samples, $\hat{p}_1$ must be estimated for each candidate pair in $O(dS)$ operations, resulting in a computational cost of $O(dSM)$, which is on the same order as directly maximizing information gain. For this reason, we develop a second strategy that approximates EPMV while significantly reducing the computational cost.
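A direct way to implement the pool-based utility in (8) is sketched below (our own illustrative code; variable names and the value of λ are hypothetical). The projected-variance term reuses a covariance estimated once per round from the sample batch, while the equiprobable term $\hat{p}_1$ is the sample average of the response probability for each candidate pair.

```python
import numpy as np

def eta1(p_item, q_item, samples, cov, k_pq, lam=1.0):
    """EPMV pool utility (8): large projected variance, near-equiprobable response."""
    a = 2.0 * (p_item - q_item)
    b = p_item @ p_item - q_item @ q_item
    projected_std = np.sqrt(a @ cov @ a)                               # sqrt(a^T Sigma a)
    p_hat = np.mean(1.0 / (1.0 + np.exp(-k_pq * (samples @ a - b))))   # \hat{p}_1
    return k_pq * projected_std - lam * abs(p_hat - 0.5)

def select_pair_epmv(items, pool, samples, k0, lam=1.0):
    cov = np.cov(samples, rowvar=False)   # estimated once per selection round
    return max(pool, key=lambda ij: eta1(items[ij[0]], items[ij[1]],
                                         samples, cov, k_pq=k0, lam=lam))

rng = np.random.default_rng(3)
items = rng.uniform(-0.5, 0.5, size=(8, 3))
pool = [(i, j) for i in range(8) for j in range(i + 1, 8)]
samples = rng.uniform(-0.5, 0.5, size=(1000, 3))   # stand-in posterior samples
print("selected pair:", select_pair_epmv(items, pool, samples, k0=10.0))
```

Note that the per-pair evaluation of p_hat is what drives the O(dSM) cost quoted above, since it touches every posterior sample for every candidate pair.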
3.4. Strategy 2: Mean-cut, Max-variance

Our second strategy is a mean-cut strategy, where $b_i$ is selected such that the query hyperplane passes through the posterior mean, i.e., $a_i^T \mathbb{E}[W \mid Y^{i-1}] - b_i = 0$. For such a strategy, we have the following proposition:

Proposition 3.6. For mean-cut queries with noise constant $k_i$ and projected variance $\sigma_i^2$ we have $\frac{1}{e} \leq p(Y_i \mid y^{i-1}) \leq 1 - \frac{1}{e}$ and, for any $0 \leq c \leq 1$,

$$I(W; Y_i \mid y^{i-1}) \;\geq\; h_b\!\left(\tfrac{1}{e}\right) - \big(1 - L_{c,k_i}(\sigma_i)\big).$$

For large projected variances, we observe that $|p(Y_i \mid y^{i-1}) - \tfrac{1}{2}| \leq 0.14$, suggesting that mean-cut queries are somewhat of an approximation to equiprobable queries in this setting. Furthermore, notice that the lower bound on information gain in Proposition 3.6 is a monotonically increasing function of the projected variance. As $\sigma_i \to \infty$, this bound approaches $h_b(1/e) \approx 0.95$, which is nearly sharp since a query's information gain is upper bounded by 1 bit. This implies some correspondence between maximizing a query's information gain and maximizing the projected variance, as was the case in EPMV. Hence, our second strategy selects mean-cut, maximum variance queries (referred to as MCMV) and serves as an approximation to EPMV while still maximizing a lower bound on information gain.

For implementing MCMV over a fixed pool of pairs (rather than arbitrary hyperplanes), we calculate the orthogonal distance of each pair's hyperplane to the posterior mean as $|a_{pq}^T \mathbb{E}[W \mid Y^{i-1}] - b_{pq}| / \|a_{pq}\|_2$ and the projected variance as $a_{pq}^T \Sigma_{W \mid Y^{i-1}} a_{pq}$. We choose a pair that maximizes the following utility function, which is a tradeoff (tuned by $\lambda > 0$) between minimizing distance to the posterior mean, maximizing the noise constant, and maximizing the projected variance:

$$\eta_2(p, q; \lambda) = k_{pq} \sqrt{a_{pq}^T \Sigma_{W \mid Y^{i-1}} a_{pq}} - \lambda\, \frac{\left|a_{pq}^T \mathbb{E}[W \mid Y^{i-1}] - b_{pq}\right|}{\|a_{pq}\|_2}. \qquad (9)$$

This strategy is attractive from a computational standpoint, since the posterior mean $\mathbb{E}[W \mid Y^{i-1}]$ and covariance $\Sigma_{W \mid Y^{i-1}}$ can be estimated once in $O(d^2 S)$ computations, and the subsequent calculation of the hyperplane distance from the mean and of the projected variance requires only $O(d^2)$ computations per pair. Overall, this implementation of the MCMV strategy has a computational complexity of $O(d^2(S + M))$, which scales more favorably than both the information gain maximization and EPMV strategies.

We unify the information gain (referred to as Info Gain), EPMV, and MCMV query selection methods under a single framework described in Algorithm 1. At each round of querying, a pair is selected that maximizes a utility function $\eta(p, q)$ over a randomly downsampled pool of candidate pairs, with $\eta_0(p, q) \equiv I(W; Y_i \mid y^{i-1})$ for Info Gain, and $\eta_1$ from (8) and $\eta_2$ from (9) denoting the utility functions of EPMV and MCMV, respectively. We include a batch of posterior samples, denoted by $\widetilde{W}$, as an input to $\eta_0$ and $\eta_1$ to emphasize their dependence on posterior sampling, and add mean and covariance inputs to $\eta_2$ since, once these are estimated, MCMV requires no additional samples to select pairs. For all methods, we estimate the user point as the mean of the sample batch, since this is the MMSE estimator.

Algorithm 1: Pairwise search with noisy comparisons
  Input: item set X, parameters S, β, λ
  P ← set of all pairwise queries from items in X
  W̃_0, μ_0, Σ_0 ← initialize from samples of prior
  for i = 1 to T do
    P_β ← uniformly downsample P at rate 0 < β ≤ 1
    Info Gain: p_i, q_i ← arg max_{p,q ∈ P_β} η_0(p, q; W̃_{i−1})
    EPMV:      p_i, q_i ← arg max_{p,q ∈ P_β} η_1(p, q; λ, W̃_{i−1})
    MCMV:      p_i, q_i ← arg max_{p,q ∈ P_β} η_2(p, q; λ, μ_{i−1}, Σ_{i−1})
    y_i ← PairedComparison(p_i, q_i),  y^i ← y_i ∪ y^{i−1}
    W̃_i ← batch of S samples from posterior W | y^i
    μ_i, Σ_i ← Mean(W̃_i), Covariance(W̃_i)
    Ŵ_i ← μ_i
  end for
  Output: user point estimate Ŵ_T
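As a concrete instance of the MCMV utility (9) used in Algorithm 1, the following sketch (our own code, with hypothetical names and an arbitrary λ) scores each candidate pair using only the posterior mean and covariance, so each evaluation is a handful of matrix-vector products:

```python
import numpy as np

def eta2(p_item, q_item, mean, cov, k_pq, lam=1.0):
    """MCMV pool utility (9): high projected variance, hyperplane near the posterior mean."""
    a = 2.0 * (p_item - q_item)
    b = p_item @ p_item - q_item @ q_item
    projected_std = np.sqrt(a @ cov @ a)                   # sqrt(a^T Sigma a)
    mean_dist = abs(a @ mean - b) / np.linalg.norm(a)      # distance from mean to hyperplane
    return k_pq * projected_std - lam * mean_dist

def select_pair_mcmv(items, pool, samples, k0, lam=1.0):
    mean = samples.mean(axis=0)            # estimated once per round, O(dS)
    cov = np.cov(samples, rowvar=False)    # estimated once per round, O(d^2 S)
    return max(pool, key=lambda ij: eta2(items[ij[0]], items[ij[1]],
                                         mean, cov, k_pq=k0, lam=lam))

rng = np.random.default_rng(4)
items = rng.uniform(-0.5, 0.5, size=(8, 3))
pool = [(i, j) for i in range(8) for j in range(i + 1, 8)]
samples = rng.uniform(-0.5, 0.5, size=(1000, 3))   # stand-in posterior samples
print("selected pair:", select_pair_mcmv(items, pool, samples, k0=10.0))
```

Unlike the Info Gain and EPMV utilities, no per-pair pass over the posterior samples is needed, which is the source of the O(d^2(S + M)) complexity discussed above.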
To evaluate our approach, we constructed a realistic embedding (from a set of training user-response triplets) consisting of multidimensional item points, and simulated our pairwise search methods over randomly generated preference points and user responses (code available at https://github.com/siplab-gt/pairsearch). We constructed an item embedding of the Yummly Food-10k dataset of (Wilber et al., 2015; 2014), consisting of 958,479 publicly available triplet comparisons assessing relative similarity among 10,000 food items. The item coordinates are derived from the crowdsourced triplets using the popular probabilistic multidimensional scaling algorithm of (Tamuz et al., 2011) and the implementation obtained from the NEXT project (http://nextml.org).

4.1. Methods Comparison

We compare Info Gain, EPMV, and MCMV as described in Algorithm 1 against several baseline methods:

Random: pairs are selected uniformly at random and user preferences are estimated as the posterior mean.

Gauss Cloud-Q: pairs are chosen to approximate a Gaussian point cloud around the preference estimate that shrinks dyadically over Q stages, as detailed in (Massimino & Davenport, 2018).

Act Rank-Q: pairs are selected that intersect a feasible region of preference points and queried Q times; a majority vote is then taken to determine a single response, which is used with the pair hyperplane to further constrain the feasible set (Jamieson & Nowak, 2011). Since the original goal of the algorithm was to rank embedding items rather than estimate a continuous preference point, it does not include a preference estimation procedure; in our implementation we estimate the user preference as the Chebyshev center of the feasible region, since it is the deepest point in the set and is simple to compute (Boyd & Vandenberghe, 2004).

In each simulation trial, we generate a point $W$ uniformly at random from the hypercube $[-1, 1]^d$ and collect paired comparisons using the item points in our embedding according to the methods described above. The response probability of each observation follows (1) (referred to herein as the logistic model), using each of the three schemes for choosing $k_{pq}$ described in (K1)–(K3). In each scheme we optimized the value of $k_0$ over the set of training triplets via maximum-likelihood estimation according to the logistic model. We use the Stan Modeling Language (Carpenter et al., 2017) to generate posterior samples when required, since our model is LCC and is therefore particularly amenable to Markov chain Monte Carlo methods (Brooks et al., 2011).

Note that unlike Gauss Cloud-Q and Act Rank-Q, the Random, Info Gain, EPMV, and MCMV methods directly exploit a user response model in the selection of pairs and the estimation of preference points, which can be advantageous when a good model of user responses is available. Below we empirically test each method in this matched scenario, where the noise type (logistic) and the model for $k_{pq}$ (e.g., "constant", "normalized", or "decaying") are revealed to the algorithms. We also test a mismatched scenario by generating response noise according to a non-logistic response model while the methods above continue to calculate the posterior as if the responses were logistic. Specifically, we generate responses according to a Gaussian model

$$y_i = \operatorname{sign}\!\left(k_{pq}\left(a_i^T w - b_i\right) + Z\right), \qquad Z \sim \mathcal{N}(0, 1),$$

where $k_0$ and the model for $k_{pq}$ are selected using maximum-likelihood estimation on the training triplets.
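To clarify the matched/mismatched setup, the sketch below (our own illustration, not the paper's simulation code; the query and k value are arbitrary) draws a response to the same query under both the logistic model (1), used in the matched case and always assumed by the estimators, and the Gaussian sign model used to generate responses in the mismatch test:

```python
import numpy as np

rng = np.random.default_rng(5)

def respond_logistic(w, a, b, k_pq):
    # Matched case: Bernoulli response with P(y = 1) = f(k_pq (a^T w - b)).
    p = 1.0 / (1.0 + np.exp(-k_pq * (a @ w - b)))
    return int(rng.random() < p)

def respond_gaussian(w, a, b, k_pq):
    # Mismatched case: y = sign(k_pq (a^T w - b) + Z), Z ~ N(0, 1), mapped to {0, 1}.
    return int(k_pq * (a @ w - b) + rng.normal() > 0)

w = rng.uniform(-1, 1, size=4)                    # simulated user point
p_item, q_item = rng.uniform(-1, 1, size=(2, 4))  # a candidate item pair
a, b = 2.0 * (p_item - q_item), p_item @ p_item - q_item @ q_item
print("logistic response:", respond_logistic(w, a, b, k_pq=2.0))
print("gaussian response:", respond_gaussian(w, a, b, k_pq=2.0))
```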
4.2. Mean Squared Error Evaluation

The left column of Fig. 2 plots the MSE of each method's estimate with respect to the ground truth location over the course of a pairwise search run. In the matched model case of Fig. 2a, our strategies outperform Random, Act Rank-Q, and Gauss Cloud-Q for multiple values of Q by a substantial margin. Furthermore, both of our strategies performed similarly to Info Gain, corroborating their design as information maximization approximations. Note that Random outperforms the other baseline methods, supporting the use of Bayesian estimation in this setting (separately from the task of active query selection). Although mismatched noise results in decreased performance overall in Fig. 2c, the same relative trends between the methods as in Fig. 2a are evident.

Figure 2. Performance evaluation over 80 simulated search queries, averaged over 50 trials per method and plotted with one standard error. (a) Estimation error: matched logistic noise, d = 4. (b) Ranking performance: matched logistic noise, d = 4. (c) Estimation error: mismatched Gaussian noise, d = 4. (d) Ranking performance: mismatched Gaussian noise, d = 4. (Left column) MSE. (Right column) For each trial, a batch of 15 items was uniformly sampled without replacement from the dataset, and the normalized Kendall's Tau distance (lower distance is better) was calculated between a ranking of these items by distance to the ground truth preference point and a ranking by distance to the estimated point. To get an unbiased estimate, this metric was averaged over 1000 batches per trial, and error bars were calculated with respect to the number of trials. (Top row) Normalized logistic model with matching noise in d = 4. (Bottom row) Decaying logistic model with mismatched Gaussian normalized noise in d = 4. Additional plots testing a wider selection of parameters are available in the supplement. Overall, our new strategies (EPMV, MCMV) outperform existing methods and also perform comparably to information gain maximization (Info Gain), which they were designed to approximate.

4.3. Item Ranking Evaluation

We also consider each method's performance with respect to ranking embedding items in relation to a preference point. For each trial, a random set of 15 items is sampled from the embedding without replacement and ranked according to their distance to a user point estimate. This ranking is compared to the ground truth ranking produced by the true user point by calculating a normalized Kendall's Tau distance, which is 0 for identical rankings and 1 for completely discordant rankings (Jamieson & Nowak, 2011). This metric measures performance in the context of a recommender system type task (a common application of preference learning) rather than solely measuring preference estimation error.

This metric is depicted in the right column of Fig. 2, for the matched model case in Fig. 2b and the mismatched case in Fig. 2d. The same trends as observed in the MSE analysis occur, with our strategies performing similarly to Info Gain and outperforming all other methods. This is a particularly noteworthy result in that our method produces more accurate rankings than Act Rank-Q, which to our knowledge is the state-of-the-art method in active embedding ranking.
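For reference, the normalized Kendall's Tau distance used above can be computed as the fraction of item pairs that the two rankings order differently, as in this short sketch (our own helper, not the paper's evaluation code):

```python
import numpy as np

def normalized_kendall_tau_distance(rank_a, rank_b):
    """Fraction of item pairs ordered differently by the two rankings (0 = identical, 1 = reversed)."""
    rank_a, rank_b = np.asarray(rank_a), np.asarray(rank_b)
    n = len(rank_a)
    i, j = np.triu_indices(n, k=1)
    discordant = np.sum(np.sign(rank_a[i] - rank_a[j]) != np.sign(rank_b[i] - rank_b[j]))
    return discordant / (n * (n - 1) / 2)

# Rank a batch of items by distance to the true point w and to an estimate w_hat.
rng = np.random.default_rng(6)
items = rng.uniform(-1, 1, size=(15, 4))
w_true = rng.uniform(-1, 1, size=4)
w_hat = w_true + 0.2 * rng.normal(size=4)
rank_true = np.argsort(np.argsort(np.linalg.norm(items - w_true, axis=1)))
rank_est = np.argsort(np.argsort(np.linalg.norm(items - w_hat, axis=1)))
print("normalized Kendall tau distance:", normalized_kendall_tau_distance(rank_true, rank_est))
```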
4.4. Discussion

Our simulations demonstrate that both Info Gain approximation methods, EPMV and MCMV, significantly outperform the state-of-the-art techniques in active preference estimation in the context of low-dimensional item embeddings with noisy user responses, and perform similarly to Info Gain, the method they were designed to approximate. This is true even when generating noise according to a different model than the one used for Bayesian estimation. These empirical results support the theoretical connections between EPMV, MCMV, and Info Gain presented in this study, and suggest that the posterior volume reduction properties of EPMV may in fact allow for MSE reduction guarantees. These results also highlight the attractiveness of MCMV, which proved to be a top performer in embedding preference learning yet is computationally efficient and simple to implement. This technique may also find utility as a subsampling strategy in supervised learning settings with implicit pairwise feedback, such as in (Wu et al., 2017). Furthermore, although in this work pairs were drawn from a fixed embedding, MCMV is easily adaptable to continuous item spaces that allow for generative construction of new items to compare. This is possible in some applications, such as facial composite generation for criminal cases (Frowd et al., 2011) or in evaluating foods and beverages, where we might be able to generate nearly arbitrary stimuli based on the ratios of ingredients (Ventura et al., 2011).

Acknowledgements

We thank the reviewers for their useful feedback and comments, as well as colleagues for insightful discussions, including Matthieu Bloch, Yao Xie, Justin Romberg, Stefano Fenu, Marissa Connor, and John Lee. This work is supported by NSF grants CCF-1350954 and CCF-1350616, ONR grant N00014-15-1-2619, and a gift from the Alfred P. Sloan Foundation.

References

Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.

Bradley, R. A. and Terry, M. E. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324-345, 1952.

Brochu, E., Freitas, N. D., and Ghosh, A. Active preference learning with discrete choice data. In Advances in Neural Information Processing Systems, pp. 409-416, 2008.

Brooks, S., Gelman, A., Jones, G., and Meng, X.-L. Handbook of Markov Chain Monte Carlo. CRC Press, 2011.

Cao, Z., Qin, T., Liu, T.-Y., Tsai, M.-F., and Li, H. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, pp. 129-136. ACM, 2007.

Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., and Riddell, A. Stan: A probabilistic programming language. Journal of Statistical Software, 76(1), 2017.

Chen, Y. and Suh, C. Spectral MLE: Top-k rank aggregation from pairwise comparisons. In International Conference on Machine Learning, pp. 371-380, 2015.

Chen, Y., Hassani, S. H., Karbasi, A., and Krause, A. Sequential information maximization: When is greedy near-optimal? In Conference on Learning Theory, pp. 338-363, 2015.

Chu, W. and Ghahramani, Z. Preference learning with Gaussian processes. In Proceedings of the 22nd International Conference on Machine Learning, pp. 137-144. ACM, 2005.

Coombs, C. H. Psychological scaling without a unit of measurement. Psychological Review, 57(3):145, 1950.

Cover, T. M. and Thomas, J. A. Elements of Information Theory. John Wiley & Sons, 2012.

David, H. A. The Method of Paired Comparisons, volume 12. London, 1963.

Eriksson, B. Learning to top-k search using pairwise comparisons. In Artificial Intelligence and Statistics, pp. 265-273, 2013.

Frowd, C., Bruce, V., Pitchford, M., Gannon, C., Robinson, M., Tredoux, C., Park, J., Mcintyre, A., and Hancock, P. J. Evolving the face of a criminal: how to search a face space more effectively. Soft Computing, 15(1):61-70, 2011.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672-2680, 2014.

Guo, Y., Tian, P., Kalpathy-Cramer, J., Ostmo, S., Campbell, J. P., Chiang, M. F., Erdogmus, D., Dy, J. G., and Ioannidis, S. Experimental design under the Bradley-Terry model. In IJCAI, pp. 2198-2204, 2018.

Houlsby, N., Huszar, F., Ghahramani, Z., and Hernández-Lobato, J. M. Collaborative Gaussian processes for preference learning. In Advances in Neural Information Processing Systems, pp. 2096-2104, 2012.

Jamieson, K. G. and Nowak, R. Active ranking using pairwise comparisons. In Advances in Neural Information Processing Systems, pp. 2240-2248, 2011.

Jamieson, K. G., Katariya, S., Deshpande, A., and Nowak, R. D. Sparse dueling bandits. In AISTATS, 2015.

Lindley, D. V. On a measure of the information provided by an experiment. The Annals of Mathematical Statistics, pp. 986-1005, 1956.

MacKay, D. J. Information-based objective functions for active data selection. Neural Computation, 4(4):590-604, 1992.

Massimino, A. K. and Davenport, M. A. As you like it: Localization via paired comparisons. arXiv preprint arXiv:1802.10489, 2018.

Negahban, S., Oh, S., and Shah, D. Iterative ranking from pair-wise comparisons. In Pereira, F., Burges, C. J. C., Bottou, L., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems, pp. 2474-2482, 2012.

Prasad, S. Certain relations between mutual information and fidelity of statistical estimation. arXiv preprint arXiv:1010.1508, 2010.

Qian, L., Gao, J., and Jagadish, H. Learning user preferences by adaptive pairwise comparison. Proceedings of the VLDB Endowment, 8(11):1322-1333, 2015.

Saumard, A. and Wellner, J. A. Log-concavity and strong log-concavity: a review. Statistics Surveys, 8:45, 2014.

Settles, B. Active Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 6(1):1-114, 2012.

Shah, N. B. and Wainwright, M. J. Simple, robust and optimal ranking from pairwise comparisons. Journal of Machine Learning Research, 18(199):1, 2017.

Shah, N. B., Balakrishnan, S., Bradley, J., Parekh, A., Ramchandran, K., and Wainwright, M. J. Estimation from pairwise comparisons: Sharp minimax bounds with topology dependence. The Journal of Machine Learning Research, 17(1):2049-2095, 2016.

Tamuz, O., Liu, C., Belongie, S., Shamir, O., and Kalai, A. T. Adaptively learning the crowd kernel. In International Conference on Machine Learning (ICML), May 2011.

Van Der Maaten, L. and Weinberger, K. Stochastic triplet embedding. In Machine Learning for Signal Processing (MLSP), 2012 IEEE International Workshop on, pp. 1-6. IEEE, 2012.

Ventura, E. E., Davis, J. N., and Goran, M. I. Sugar content of popular sweetened beverages based on objective laboratory analysis: focus on fructose content. Obesity, 19(4):868-874, 2011.

Wauthier, F., Jordan, M., and Jojic, N. Efficient ranking from pairwise comparisons. In International Conference on Machine Learning, pp. 109-117, 2013.

Wilber, M., Kwak, I. S., Kriegman, D., and Belongie, S. Learning concept embeddings with combined human-machine expertise. In Proceedings of the IEEE International Conference on Computer Vision, pp. 981-989, 2015.

Wilber, M. J., Kwak, I. S., and Belongie, S. J. Cost-effective HITs for relative similarity comparisons. In Second AAAI Conference on Human Computation and Crowdsourcing, 2014.

Wu, L., Hsieh, C.-J., and Sharpnack, J. Large-scale collaborative ranking in near-linear time. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 515-524. ACM, 2017.