# nearneighbor_methods_in_random_preference_completion__b698a4c9.pdf

The Thirty-Third AAAI Conference on Artiﬁcial Intelligence (AAAI-19)

Near-Neighbor Methods in Random Preference Completion

Ao Liu Department of Computer Science Rensselaer Polytechnic Institute Troy, NY 12180, USA liua6@rpi.edu

Qiong Wu, Zhenming Liu Department of Computer Science College of William and Mary Williamsburg, VA 23187, USA qwu05@email.wm.edu and zliu@cs.wm.edu

Lirong Xia Department of Computer Science Rensselaer Polytechnic Institute Troy, NY 12180, USA xial@cs.rpi.edu

This paper studies a stylized, yet natural, learning-to-rank problem and points out the critical incorrectness of a widely used nearest neighbor algorithm. We consider a model with n agents (users) {xi}i [n] and m alternatives (items) {yl}l [m], each of which is associated with a latent feature vector. Agents rank items nondeterministically according to the Plackett-Luce model, where the higher the utility of an item to the agent, the more likely this item will be ranked high by the agent. Our goal is to identify near neighbors of an arbitrary agent in the latent space for prediction. We ﬁrst show that the Kendall-tau distance based k NN produces incorrect results in our model. Next, we propose a new anchor-based algorithm to ﬁnd neighbors of an agent. A salient feature of our algorithm is that it leverages the rankings of many other agents (the so-called anchors ) to determine the closeness/similarities of two agents. We provide a rigorous analysis for one-dimensional latent space, and complement the theoretical results with experiments on synthetic and real datasets. The experiments conﬁrm that the new algorithm is robust and practical.

1 Introduction In a learning-to-rank problem, there is a set of agents (users) X = {x1, . . . xn} and a set of alternatives (items) Y = {y1, . . . ym}. Each agent reveals her preferences over a subset of alternatives. The goal is to infer agents preferences over all alternatives, including those that are not rated or ranked. This fundamental machine learning problem has many practical applications. For example, recommender systems use an agent revealed preferences to discover other alternatives she might be interested in; product designers learn from consumers past choices to estimate the demand curve of a new product; defenders can predict terrorists preferences based on their past behavior; and political parties can evaluate campaign options based on voters preferences. See (?) for a recent survey. Rating vs. ranking. Agents preferences can be represented by either a rating for each alternative (e.g., an integer rating in Netﬂix), or a ranking over the alternatives (i.e., complete ordering). Rating-based approaches have many known drawbacks (?; ?), including (i) agents often have different

Copyright c 2019, Association for the Advancement of Artiﬁcial Intelligence (www.aaai.org). All rights reserved.

scales for ratings; and (ii) numeric values are often less robust than ranking-based approaches. In fact, rating data can always be converted to ranking data (e.g., y1 is ranked higher than y2 if y1 has a higher rating) and thus ranking-based models and algorithms are more general. We focus on ranking data. A common approach to infer an agent s preference is to ﬁrst identify near neighbors of the agent in terms of the Kendall-Tau (KT) distance, then aggregate their rankings to produce a prediction. The KT distance is a metric that counts the number of pairwise disagreements between two ranking lists. This approach was proposed by ? (?), and their algorithm will be referred to as KT-k NN in this paper. Many subsequent work are based on the following assumption (?; ?; ?; ?; ?). Assumption 1. KT distance is a good measure of similarity between agents. No theoretical justiﬁcation for this assumption was known until recently. ? (?) proposed a latent utility model to justify the assumption. In their model, each agent or alternative is associated with a latent feature. Alternative j s utility to agent i is controlled by a deterministic function of the similarity in their latent features. Under this model, consistency result is established for the KT-k NN algorithms. However, this model assumes that agents preferences are deterministic, which is unrealistic in many settings. For example, an agent can exhibit irrational behavior, or provide only a noisy version of her preferences. In fact, human preferences are often highly non-deterministic. Various statistical models have been built to model such randomness, pioneered by the Nobel Laureate Mc Fadden (?) among many other researchers. Therefore, the following question remains open. How can we learn an agent s random preferences from other agents random preferences? This question can be answered by designing algorithms for two closely-related problems: (i) preference completion (PC): given each agent s preferences over a subset of alternatives, the goal is to estimate its preference over all the alternatives. (ii) near neighbors (NN): given an agent xi, the goal is to ﬁnd agents xi close to xi in the latent space. Standard techniques exist to use algorithms for the NN problem to solve a PC problem ( (?; ?); see also Appendix D). Therefore, we focus on the NN problem in this paper.

Our Contributions. Our main conceptual contribution is the combination of a distance-based latent model and random preferences for learning to rank. To the best of our knowledge, while there is a large literature in each component, we are the ﬁrst to consider both. See related work for more discussions. Our model is called distance-based random preference model. Let the latent feature of agent i (alternative j) be xi (yj). Agent i s preferences are determined by a utility function u(xi, yj) = θ(xi, yj) + ϵi,j, where θ(xi, yj) is a deterministic monotonically decreasing distance-based function and ϵi,j is a zero mean independent random variable. Our model captures two pervasive characteristics of ranking datasets: Ch1. Economically meaningful θ( , ) function. u(xi, yj) is high in expectation when xi and yj are close. An agent is more likely to prefer alternatives with similar latent features to itself. Ch2. Random preference model. The function u(xi, yj) contains a noise term ϵi,j to capture uncertainties in agents behaviors. Our technical contributions are two-fold. First, we prove that Assumption 1 does not hold anymore in our distancebased random preference model. More precisely, we prove that the agents found by the KT-k NN algorithm (?) is far away from the given agents with high probability, even when n, m . Second, we design an anchor-based algorithm for ﬁnding an agent s near neighbors under random preferences. The algorithm is based on the following natural idea: if two agents i1 and i2 are close, then their KT distance to any other agent j (an anchor) should also be close. The algorithm proceeds by using the KT distance to other agents as an agent s feature, and measures the closeness between two agents by the L1 distance of their features. We prove that asymptotically our algorithm identiﬁes an agent s near neighbors with high probability when the latent space is 1dimensional. Many techniques we developed can be generalized to high-dim settings. Experiments on synthetic data verify our theoretical ﬁndings, and demonstrate that our algorithm is robust in highdim spaces. Experiments on Netﬂix data shows that our anchor-based algorithm is superior to the KT-k NN algorithm and a standard collaborative ﬁlter (using the cosinesimilarities to determine neighbors). Related Work and Discussions. While using random utility models in learning-to-rank problems is not new (?; ?; ?; ?; ?; ?; ?), we are not aware of any that simultaneously achieves both Ch1 and Ch2. Random utility-based ranking algorithms (?; ?; ?) address Ch2, but the function θ(xi, yj) often does not have an explicit economics interpretation. For example, let Θ Rn m be a matrix such that Θi,j = θ(xi, yj). (?; ?) assume that Θ is low rank. But the low rank assumption does not have explicit economically interpretation. While recent non-parametric models (e.g., (?)) allow one to use economically interpretable functions θ (addressing Ch1), they operate only under deterministic utility models. Parametric preference learning has been extensively studied in machine learning, especially learning to rank (?; ?; ?; ?; ?; ?; ?). These works are different with ours as it is of-

ten assumed that agents preferences are generated from a parametric model.

2 Preliminaries

Distance-Based Random Preference Model. Let X = {x1, . . . , xn} Rd denote the set of agents and let Y = {y1, . . . , ym} Rd denote the set of alternatives. We slightly abuse the notation and use xi to refer to both agent i and her latent features. Each agent xi has a ranking (preference list) Ri = [yj1 yjm] over Y, where means prefer to . We observe only a subset of Ri for each i [n]. Utility functions and the random utility model. Agent i s expected utility on alternative j is determined by a function θ(xi, yj). Throughout this paper, we use θ(xi, yj) = exp( xi yj 2), where . 2 is the ℓ2-norm. Agent i s ranking Ri is determined by the widely-used Plackett-Luce model (?; ?). The realized utility of alternative yj for agent i is generated by u(xi, yj) θ(xi, yj) + ϵi,j, where ϵi,j is a zero mean independent random variable that follows the Gumbel distribution. Then, agent i ranks the alternatives in decreasing order of their realized utilities. The density function of the Plackett-Luce model has a closedform formula. Let yj1 i yj2 represent that yj1 is ahead of yj2 in Ri and let j1, j2, . . . jm be a permutation of [m]. We have

Pr [yj1 i i yjm] =

θ(xi, yjt) Pm t =t θ(xi, yjt ). (1)

The marginal distribution between alternatives j1 and j2 is Pr[yj1 i yj2] = θ(xi,yj1) θ(xi,yj1)+θ(xi,yj2).

Distributions of X and Y. xi and yj are i.i.d. generated from distributions DX and DY . The supports of DX and DY are a cube B(d) in Rd, where B(d) = {v Rd : v c}, where c is a constant. We adopt the standard near uniform assumption for DX and DY (?; ?; ?; ?). Deﬁnition 1. Consider a continuous distribution D on B(d) with probability density function f D(x). D is near-uniform if sup f D(x)

inf f D(x) is bounded by a constant, where x B(d).

Let f X and f Y be the PDFs of DX and DY respectively. Deﬁne c X = sup f X(x)

inf f X(x) and c Y = sup f Y (x)

inf f Y (x) .

Observation model. We observe only agent i s ranking over a subset Oi [m] of alternatives. Each alternative j is in Oi independently with probability p. The Oi s are also independently generated across different agents. Let ROi be the ordered list over Oi Y that is consistent with R (i.e., ROi is the partial ranking of R over Oi). For each agent i, we observe ROi i . The near neighbor problem. Here, an algorithm needs to ﬁnd near neighbors of an input agent. An algorithm is a k(n, m)-NN solver with parameter τ(n) if for any input agent i, the algorithm outputs k agents i1, i2, . . . ik, and with overwhelming probability, |xi xij| τ(n), where τ(n) = o(1).

Algorithm 1: KT-k NN (it produces incorrect results)

1 Input: {ROj j }j [n], k, and an agent xi.

2 Output: k neighbors near agent i in the latent space.

3 Find j1, j2, , jn 1( [n]/{i}) such that

NK(ROi i , R Oj1 j1 ) NK(ROi i , R Ojn 1 jn 1 )

4 Return XKT-k NN {j1, . . . jk}

We often write k-NN or k NN instead of k(n, m)-NN when k s dependencies on m and n are not critical. Additional notations and examples. For an arbitrary ordered list R, we use it calligraphic form R to extract the rank of an alternative. For example, suppose yj is the top-ranked alternative in R, then R(yj) = 1. Let I(v) be an indicator that sets to 1 if the argument v is true; if false, it sets to 0. Let |R| be the length of the list R. Let R1 and R2 be two ordered lists over the same set of alternatives. The normalized Kendall-Tau distance between R1 and R2 is

NK(R1, R2) = 1 |R1| 2 X

j1 =j2 R1 I R1(j1) R1(j2)

R2(j1) R2(j2) < 0 . (2) When R1 and R2 do not have the same support, the normalized KT distance is deﬁned as NK(RO 1 , RO 2 ), where O = R1 R2. To facilitate analysis, sometimes we need to introduce new agents outside X. For a new agent with latent features x, let Rx denote its ranking over Y and let ROx x denote the observed ranking. Conditional probability and expectations. There are multiple levels of randomness for producing the rankings Ri s: (i) xi and yj are random and (ii) u(xi, yj) consists of a random component (i.e., randomness from the Plackett-Luce model). Care must be taken when operating the conditional random variables deﬁned in our process. For example,

E[NK(Ri1, Ri2 | X] refers to ﬁxing the latent positions of the agents and taking expectations over Y and randomness from the Plackett-Luce model.

E[NK(Ri1, Ri2) | X, Y] refers to ﬁxing the latent positions of both alternatives and agents and taking expectations over randomness from the Plackett-Luce model.

3 Inefﬁcacy of KT-k NN In this section, we will prove the inefﬁcacy of KT-k NN algorithm by ? (?) (Algorithm 1) in our distanced-based random preference model. This implies that Assumption 1 does not hold in our model. Recall that KT-k NN uses KT distances to ﬁnd an agent s neighbors based on the intuition that when xi and xj are close, their opinion on alternatives utilities should be similar. The next theorem show that this intuition does not hold in our model, by proving that KT-k NN does not return any near neighbors for a large fraction of xi.

Theorem 1. Consider Algorithm KT-k NN under distancebased random preference model in which d = 1 and p = 1. Let DY and DX be uniform distributions on [ 1, 1]. For any constant ϵ, any xi [ 1 + ϵ, 0.5] [0.5, 1 ϵ], and any k n/ ln5 n, we have

min x KT-k NN({R Oj j }j,k,xi) ||x xi||2 ϵ = Ω(1). (3)

with high probability. The probability comes from random X/{xi}, random Y, and random preferences.

Remarks. Theorem 1 states that KT-k NN fails to work even for the simple case where d = 1 and DX = DY = Uniform([ 1, 1]). Eq. (3) is a strong result because trivial algorithms exist to ﬁnd an agent xj whose distance to xi is Θ(1) (just picking up an arbitrary xj). In addition, this result continues to hold for large populations (e.g., when n, m and p = 1), suggesting that the limitation of the KT-based approach roots at the structural properties of the NK function. In addition, if we use KT-k NN to solve PC problem by applying standard techniques, it will also produce poor results (see Appendix D and Lemma 8 there). Comparison to (?). ? (2008) proved that KT-k NN is effective under the deterministic utility model. This suggests that with the presence of uncertainties in the utility function (a more realistic assumption), the algorithmic structure of the NN problem is signiﬁcantly altered. Intuitions behind Theorem 1. The following example highlights the salient structures of KT-distances.

Example 1 (Near-neighbors in expectation). Let x1 = 0, y1 = 0.5, and y2 = 1. Let x = arg minx E[NK(Rx, R1) | x1, x, Y] (e.g., where would we place an agent that minimizes its KT distance to x1?). One would hope that when x and x1 are close, Rx and R1 is close, but here x = 0.5. Speciﬁcally, let a be the probability x1 prefers y1 to y2 (i.e., a θ(x1,y1) θ(x1,y1)+θ(x1,y2)) and let b(x) be the probability x prefers y1 to y2. Recall that agents support is [ 1, 1]. We need to solve

x = arg min x E[NK(Rx, R1) | x1, x, Y]

= arg min x a(1 b(x)) + (1 a)b(x).

Here, we aim to minimize the weighted sum of a [0, 1] and (1 a) via controlling b(x). The optimal solution has a simple structure: when a > (1 a) (equivalently, a > 0.5), we need to set the weight associated with a as small as possible, which means setting b(x) to the largest possible value. When a < (1 a), b(x) needs to be minimized. Thus, the optimal solution uses the following threshold rules (assume a = 0.5 for simplicity).

x ( 1, y1) if a > 0.5 (y2, 1) if a < 0.5 .

This minimizer is far from x1.

See also Example 3 in Appendix B.5 for another small and concrete example, in which KT-k NN produces poor output.

3.1 Proof sketch of Theorem 1 We use intuitions from Example 1 to prove the theorem. Speciﬁcally, deﬁne

Gi(x) E[NK(Ri, Rx) | xi, x].

First, note that NK(Ri, Rx) concentrates at Gi(x) when m is sufﬁciently large. This comes from the concentration behavior of the NK function: Lemma 1. Let µ = E[NK(Ri, Rj) | X] = Gi(xj). We have

Pr [|NK(Ri, Rj) µ| δµ | X] 4m exp δ2mµ

See Appendix B.1 for the proof. The terms in NK are not independent terms so we cannot directly apply Chernoff bounds. Our proof uses the combinatorial structure of the NK function to decouple the dependencies among terms. The technique we develop can be of independent interests. Let x = arg minx Gi(x). Below is our main lemma: Lemma 2. Let DY be uniform distribution on [ 1, 1]. Let xi be any agent in [ 1, 0.5]. We have

arg min x Gi(x) = 1.

Similarly, when xi [0.5, 1], arg minx Gi(x) = 1. For any xi [ 1, 0.5] [0.5, 1], Lemma 1 and Lemma 2 give us:

min i NK(Ri , Ri) min i Gi(xi ) Gi( 1),

where the ﬁrst approximation comes from the concentration bound of NK and the second approximation comes from the fact that there must exist one agent close to -1 when the number of agent is large. Therefore, all the neighbors produced by KT-k NN are far from xi (see Appendix B.4 for a rigorous analysis).

Proof of Lemma 2. By linearity of expectation, we have

Gi(x) = 1 m 2 X

ℓ1 =ℓ2 E h NK R {yℓ1 ,yℓ2 } i , R {yℓ1 ,yℓ2 } x x, xi i

= E h NK R {yℓ1 ,yℓ2 } i , R {yℓ1 ,yℓ2 } x x, xi i .

The last equality holds because yj s are i.i.d. samples from DY . Deﬁne

pi(y1, y2) Pr[y1 i y2 | y1, y2, xi] = θ(xi, y1) θ(xi, y1) + θ(xi, y2) .

px(y1, y2) Pr[y1 x y2 | y1, y2, x] = θ(x, y1) θ(x, y1) + θ(x, y2) .

When the context is clear, we shall refer to pi(y1, y2) and px(y1, y2) as pi and px, respectively. We have

Gi(x) = Ey1,y2 px(1 pi) + (1 px)pi | xi, x .

One can see that Gi(x) is a smooth function (the ﬁrst derivative exists). Our proof consists of three parts. Part 1. When x ( 1, xi], Gi(x)/ x > 0, Part 2. When x [xi, xi], Gi(x) Gi(xi) > 0, and Part 3. When x [ xi, 1), Gi(x)/ x > 0.

The proof for part 3 is similar to part 1. Proving part 2 is also simpler. Therefore, we focus only on the proof for part 1. Proof for part 2 and 3 is deferred to Appendix B.2. We now show that when x ( 1, xi], Gi(x)/ x > 0. We have (see Fact 2 in Appendix B.2):

x = E [Φ(y1, y2, x, xi)| xi, x] , (4)

Φ(y1, y2, x, xi) e (i) 1,2 1

e (i) 1,2+1 sign(y1 x) sign(y2 x)

4 cosh2 (x) 1,2/2

(i) 1,2 |y2 xi| |y1 xi| (x) 1,2 |y2 x| |y1 x|

Here, (i) 1,2 ( (x) 1,2) measures whether xi (x) is closer to y2 or y1. Similar to Example 1, they serve as important quantities determining the structure of Gi(x)/ x (and therefore the optimal solution). One can check that Φ(y1, y2, x, xi) = Φ(y2, y1, x, xi). Therefore,

x = Ey1,y2 [Φ(y1, y2, x, xi) | x, xi, (y1 y2)] .

Central to our analysis is carefully partitioning the event y1 y2 into four disjoint (sub)-events. Under each event, the conditional expectation of Φ can be computed in a straightforward manner. Speciﬁcally, deﬁne E1: when x y1 y2 or y1 y2 x. Thus, Pr[E1 | y1 y2] = x2+1

2 . E2: when y1 < x and y2 1 + 2xi. Thus, Pr[E2 | y1 y2] = (x + 1)xi. E3: when y1 < x and x < y2 < 2xi x. Thus, Pr[E3 | y1 y2] = (x + 1)(xi x). E4: when y1 < x and 2xi x y2 < 1 + 2xi. Thus, Pr[E4 | y1 y2] = (x+1)2

2 . Figure 3(a) in Appendix A visualizes the events to complement the analysis. We now interpret the meaning of these events. Event E1. Event E1 represents the case in which y1 and y2 are on the same side of x. In this case, any movement of x without passing y1, y2 will not change the probability that y1 i y2 occurs (i.e., Pr[y1 i y2 | y1, y2, x, E1] = Pr[y1 i y2 | y1, y2, x δ, E1] for any sufﬁciently small δ). Therefore, E1 does not impact E[Φ] and E[Φ | E1, x, xi] = 0. Event E2. Under this event, one can check that |xi y1| < |xi y2| (i.e., xi is closer to y1 than to y2). In this case, when we increase the value of x (recall that y1 < x < xi), NK(R{y1,y2} i , R{y1,y2} x ) will increase. This is equivalent to E[Φ | E2, x, xi] > 0. Event E3. Under this event, one can check that |xi y1| > |xi y2|.In this case, when we increase the value of x, NK(R{y1,y2} i , R{y1,y2} x ) will decrease. This is equivalent to E[Φ | x, xi, E3] < 0. Event E4. Under this event |xi y1| |xi y2| is positive (we call this a positive event) with probability 0.5 and is negative

Figure 1: (a) Functions Gt(x) for c = 1, and xt = 1, 0.4, and 1. Observations: (i) when xt = 1, the function Gt(x) is a monotone function; (ii) for all xt, Gt(x) grows in sublinear manner when x is close to 1. (b) Deﬁnition of I1, I2 and I3 used in the lower bound proof for Lemma 3 (Section 4). Agents in I1 and I3 are effective anchor agents. (c) Intuition of E2 and E3 in the proof of Lemma 4 (Section 4).

(we call this a negative event) with probability 0.5. The ﬁrst order term conditioned under the positive event cancels out that conditioned under the negative event. Therefore, E[Φ | x, xi, E4] will be a small term. With the above intuition, we have

E[Φ | x, xi] E[Φ | x, xi, E2] Pr[E2 | x, xi] + E[Φ | x, xi, E3] Pr[E3 | x, xi].

We now relate E2 and E3 to Example 1. Event E2 corresponds to the setting where xi is closer to y1 than to y2. As explained in Example 1, when we move x to the right, we increase the expected KT distance, whereas in E3 when we move x to the right, we decrease the expected KT distance. The crucial observation is that E2 is much more likely to happen than E3. The observation becomes clear with visualization in Fig. 3b (i.e., the area for E2 is much larger than that for E3). Using these intuitions, we have (see Appendix B.2 for the full analysis):

Fact 1. Using notations above, we have ,

E[Φ | x, xi] = X

t=2,3,4 E[Φ | x, xi, Et] Pr[Et | x, xi] > 0.

This completes the proof of Lemma 2.

4 Anchor-based nearest neighbor algorithms

This section develops a new (high-dimensional) feature Fi for each agent i so that | Fi Fj| is small if and only if xi and xj are close. We present the positive results in the most general form in 1-dim latent space (i.e., we allow DX and DY to be any near-uniform distribution, p = o(1), and the latent space is [ c, c] for any constant c). Index convention. This section uses i, j, and t to index agents and ℓ(include ℓ1, ℓ2, etc.) to index alternatives.

Intuition of the design of Fi. Our key idea is to leverage a third agent, namely an anchor agent, to determine the closeness of two agents xi and xj. Let xt be a third agent. Compute NK(Rt, Ri) and NK(Rt, Rj). If xt were chosen appropriately, then NK(Rt, Ri) NK(Rt, Rj) if and only if xi and xj are close. For example, if xt = c, then NK(Rt, Ri) Gt(xi) and the function Gt(x) is a monotone function (see the blue curve in Fig. 1a). We have Gt(xi) Gt(xj) if and only if xi xj. We face two key technical challenges: C1. Not all agents can be served as effective anchor agents. For example, when

xt = 0, Gt(x) = Gt( x), which cannot separate xi and xj when xi = xj. C2. the efﬁcacy of the anchor agent is sensitive to xi and xj. For example, when xt = c (see again Fig. 1a), the function Gt(x) grows in a sub-linear manner so it is less effective in detecting the closeness of xi and xj when they are both close to c. To address C1, we use all possible agents as anchor agents. To address C2, we need to develop new probabilistic techniques to analyze Gt(x).

Our features. Let Fi = (Fi,1, Fi,2, Fi,n), where

Fi,t ERi,Rt,Y[NK(Ri, Rt) | X] = Gt(xi).

Then we use the L1-distance between Fi and Fj to determine whether xi and xj are close. Deﬁne

D(xi, xj) 1 n 2

t {i,j} |Fi,t Fj,t|. (5)

The summation excludes i and j because xi and xj themselves cannot be used as anchor agents.

Replacing the features F by estimates. Fi,t is not directly given so we use the empirical estimate as the plug-in estimator. Deﬁne ˆFi,t = NK(ROi i , ROt t ) and our estimator is ˆD(xi, xj) = 1 n 2 P

t {i,j} | ˆFi,t ˆFj,t| (see Algorithm 2).

Algorithm 2: Anchor-k NN

1 Input: {ROj j }j [n], k, and agent xi.

2 Output: k neighbors near agent i.

3 Compute ˆFi,j = NK(ROi i , ROj j ) for all i, j [n].

4 Compute ˆD(xi, xj) = 1 n 2 P

t =i,j | ˆFi,t ˆFj,t|.

5 Find j1, j2, , jn 1 ( [n]/{i}) such that ˆD(xi, xj1) ˆD(xi, xj2) ˆD(xi, xjn 1).

6 Return XAnchor-k NN {j1, , jk}.

Theorem 2. Consider Algorithm Anchor-k NN under distance-based random preference model. Let DX and DY be any near-uniform distribution on [ c, c]. Let τ(n) be an arbitrary quality parameter so that τ(n) = o(1)

τ(n) = ω n 1

ln n . For any xi [ c, c], any m =

ω ln3 n p2 τ 4(n) ln2 ln3 n p2 τ 4(n) and any k = o(n τ 2(n)

ln 1 n), we have

max x Anchor-k NN({R Oj j }j,k,xi) ||x xi||2 τ(n)

with high probability. The probability comes from random X/{xi}, random Y, and random preferences.

Remark. τ(n) cannot be too small because our function D(xi, xj) cannot measure the distance of two agents well if they are too close. m needs to grow when p (fewer samples) or τ(n) decreases (higher quality requirement), which is intuitive. k measures the number of numbers an algorithm can ﬁnd so that their distance is within τ(n); larger k means the algorithm is more powerful. Let D(xi, xj) = E[D(xi, xj)]. In the remainder of this section, we analyze the behavior of of D. The function ˆD concentrates at D and can be shown by using simple Chernoff bounds (see Appendix C.2 for a complete analysis).

Lemma 3. For any near-uniform DX, DY on [ c, c] and any two agents xi, xj, we have

c3(c) ln 1 n |xi xj|2 D(xi, xj) |xi xj|,

where c3 is a constant that depends only on c.

Upper bound proof for Lemma 3. The upper bound requires only a straightforward calculation. Recall that pi = pi(y1, y2) = Pr[y1 i y2 | y1, y2, xi]. We have

D(xi, xj) = Ext Ey1,y2[(pi pj)(1 2pt) | xi, xj, xt]

Ey1,y2 [|pi pj| | xi, xj] |xi xj|. (6)

The last inequality is shown in Fact 6 in Appendix C.1.

Lower bound bound proof for Lemma 3. Here we analyze only the case |xi xj| 2c 2 ln n. When |xi xj| > 2c 2 ln n (e.g., xi and xj are around the boundaries c and c, respectively), the result is trivial. Wlog, assume that xi < xj. We partition [ c, c] into three intervals and consider anchor agents in each of these intervals. Speciﬁcally, deﬁne (see also Figure 1(b))

I1 h c, xi c

i , I2 xi c

& I3 h xj + c

The agents in I2 are less effective anchors (C2). We use trivial bound for terms in I2 (|Fi,t Fj,t| 0). Focusing on I1 and I3, we have

4c c X Ext Gt(xi) Gt(xj) | xt I1, xi, xj +

c xj 4c c X Ext Gt(xi) Gt(xj) | xt I3, xi, xj .

Note that xi+c

4c c X + c xj

4c c X is at least Ω(1). Now we show that Ext Gt(xi) Gt(xj) | xt I1, xi, xj is at least in the order of |xi xj|2. The analysis for the other term is similar. Below is the lemma we need (related to C2):

Lemma 4. For any near-uniform DX, DY on [ c, c] such that |xi xj| 2c 2 ln n and xt I1, we have

|Gt(xi) Gt(xj)| c4(c) |xi xj|2,

where c4(c) = 1 e c/4

1+e c/4 1 96c2 cosh2(c) c2 Y (0, 1

Proof of Lemma 4. Note that |xi xj| 2c 2 ln n and xt I1 imply xj xi 2xt + c. Because

|Gt(xi) Gt(xj)| = R xj xi Gt(x)

x dx , we aim to give a

bound for Gt(x)

x . Speciﬁcally,

x 3c4(c) (c2 x2).

when x 2xt + c (or equivalently, xt x c

2 ). Re-cycle the deﬁnition of Φ, (i) 1,2, and (x) 1,2 used in the analysis of Theorem 1 (see also Appendix A for the notation summary) so that G(x)

x = Ey1<y2[Φ(y1, y2, x, xt) | x, xt] We partition the positions of {y1, y2} into events (see also Fig. 1c for a visualization):

E1: when x y1 y2 or y1 y2 x. E2: when y1 [ x 7c

8 ] and y2 x+c

2 . We have Pr[E2 | y1 < y2] c2 x2

E3: when (y1, y2) / E1 E2. As explained below, we do not need to explicitly calculate Pr[E3 | y1 y2].

We now explain the intuition associated with these events.

Event E1. Because y1 and y2 are on the same side of x, we have E[Φ | E1, x, xt] = 0 (see also Lemma 2).

Event E2 and event E3. When (y1, y2) E2 E3, we have |y1 xt| |y2 xt| (xt is closer to y1 than to y2). This is because (i) |y2 xt| x xt when y2 x and x I1, and (ii) |y1 xt| max{xt c, x xt} x xt since y1 x and x 2xt + c. This conclusion is trivial when the positions of the points are visualized (Figure 1(c)). In these events, an increment in x will result in an increment in Gt(x). Therefore E[Φ | E2 E3, x, xt] 0.

Event E2. Knowing Φ can be arbitrarily close to 0 when E2 E3 happens. Here, we need also identify an event so that Φ is at least a positive constant. Event E2 serves for this purpose because it ensures x to be far from the mid-point of y1 and y2, so G(x) behaves like a linear function in this region (and thus its derivative behaves like a constant).

Intuition on dependencies on |xi xj|2. We now explain why |Gt(xi) Gt(xj)| depends on |xi xj|2 instead of |xi xj|. Only if (xi, xj) / E1, Φ will be non-zero. This requires y1 x y2. Recall we only interests to xi x xj. When xi and xj get too close to c, Pr[y1 < x < y2 | x] = Θ(xi + c) Θ(|xi xj|). When we carry out integration over Gt(x)/ x, this term will lead to an additional factor of |xi xj| (see also Figure 1(a) for simulated Gt(x)).

Figure 2: (a) Average pairwise probability prediction error (or pairwise error, see Appendix E.1 for the deﬁnition) for different k [101, 1601]. (b) Pairwise error when k = 751 (optimal k for ground-truth k NN). (c) The average latent distance between xi and its predicted 751 near neighbors. (d) Validation for high dimension (1 10 dimensional latent space). Let k = 73 ln2 n and measure the pairwise errors when k = 73. (e) The average latent distance when k = 73.

Using the above intuition, we have

Ey1<y2[Φ | x, xt] Ey1<y2[Φ | E2, x, xt] Pr y1<y2[E2 | x, xt]

= Ω(1) c2 x2

The last inequality uses the fact that Ey1 y2[Φ | E2, x, xi] = Ω(1) (Fact 7 in Appendix C). By carrying out an integration over Gt(x), we get |Gt(xi) Gt(xj)| c4(c) |xi xj|2.

Lemma 4 and its similar result for I3 sufﬁce to establish the lower bound.

5 Experiments Our experiments aim to (i) conﬁrm the behaviors of KTk NN and Anchor-k NN for ﬁnite sample size when d = 1, (ii) understand the behavior of Anchor-k NN in high-dim settings, and (iii) validate the practicality of our algorithm over real-world datasets. Details of all experiments are in Appendix E.1. Note that this is a theoretical paper. Extensive evaluations on realworld data are an promising direction for future work. 1-dim synthetic data. In this experiment we compare the efﬁcacy of KT-k NN, Anchor-k NN, and Ground-Truthk NN. Ground-Truth-k NN assumes an oracle access to an input s (ground-truth) k-nearest neighbors so this is an optimal k NN. Figure 2(a) plots the performance of Anchork NN,KT-k NN and Ground-truth-k NN in completing the preferences for different k. Figure 2(b) plots the completion/prediction error for different xi using an optimal k (chosen by using cross-validations for Ground-Truth-k NN). Figure 2(c) shows the average latent distance between the return set and xi. The experiments conﬁrm that (i) Anchor-k NN is consistently better, (ii) KT-k NN does not return near neighbors when xi c

2, and (iii) when xi c

2, KT-k NN is the poorest at completing preferences. Appendix E.1 provides additional experiments for different settings on m. High-dim synthetic data. We repeat the experiments for d = 1, , 10. Figure 2(d) is the prediction error and Figure 2(e) is the average distance. Anchor-k NN continues to outperform KT-k NN, suggesting that our algorithm works

for d > 1. The performance of all algorithms deteriorate for large d because of the curse-of-dimensionality problem in neighbor-based algorithms (?). Real dataset. We examine the performance of Anchork NN using the standard Netﬂix dataset (?; ?). The baselines are KT-k NN and a standard collaborative ﬁltering algorithm using cosine-similarity (?; ?). Table 2 in Supplementary materials presents the results. Anchor-k NN consistently outperforms KT-k NN and collaborative ﬁltering using cosine-similarity. Our experiments suggests that our distance-based random preference model and Anchor-k NN seem to be quite practical.

6 Conclusion This paper introduced a natural learning-to-rank model, and showed that under this model a widely used KT-distance based k NN algorithm failed to ﬁnd similar agents (users), challenging the assumptions made in many prior preference completion algorithms. To ﬁx the problem, we introduced a new anchor-based algorithm Anchor-k NN that uses all the agents ranking data to determine the closeness of two agents. Our approach is in sharp contrast to most existing feature engineering methods. We provided a rigorous analysis for Anchor-k NN for 1-dim latent space, and performed experiments on both synthetic and real datasets. Our experiments showed that Anchor-k NN works in high dim space and promises to outperform other widely used techniques.

7 Acknowledgments We thank all anonymous reviewers for helpful comments and suggestions. AL and LX are supported by NSF #1453542 and ONR #N00014-17-1-2621. QW and ZL are supported by NSF CRII:III #1755769.