# Preference Completion from Partial Rankings

Suriya Gunasekar, University of Texas, Austin, TX, USA. suriya@utexas.edu
Oluwasanmi Koyejo, University of Illinois, Urbana-Champaign, IL, USA. sanmi@illinois.edu
Joydeep Ghosh, University of Texas, Austin, TX, USA. ghosh@ece.utexas.edu

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

We propose a novel and efficient algorithm for the collaborative preference completion problem, which involves jointly estimating individualized rankings for a set of entities over a shared set of items, based on a limited number of observed affinity values. Our approach exploits the observation that while preferences are often recorded as numerical scores, the predictive quantity of interest is the underlying rankings. Thus, attempts to closely match the recorded scores may lead to overfitting and impair generalization performance. Instead, we propose an estimator that directly fits the underlying preference order, combined with nuclear norm constraints to encourage low rank parameters. Besides (approximate) correctness of the ranking order, the proposed estimator makes no generative assumption on the numerical scores of the observations. One consequence is that the proposed estimator can fit any consistent partial ranking over a subset of the items, represented as a directed acyclic graph (DAG), generalizing standard techniques that can only fit preference scores. Despite this generality, for supervision representing total or blockwise total orders, the computational complexity of our algorithm is within a log factor of the standard algorithms for nuclear norm regularization based estimates for matrix completion. We further show promising empirical results for a novel and challenging application of collaborative ranking of the associations between brain regions and cognitive neuroscience terms.

1 Introduction

Collaborative preference completion is the task of jointly learning bipartite (or dyadic) preferences of a set of entities for a shared list of items, e.g., user-item interactions in a recommender system [14; 22]. It is commonly assumed that such entity-item preferences are generated from a small number of latent or hidden factors, or equivalently, that the underlying preference value matrix is low rank. Further, if the observed affinity scores from various explicit and implicit feedback are treated as exact (or mildly perturbed) entries of the unobserved preference value matrix, then the preference completion task naturally fits in the framework of low rank matrix completion [22; 38]. More generally, low rank matrix completion involves predicting the missing entries of a low rank matrix from a vanishing fraction of its entries observed through a noisy channel. Several low rank matrix completion estimators and algorithms have been developed in the literature, many with strong theoretical guarantees and empirical performance [6; 32; 21; 28; 38; 10].

Recent research in the preference completion literature has noted that using a matrix completion estimator for collaborative preference estimation may be misguided [11; 33; 23], as the observed entity-item affinity scores from implicit/explicit feedback are potentially subject to systematic monotonic transformations arising from limitations in feedback collection, e.g., quantization and inherent biases.
While simple user biases and linear transformations can be handled within a low rank matrix framework, more complex transformations such as quantization can potentially increase the rank of the observed preference score matrix significantly, thus adversely affecting recovery using standard low rank matrix completion [13]. Further, despite the common practice of measuring preferences using numerical scores, predictions are most often deployed or evaluated based on item rankings; e.g., in recommender systems, user recommendations are often presented as a ranked list of items without the underlying scores. Indeed, several authors have shown that favorable empirical/theoretical performance in mean square error for the preference matrix often does not translate to better performance when performance is measured using ranking metrics [11; 33; 23]. Thus, collaborative preference estimation may be better posed as a collection of coupled learning to rank (LETOR) problems [25], where we seek to jointly learn the preference rankings of a set of entities by exploiting the low dimensional latent structure of the underlying preference values.

This paper considers preference completion in a general collaborative LETOR setting. Importantly, while the observations are assumed to be reliable indicators of relative preference ranking, their numerical scores may deviate substantially from the ground truth low rank preference matrix. Therefore, we aim to address preference completion under the following generalizations:

1. In a simple setting, for each entity, a score vector representing its observed affinity interactions is assumed to be generated from an arbitrary monotonic transformation of the corresponding entries of the ground truth preference matrix. We make no further generative assumptions on the observed scores beyond monotonicity with respect to the underlying low rank preference matrix.
2. We also consider a more general setting, where the observed preferences of each entity specify a partial ranking in the form of a directed acyclic graph (DAG), in which the nodes represent a subset of items and each edge represents a strict ordering between a pair of nodes. Such rankings may be encountered when the preference scores are consolidated from multiple sources of feedback, e.g., comparative feedback (pairwise or listwise) solicited for independent subsets of items (see the sketch following this list). This generalized setting cannot be handled by standard matrix completion without some way of transforming the DAG orderings into a score vector.
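To make the DAG supervision in the second setting concrete, the following minimal sketch (not from the paper; the helper name and the edge convention, with edges pointing from a more preferred item to a less preferred one, are illustrative) merges listwise feedback over overlapping item subsets into a set of DAG edges and checks that the merged constraints remain acyclic, i.e., that they form a consistent partial ranking.

```python
from collections import defaultdict, deque

def consolidate_partial_ranking(listwise_observations):
    """Merge listwise orderings (each given from most to least preferred) over
    possibly overlapping item subsets into DAG edges (i, k), read as
    'item i is preferred over item k', and check that they are consistent."""
    edges = set()
    for ranked_items in listwise_observations:
        for better, worse in zip(ranked_items, ranked_items[1:]):
            edges.add((better, worse))
    # Kahn's algorithm: a topological order exists iff the merged edges are acyclic
    nodes = {v for edge in edges for v in edge}
    indegree, successors = defaultdict(int), defaultdict(list)
    for better, worse in edges:
        successors[better].append(worse)
        indegree[worse] += 1
    queue = deque(v for v in nodes if indegree[v] == 0)
    visited = 0
    while queue:
        v = queue.popleft()
        visited += 1
        for w in successors[v]:
            indegree[w] -= 1
            if indegree[w] == 0:
                queue.append(w)
    return edges, visited == len(nodes)

# two listwise observations over overlapping subsets of items
edges, consistent = consolidate_partial_ranking([["c", "b", "a"], ["d", "b"]])
print(sorted(edges), consistent)   # [('b', 'a'), ('c', 'b'), ('d', 'b')] True
```

A cycle in the merged edges would indicate contradictory feedback, which falls outside the consistent-partial-ranking supervision assumed in the remainder of the paper.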
Our work is in part motivated by an application to neuroimaging meta-analysis, as outlined in the following. Cognitive neuroscience aims to quantify the link between brain function and behavior. This link is most often measured in humans using functional magnetic resonance imaging (fMRI) experiments that record brain activity in response to behavioral tasks. After analysis, the conclusions are often summarized in neuroscience publications, which include a table of brain locations that are most strongly activated in response to an experimental stimulus. These results can then be synthesized using meta-analysis techniques to derive accurate predictions of brain activity associated with cognitive terms (also known as forward inference) and predictions of cognitive terms associated with brain regions (also known as reverse inference). For our study, we used data from neurosynth [36], a public repository (http://neurosynth.org/) which automatically scrapes information on published associations between brain regions and terms in cognitive neuroscience experiments.

The key contributions of the paper are summarized below:

- We propose a convex estimator for low rank preference completion using limited supervision, addressing (a) arbitrary monotonic transformations of preference scores, and (b) partial rankings over items (Section 3.1).
- We derive generalization error bounds for a surrogate ranking loss that quantify the trade off between data fit and regularization (Section 5).
- We propose efficient algorithms for computing the estimate under totally and partially ordered observations. In the case of total orders, in spite of the increased generality, the computational complexity of our algorithm is within a log factor of the standard convex algorithms for matrix completion (Section 4).
- The proposed algorithm is evaluated on a novel application of identifying associations between brain regions and cognitive terms from the neurosynth dataset [37] (Section 6). Such a large scale meta-analysis, synthesizing information from the literature and related tasks, has the potential to lead to novel insights into the role of brain regions in cognition and behavior.

1.1 Notation

For a matrix $M \in \mathbb{R}^{d_1\times d_2}$, let $\sigma_1 \ge \sigma_2 \ge \ldots$ be the singular values of $M$. Then the nuclear norm is $\|M\|_* = \sum_i \sigma_i$, the operator norm is $\|M\|_{op} = \sigma_1$, and the Frobenius norm is $\|M\|_F = \sqrt{\sum_i \sigma_i^2}$. Let $[N] = \{1, 2, \ldots, N\}$. A vector or a set $x$ indexed by $j \in [N]$ is sometimes denoted as $(x_j)_{j=1}^N$, or simply $(x_j)$ whenever $N$ is unambiguous. Let $\Omega \subseteq [d_1]\times[d_2]$ denote a subset of indices of a matrix in $\mathbb{R}^{d_1\times d_2}$. For $j \in [d_2]$, $\Omega_j = \{(i', j') \in \Omega : j' = j\} \subseteq \Omega$ denotes the subset of entries in $\Omega$ from the $j$th column. Given $\Omega = \{(i_s, j_s) : s = 1, 2, \ldots, |\Omega|\}$, $\mathcal{P}_\Omega : X \mapsto (X_{i_s j_s})_{s=1}^{|\Omega|} \in \mathbb{R}^{|\Omega|}$ is the linear subsampling operator, and $\mathcal{P}^*_\Omega : \mathbb{R}^{|\Omega|} \to \mathbb{R}^{d_1\times d_2}$ is its adjoint, i.e., $\langle y, \mathcal{P}_\Omega(X)\rangle = \langle X, \mathcal{P}^*_\Omega(y)\rangle$. For conciseness, we sometimes use the notation $X_\Omega$ to denote $\mathcal{P}_\Omega(X)$.
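As a quick numerical illustration of this notation (a standalone sketch, not part of the paper; the index set and variable names are arbitrary), the three norms and the subsampling operator $\mathcal{P}_\Omega$ with its adjoint can be realized as follows.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2 = 6, 4
M = rng.standard_normal((d1, d2))

sigma = np.linalg.svd(M, compute_uv=False)
nuclear_norm = sigma.sum()                    # ||M||_*
operator_norm = sigma[0]                      # ||M||_op
frobenius_norm = np.sqrt((sigma ** 2).sum())  # ||M||_F, same as np.linalg.norm(M)

omega = [(0, 1), (2, 3), (5, 0)]              # observed (row, column) index pairs

def P(X):
    """Subsampling operator P_Omega: matrix -> vector of observed entries."""
    return np.array([X[i, j] for i, j in omega])

def P_adj(y):
    """Adjoint P_Omega^*: vector of observed entries -> matrix supported on Omega."""
    Z = np.zeros((d1, d2))
    for s, (i, j) in enumerate(omega):
        Z[i, j] = y[s]
    return Z

# adjoint identity: <y, P(M)> == <M, P_adj(y)>
y = rng.standard_normal(len(omega))
assert np.isclose(y @ P(M), np.sum(M * P_adj(y)))
```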
2 Related Work

Matrix completion: Low rank matrix completion has an extensive literature; a few examples include [22; 6; 21; 28], among several others. However, the bulk of these works, including those in the context of ranking/recommendation applications, focus on (a) fitting the observed numerical scores using the squared loss, and (b) evaluating the results on parameter/rating recovery metrics such as root mean squared error (RMSE). The shortcomings of such estimators and results using the squared loss in ranking applications have been studied in some recent research [12; 11]. Motivated by collaborative ranking applications, there has been growing interest in addressing matrix completion within an explicit LETOR framework. Weimer et al. [35] and Koyejo et al. [23] propose estimators that involve non convex optimization problems, and their algorithmic convergence and generalization behavior are not well understood. Some recent works provide parameter recovery guarantees for pairwise/listwise ranking observations under specific probabilistic distributional assumptions on the observed rankings [31; 26; 29]. In comparison, the estimators and algorithms in this paper are agnostic to the generative distribution, and hence have much wider applicability.

Learning to rank (LETOR): LETOR is a structured prediction task of rank ordering the relevance of a list of items as a function of pre-selected features [25]. Currently, the leading algorithms for LETOR are listwise methods [9] (as is the approach taken in this paper), which fully exploit the ranking structure of ordered observations and offer better modeling flexibility compared to pointwise [24] and pairwise methods [16; 18]. A recent listwise LETOR algorithm introduced the idea of monotone retargeting (MR) [2], which elegantly addresses the listwise LETOR task while maintaining the relative simplicity and scalability of pointwise estimation. MR was further extended to incorporate margins in the margin equipped monotonic retargeting (MEMR) formulation [1], to preclude trivial solutions that arise from the scale invariance of the initial MR estimate of Acharyya et al. [2]. The estimator proposed in this paper is inspired by the idea of MR and will be revisited later in the paper. In collaborative preference completion, rather than learning a functional mapping from features to rankings, we seek to exploit the low rank structure in jointly modeling the preferences of a collection of entities, without access to preference indicative features.

Single index models (SIMs): Finally, the literature on monotonic single index models (SIMs) also considers estimation under unknown monotonic transformations [17; 20]. However, algorithms for SIMs are designed to solve the harder problem of exactly estimating the non parametric monotonic transformation, and are evaluated on parameter recovery rather than ranking performance. In general, with no further assumptions, the sample complexity of SIM estimators renders them unsuitable for high dimensional estimation. The existing high dimensional estimators for learning SIMs typically assume Lipschitz continuity of the monotonic transformation, which explicitly uses the observed score values in bounding the Lipschitz constant of the transformation [19; 13]. In comparison, our proposed model is completely agnostic to the numerical values of the preference scores.

3 Preference Completion from Partial Rankings

Let the unobserved true preference scores of $d_2$ entities for $d_1$ items be denoted by a rank $r \ll \min\{d_1, d_2\}$ matrix $\Theta^* \in \mathbb{R}^{d_1\times d_2}$. For each entity $j \in [d_2]$, we observe a partial or total ordering of preferences over a subset of items denoted by $I_j \subseteq [d_1]$. Let $n_j = |I_j|$ denote the number of items over which the relative preferences of entity $j$ are observed, so that $\Omega_j = \{(i, j) : i \in I_j\}$ denotes the entity-item index set for $j$, and $\Omega = \bigcup_j \Omega_j$ denotes the index set collected across entities. Let $P_\Omega$ denote the sampling distribution for $\Omega$. The observed preferences of entity $j$ are typically represented by a listwise preference score vector $y^{(j)} \in \mathbb{R}^{n_j}$:

$$\forall j \in [d_2],\quad y^{(j)} = g_j\big(\mathcal{P}_{\Omega_j}(\Theta^* + W)\big), \qquad (1)$$

where each $g_j$ is an arbitrary and unknown monotonic transformation, and $W \in \mathbb{R}^{d_1\times d_2}$ is a non adversarial noise matrix sampled from a distribution $P_W$. The preference completion task is to estimate the unseen rankings within each column of $\Theta^*$ from the subset of orderings $(\Omega_j, y^{(j)})_{j\in[d_2]}$. As the $g_j$ are arbitrary, the exact values of $(y^{(j)})$ are inconsequential, and the observed preference order can be specified by a constraint set parameterized by a margin parameter $\epsilon$ as follows.

Definition 1 ($\epsilon$-margin Isotonic Set). The following set of vectors is isotonic to $y \in \mathbb{R}^n$ with margin parameter $\epsilon > 0$:
$$\mathcal{R}^n_\epsilon(y) = \{x \in \mathbb{R}^n : \forall i, k \in [n],\ y_i < y_k \Rightarrow x_i \le x_k - \epsilon\}.$$
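For concreteness, a direct $O(n^2)$ membership check written straight from Definition 1 is sketched below; the function name and toy values are illustrative only, not from the paper.

```python
import numpy as np

def in_isotonic_set(x, y, eps):
    """Membership test for R^n_eps(y): y_i < y_k must imply x_i <= x_k - eps;
    ties in y place no constraint on the corresponding entries of x."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    lower = y[:, None] < y[None, :]                    # pairs (i, k) with y_i < y_k
    margin_ok = x[:, None] <= x[None, :] - eps
    return bool(np.all(margin_ok[lower]))

y = [2, 2, 5]                                          # items 0 and 1 tied, item 2 preferred
print(in_isotonic_set([0.3, 0.1, 1.4], y, eps=1.0))    # True
print(in_isotonic_set([0.3, 0.9, 1.4], y, eps=1.0))    # False: 1.4 - 0.9 < 1.0
```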
In addition to score vectors, isotonic sets of the form $\mathcal{R}^n_\epsilon(y)$ are equivalently defined for any DAG $y = \mathcal{G}([n], E)$ denoting a partial ranking among the vertices, with the convention that an edge points from the preferred item to the less preferred one: $(i, k) \in E \Rightarrow \forall x \in \mathcal{R}^n_\epsilon(y),\ x_i \ge x_k + \epsilon$. We note from Definition 1 that ties are not broken at random; e.g., if $y_{i_1} = y_{i_2} < y_k$, then $\forall x \in \mathcal{R}^n_\epsilon(y)$, $x_{i_1} \le x_k - \epsilon$ and $x_{i_2} \le x_k - \epsilon$, but no particular ordering between $x_{i_1}$ and $x_{i_2}$ is specified. Let $y_{(k)}$ denote the $k$th smallest entry of $y \in \mathbb{R}^n$. We distinguish between three special cases of an observation $y$ representing a partial ranking over $[n]$:

(A) Strict Total Order: $y_{(1)} < y_{(2)} < \ldots < y_{(n)}$.
(B) Blockwise Total Order: $y_{(1)} \le y_{(2)} \le \ldots \le y_{(n)}$, with $K \ll n$ unique values.
(C) Arbitrary DAG: partial order induced by a DAG $y = \mathcal{G}([n], E)$.

3.1 Monotone Retargeted Low Rank Estimator

Consider any scalable pointwise learning algorithm that fits a model to exact preference scores. Since no generative model (besides monotonicity) is assumed for the raw numerical scores in the observations, in principle, the scores $y^{(j)}$ for entity $j$ can be replaced, or retargeted, to any ranking-preserving scores, i.e., to any vector in $\mathcal{R}^{n_j}_\epsilon(y^{(j)})$. Monotone Retargeting (MR) [2] exploits this observation to address the combinatorial listwise ranking problem [25] while maintaining the relative simplicity and scalability of pointwise estimates (regression). The key idea in MR is to alternately fit a pointwise algorithm to the current relevance scores, and retarget the scores by searching over the space of all monotonic transformations of the scores. Our approach extends and generalizes monotone retargeting for the preference prediction task.

We begin by motivating an algorithm for the noise free setting, where it is clear that $\Theta^*_{\Omega_j} \in \mathcal{R}^{n_j}_\epsilon(y^{(j)})$, so we seek to estimate a candidate preference matrix $X$ that lies in the intersection of (a) the data constraints from the observed preference rankings, $\{X_{\Omega_j} \in \mathcal{R}^{n_j}_\epsilon(y^{(j)})\}$, and (b) the model constraints, in this case low rankness induced by constraining the nuclear norm $\|X\|_*$. For robust estimation in the presence of noise, we extend the noise free approach by incorporating a soft penalty on constraint violations. Let $z \in \mathbb{R}^{|\Omega|}$, and with slight abuse of notation, let $z_{\Omega_j} \in \mathbb{R}^{n_j}$ denote the vector of entries of $z$ corresponding to $\Omega_j \subseteq \Omega$. Upon incorporating the soft penalties, the monotone retargeted low rank estimator is given by:

$$\hat{X} = \operatorname*{Argmin}_{X}\ \min_{z\in\mathbb{R}^{|\Omega|}}\ \lambda\|X\|_* + \frac{1}{2}\big\|z - \mathcal{P}_\Omega(X)\big\|_2^2 \quad \text{s.t.}\ \forall j,\ z_{\Omega_j} \in \mathcal{R}^{n_j}_\epsilon\big(y^{(j)}\big), \qquad (2)$$

where the parameter $\lambda$ controls the trade off between nuclear norm regularization and data fit, and $\hat{X}$ is the set of minimizers of (2). We note that $\mathcal{R}^n_\epsilon(y)$ is convex, and that for $\lambda \ge 1$ the scaling $\lambda\mathcal{R}^n_\epsilon(y) = \{\lambda x : x \in \mathcal{R}^n_\epsilon(y)\} \subseteq \mathcal{R}^n_\epsilon(y)$. The above estimate can be computed using efficient convex optimization algorithms and can handle arbitrary monotonic transformations of the preference scores, thus providing greater flexibility than standard matrix completion. Although (2) is specified in terms of two parameters, due to the geometry of the problem it turns out that $\lambda$ and $\epsilon$ are not jointly identifiable, as discussed in the following proposition.

Proposition 1. The optimization in (2) is jointly convex in $(X, z)$. Further, for all $\gamma > 0$, the parameter pairs $(\lambda, \gamma\epsilon)$ and $(\gamma^{-1}\lambda, \epsilon)$ lead to equivalent estimators; specifically, $\hat{X}(\lambda, \gamma\epsilon)$ and $\hat{X}(\gamma^{-1}\lambda, \epsilon)$ coincide up to a positive scaling by $\gamma$.

Since positive scaling of $\hat{X}$ preserves the resultant preference order, by Proposition 1 and without loss of generality, only one of $\epsilon$ or $\lambda$ requires tuning, with the other held fixed.
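As a small sketch of how the two terms of (2) interact (illustrative only; the function and variable names are not from the paper), the objective for a candidate pair $(X, z)$ can be evaluated as below, with the isotonic constraints on each $z_{\Omega_j}$ enforced separately, e.g., via the projections of Section 4.1.

```python
import numpy as np

def retargeting_objective(X, z_cols, omega_cols, lam):
    """Objective of (2): lam * ||X||_*  +  0.5 * sum_j || z_j - X[I_j, j] ||_2^2.

    omega_cols[j] : array of observed row indices I_j for column j
    z_cols[j]     : retargeted scores for column j (assumed to satisfy the
                    isotonic-set constraint z_j in R^{n_j}_eps(y^(j)))
    """
    nuclear = np.linalg.svd(X, compute_uv=False).sum()
    data_fit = sum(
        0.5 * np.sum((z_cols[j] - X[rows, j]) ** 2)
        for j, rows in omega_cols.items()
    )
    return lam * nuclear + data_fit
```

A larger $\lambda$ favors a smaller nuclear norm (effectively lower rank) at the expense of a looser fit to the retargeted scores, which is the trade off controlled by $\lambda$ in (2).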
4 Optimization Algorithm

The optimization problem in (2) is jointly convex in $(X, z)$. Further, we show below that the proximal operator of the non differentiable component of the estimate, $\lambda\|X\|_* + \sum_j \mathbb{I}\big(z_{\Omega_j} \in \mathcal{R}^{n_j}_\epsilon(y^{(j)})\big)$, is efficiently computable. This motivates using the proximal gradient descent algorithm [30] to jointly update $(X, z)$. With an appropriate step size $\alpha = 1/2$, the resulting updates are as follows.

X Update (Singular Value Thresholding): The proximal operator for $\tau\|\cdot\|_*$ is the singular value thresholding operator $S_\tau$. For $X$ with singular value decomposition $X = U\Sigma V^\top$ and $\tau \ge 0$, $S_\tau(X) = U s_\tau(\Sigma) V^\top$, where $s_\tau$ is the soft thresholding operator given by $s_\tau(x)_i = \max\{x_i - \tau, 0\}$.

z Update (Parallel Projections): For the hard constraints on $z$, the proximal operator at $v$ is the Euclidean projection onto the constraints, given by $z \leftarrow \operatorname{argmin}_z \|z - v\|_2^2$ s.t. $z_{\Omega_j} \in \mathcal{R}^{n_j}_\epsilon(y^{(j)})$ for all $j \in [d_2]$. These updates decouple along each entity (column) $z_{\Omega_j}$ and can be trivially parallelized. Efficient projections onto $\mathcal{R}^{n_j}_\epsilon(y^{(j)})$ are discussed in Section 4.1.

Algorithm 1 (Proximal Gradient Descent for (2)). Input: $\Omega$, $\{y^{(j)}\}_j$, $\epsilon$, and parameter $\lambda$. For $k = 0, 1, 2, \ldots$, until a stopping criterion is met:
$$X^{(k+1)} = S_{\lambda/2}\Big(X^{(k)} + \tfrac{1}{2}\,\mathcal{P}^*_\Omega\big(z^{(k)} - X^{(k)}_\Omega\big)\Big), \qquad (3)$$
$$\forall j,\quad z^{(k+1)}_{\Omega_j} = \operatorname{Proj}_{\mathcal{R}^{n_j}_\epsilon(y^{(j)})}\left(\frac{z^{(k)}_{\Omega_j} + X^{(k)}_{\Omega_j}}{2}\right). \qquad (4)$$

4.1 Projection onto $\mathcal{R}^n_\epsilon(y)$

We begin with the following definitions, which are used in characterizing $\mathcal{R}^n_\epsilon(y)$.

Definition 2 (Adjacent Difference Operator). The adjacent difference operator in $\mathbb{R}^n$, denoted by $D_n : \mathbb{R}^n \to \mathbb{R}^{n-1}$, is defined as $(D_n x)_i = x_i - x_{i+1}$, for $i \in [n-1]$.

Definition 3 (Incidence Matrix). For a directed graph $\mathcal{G}(V, E)$, the incidence matrix $A_\mathcal{G} \in \mathbb{R}^{|V|\times|E|}$ is such that if the $j$th directed edge $e_j \in E$ goes from the $i$th node to the $k$th node, then $(A_\mathcal{G})_{ij} = 1$, $(A_\mathcal{G})_{kj} = -1$, and $(A_\mathcal{G})_{lj} = 0$ for $l \ne i, k$.

Projection onto $\mathcal{R}^n_\epsilon(y)$ is closely related to the isotonic regression problem of finding a univariate least squares fit under consistent order constraints (without margins). This isotonic regression problem in $\mathbb{R}^n$ can be solved exactly in $O(n)$ time using the classical Pool of Adjacent Violators (PAV) algorithm [15; 4]:
$$\operatorname{PAV}(v) = \operatorname*{argmin}_{z\in\mathbb{R}^n} \|z - v\|^2 \quad \text{s.t.}\ z_i - z_{i+1} \le 0,\ i \in [n-1]. \qquad (5)$$
As we discuss below, simple adaptations of isotonic regression can be used for projection onto $\epsilon$-margin isotonic sets for the three special cases of interest, as summarized in Table 1.

(A) Strict Total Order: $y_{(1)} < y_{(2)} < \ldots < y_{(n)}$. In this setting, indexing the items so that $y_1 < y_2 < \ldots < y_n$, the constraint set can be characterized as $\mathcal{R}^n_\epsilon(y) = \{x : D_n x \le -\epsilon\mathbf{1}\}$, where $\mathbf{1}$ is the vector of ones. For this case, projection onto $\mathcal{R}^n_\epsilon(y)$ differs from (5) only in requiring an $\epsilon$ separation, and a straightforward extension of the PAV algorithm [4] can be used. Let $d^{sl} \in \mathbb{R}^n$ be any vector such that $D_n d^{sl} = -\mathbf{1}$ (e.g., $d^{sl} = (1, 2, \ldots, n)^\top$); then, by simple substitution, $\operatorname{Proj}_{\mathcal{R}^n_\epsilon(y)}(x) = \operatorname{PAV}(x - \epsilon d^{sl}) + \epsilon d^{sl}$.
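A minimal sketch of this case-(A) projection is given below (illustrative, not the authors' code): it implements PAV with a stack of pooled blocks and then applies the margin shift $\operatorname{Proj}(x) = \operatorname{PAV}(x - \epsilon d^{sl}) + \epsilon d^{sl}$, assuming the entries of $x$ are already indexed in increasing order of the observed scores $y$ and taking $d^{sl} = (1, 2, \ldots, n)$.

```python
import numpy as np

def pav(v):
    """Pool Adjacent Violators: L2 projection of v onto non-decreasing vectors."""
    sums, counts, means = [], [], []
    for value in v:
        s, c, m = float(value), 1, float(value)
        # merge with previous blocks while their means violate monotonicity
        while means and means[-1] > m:
            s += sums.pop()
            c += counts.pop()
            means.pop()
            m = s / c
        sums.append(s); counts.append(c); means.append(m)
    return np.concatenate([np.full(c, m) for c, m in zip(counts, means)])

def project_strict_total_order(x, eps):
    """Projection onto {z : z_i <= z_{i+1} - eps}, i.e., R^n_eps(y) when the
    entries of x are listed in increasing order of a strictly ordered y."""
    d_sl = eps * np.arange(1, len(x) + 1)      # eps * d^sl with unit adjacent steps
    return pav(np.asarray(x, dtype=float) - d_sl) + d_sl

print(project_strict_total_order([3.0, 1.0, 2.0], eps=0.5))   # -> [1.5, 2.0, 2.5]
```

The blockwise case (B), discussed next, reuses the same PAV routine after sorting within blocks and using $d^{bl}$ in place of $d^{sl}$.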
(B) Blockwise Total Order: $y_{(1)} \le y_{(2)} \le \ldots \le y_{(n)}$. This is a common setting for supervision in many preference completion applications, where listwise ranking preferences obtained from ratings over discrete quantized levels $1, 2, \ldots, K$, with $K \ll n$, are prevalent. Let $y$ be partitioned into $K \ll n$ blocks $P = \{P_1, P_2, \ldots, P_K\}$ such that the entries of $y$ within each partition are equal and the blocks themselves are strictly ordered, i.e., $\forall k \in [K]$, $\sup y(P_{k-1}) < \inf y(P_k) = \sup y(P_k) < \inf y(P_{k+1})$, where $P_0 = P_{K+1} = \emptyset$ and $y(P) = \{y_i : i \in P\}$. Let $d^{bl} \in \mathbb{R}^n$ with $d^{bl}_i = \sum_{k=1}^K k\,\mathbb{I}(i \in P_k)$ be the vector of block indices; when the items are indexed so that $y$ is non-decreasing, $d^{bl} = [1, 1, \ldots, 2, 2, \ldots, K, K, \ldots, K]^\top$. Let $\Pi_P$ be the set of valid permutations that permute entries only within the blocks $\{P_k \in P\}$; then $\mathcal{R}^n_\epsilon(y) = \{x : \exists\pi \in \Pi_P,\ D_n\pi(x) \le \epsilon D_n d^{bl}\}$. We propose the following steps to compute $\hat{z} = \operatorname{Proj}_{\mathcal{R}^n_\epsilon(y)}(x)$ in this case:

Step 1. Compute $\pi^*(x)$ such that $\forall k \in [K]$, $\pi^*(x)_{P_k} = \operatorname{sort}(x_{P_k})$.
Step 2. $\hat{z} = (\pi^*)^{-1}\big(\operatorname{PAV}(\pi^*(x) - \epsilon d^{bl}) + \epsilon d^{bl}\big). \qquad (6)$

The correctness of (6) is summarized by the following lemma.

Lemma 2. The estimate $\hat{z}$ from (6) is the unique minimizer of $\operatorname{argmin}_z \|z - x\|_2^2$ s.t. $\exists \pi \in \Pi_P : D_n\pi(z) \le \epsilon D_n d^{bl}$.

(C) Arbitrary DAG: $y = \mathcal{G}([n], E)$. An arbitrary DAG (not necessarily connected) can be used to represent any consistent order constraints over its vertices, e.g., partial rankings consolidated from multiple listwise/pairwise scores. In this case, the $\epsilon$-margin isotonic set is given by $\mathcal{R}^n_\epsilon(y) = \{x : A_\mathcal{G}^\top x \ge \epsilon\mathbf{1}\}$ (cf. Definition 3). Consider $d^{DAG} \in \mathbb{R}^n$ such that the $i$th entry $d^{DAG}_i$ is the length of the longest directed chain among the topological descendants of node $i$. It can be easily verified that the isotonic regression algorithm for arbitrary DAGs, applied to $x - \epsilon d^{DAG}$ with the order constraints induced by $\mathcal{G}$, gives the projection onto $\mathcal{R}^n_\epsilon(y)$. In this most general setting, the best isotonic regression algorithm for an exact solution requires $O(n^2 m + n^3\log n)$ computation [34], where $m$ is the number of edges in $\mathcal{G}$. While even in the best case of $m = o(n)$ the computation can be prohibitive, we include this case for completeness. We also note that this case of partial DAG ordering cannot be handled in the standard matrix completion setting without consolidating the partial ranks into a total order.

| Case | $\mathcal{R}^n_\epsilon(y)$ | $\operatorname{Proj}_{\mathcal{R}^n_\epsilon(y)}(x)$ | Computation |
|---|---|---|---|
| (A) | $\{x : D_n x \le -\epsilon\mathbf{1}\}$ | $\operatorname{PAV}(x - \epsilon d^{sl}) + \epsilon d^{sl}$ | $O(n)$ |
| (B) | $\{x : \exists\pi\in\Pi_P,\ D_n\pi(x) \le \epsilon D_n d^{bl}\}$ | $(\pi^*)^{-1}\big(\operatorname{PAV}(\pi^*(x) - \epsilon d^{bl}) + \epsilon d^{bl}\big)$ | $O(n\log n)$ |
| (C) | $\{x : A_\mathcal{G}^\top x \ge \epsilon\mathbf{1}\}$ | $\operatorname{IsoReg}(x - \epsilon d^{DAG}, \mathcal{G}) + \epsilon d^{DAG}$ [34] | $O(n^2 m + n^3\log n)$ |

Table 1: Summary of algorithms for $\operatorname{Proj}_{\mathcal{R}^n_\epsilon(y)}(x)$.

4.2 Computational Complexity

It can be easily verified that the gradient of $\frac{1}{2}\|\mathcal{P}_\Omega(X) - z\|_2^2$ is 2-Lipschitz continuous. Thus, from standard convergence results for proximal gradient descent [30], Algorithm 1 converges to within a tolerance $\epsilon_0$ of the optimal objective in $O(1/\epsilon_0)$ iterations. Compared to proximal algorithms for standard matrix completion [5; 27], the additional complexity in Algorithm 1 arises in the $z$ update (4), which reduces to the simple substitution $z^{(k)} = X^{(k)}_\Omega$ in standard matrix completion. For total orders, the $z$ update (4) is highly efficient, and is asymptotically within an additional $\log|\Omega|$ factor of the computational cost of standard matrix completion.
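Putting the updates (3) and (4) together, the following sketch illustrates Algorithm 1 for the strict-total-order case, reusing `pav` and `project_strict_total_order` from the earlier sketch. It is an illustrative implementation under those assumptions (dictionary-based bookkeeping, a dense SVD per iteration, a fixed iteration count), not the authors' code.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding S_tau(M): soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def retargeted_matrix_completion(d1, d2, omega_cols, y_cols, lam, eps, n_iters=500):
    """Proximal gradient descent (Algorithm 1) when every column is a strict total order.

    omega_cols[j] : observed row indices I_j for column j
    y_cols[j]     : observed scores for column j (used only through their ordering)
    """
    X = np.zeros((d1, d2))
    z = {j: np.zeros(len(rows)) for j, rows in omega_cols.items()}
    for _ in range(n_iters):
        # update (3): gradient step on the data-fit term, then prox of lam * ||.||_*
        G = np.zeros_like(X)
        for j, rows in omega_cols.items():
            G[rows, j] = z[j] - X[rows, j]            # P_Omega^*(z - X_Omega)
        X = svt(X + 0.5 * G, lam / 2.0)
        # update (4): parallel projections of (z + X_Omega)/2 onto R^{n_j}_eps(y^(j))
        for j, rows in omega_cols.items():
            v = 0.5 * (z[j] + X[rows, j])
            order = np.argsort(y_cols[j])             # ascending order of observed scores
            projected = np.empty_like(v)
            projected[order] = project_strict_total_order(v[order], eps)
            z[j] = projected
    return X
```

For blockwise or DAG supervision, only the projection step changes, as summarized in Table 1.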
5 Generalization Error

Recall that $y^{(j)}$ are (noisy) partial rankings of a subset of items for each user, obtained from $g_j(\Theta^*_j + W_j)$, where $W$ is a noise matrix and the $g_j$ are unknown and arbitrary transformations that only preserve the ranking order within each column. The estimator and the algorithms described so far are independent of the sampling distribution generating $(\Omega, \{y^{(j)}\})$. In this section we provide simple generalization error bounds for (2).

Assumption 1 (Sampling $P_\Omega$). For a fixed $W$ and $\Theta^*$, we assume the following sampling distribution. Let $c_0$ be a fixed constant and let $R$ be a pre specified parameter denoting the length of a single listwise observation. For $s = 1, 2, \ldots, |S| = c_0 d_2\log d_2$:
$$j(s) \sim \operatorname{uniform}[d_2],\quad I(s) \sim \operatorname{randsample}([d_1], R),\quad \Omega(s) = \{(i, j(s)) : i \in I(s)\},\quad y(s) = g_{j(s)}\big(\mathcal{P}_{\Omega(s)}(\Theta^* + W)\big). \qquad (7)$$
Further, we define the notation: $\forall j$, $I_j = \bigcup_{s:j(s)=j} I(s)$, $\Omega_j = \bigcup_{s:j(s)=j}\Omega(s)$, and $n_j = |\Omega_j|$.

For each column $j$, the listwise scores $\{y(s) : j(s) = j\}$ jointly define a consistent partial ranking of $I_j$, as the scores are subsets of a monotonically transformed preference vector $g_j(\Theta^*_j + W_j)$. This consistent ordering is represented by a DAG $y^{(j)} = \operatorname{PartialOrder}(\{y(s) : j(s) = j\})$. We also note that $O(d_2\log d_2)$ samples ensure that each column is included in the sampling with high probability.

Definition 4 (Projection Loss). Let $y = \mathcal{G}([n], E)$ or $y \in \mathbb{R}^n$ define a partial or total order over $[n]$, respectively. We define the following convex surrogate loss over partial rankings:
$$\Phi(x, y) = \min_{z\in\mathcal{R}^n_\epsilon(y)} \|x - z\|_2.$$

Theorem 3 (Generalization Bound). Let $\hat{X}$ be an estimate from (2), scaled so that $\|\hat{X}\|_F = 1$. Then there exist constants $K_1, K_2$ such that, with probability greater than $1-\delta$ over the observed rankings $\{y^{(j)}, \Omega_j : j \in [d_2]\}$ drawn from (7) with $|S| \ge c_0 d_2\log d_2$,
$$\mathbb{E}_{y(s),\Omega(s)}\,\Phi\big(\hat{X}_{\Omega(s)}, y(s)\big) \;\le\; \frac{1}{|S|}\sum_{s=1}^{|S|}\Phi\big(\hat{X}_{\Omega(s)}, y(s)\big) \;+\; K_1\,\frac{\|\hat{X}\|_*\,\log^{1/4}(d_1+d_2)}{\sqrt{d_1 d_2}} \;+\; K_2\sqrt{\frac{\log(1/\delta)}{|S|}}.$$

Theorem 3 quantifies the expected test projection loss over a random list of $R$ items $I(s)$ drawn for a random entity/user $j(s)$. The bound captures a trade off between the observable training error and a complexity term defined by the nuclear norm of the estimate.

6 Experiments

We evaluate our model on two collaborative preference estimation tasks: (a) a standard user-item recommendation task on a benchmark dataset from Movielens, and (b) identifying associations between brain regions and cognitive terms using the neurosynth dataset [37].

Baselines: The following models are compared in our experiments:

- Retargeted Matrix Completion (RMC): the estimator proposed in (2).
- Standard Matrix Completion (SMC) [8]: the standard convex estimator for matrix completion using nuclear norm minimization, which we primarily compare against.
- Collaborative Filtering Ranking (CoFiRank) [35]: addresses the collaborative filtering task in a listwise ranking setting.

For RMC and SMC, the hyperparameters were tuned using grid search on a logarithmic scale. Due to the high computational cost of tuning parameters in CoFiRank, we use the code and default parameters provided by the authors.

Evaluation metrics: Performance on the preference estimation tasks is evaluated on four ranking metrics: (a) Normalized Discounted Cumulative Gain (NDCG@N), (b) Precision@N, (c) Spearman's rho, and (d) Kendall's tau. The latter two metrics measure the correlation of the complete ordering of the list, while the former two primarily evaluate the correctness of the ranking at the top of the list (see Liu [25] for further details on these metrics).

Movielens dataset (blockwise total order): Movielens is a movie recommendation website administered by GroupLens Research. We used the widely benchmarked MovieLens 100K dataset, with the 5 fold train/test splits provided with the dataset (the test splits are non-overlapping). We discarded a small number of users that had fewer than 10 ratings in any of the 5 training data splits. The resulting dataset consists of 923 users and 1682 items. The ratings are blockwise ordered, taking one of 5 values in the set {1, 2, . . . , 5}. During testing, for each user, the competing models return a ranking of the test items, and the performance is averaged across test users. Table 2 presents the results of our evaluation averaged across the 5 train/test splits on the Movielens dataset, along with the standard deviation. We see that the proposed retargeted matrix completion (RMC) consistently and significantly outperforms SMC and CoFiRank [35] across ranking metrics.

| | NDCG@5 | Precision@5 | Spearman Rho | Kendall Tau |
|---|---|---|---|---|
| RMC | 0.7984 (0.0213) | 0.7546 (0.0320) | 0.4137 (0.0099) | 0.3383 (0.0117) |
| SMC | 0.7863 (0.0243) | 0.7429 (0.0295) | 0.3722 (0.0106) | 0.3031 (0.0117) |
| CoFiRank | 0.7731 (0.0213) | 0.7314 (0.0293) | 0.3681 (0.0082) | 0.2993 (0.0110) |

Table 2: Ranking performance for recommendations on MovieLens 100K. The table shows the mean and standard deviation over 5 fold train/test splits. For all reported metrics, higher values are better [25].
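For reference, the ranking metrics reported above can be computed per user as in the sketch below (not the paper's evaluation code: the exponential-gain, logarithmic-discount convention for NDCG is one common choice, and the correlation metrics use scipy's implementations).

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

def ndcg_at_n(true_scores, pred_scores, n=5):
    """NDCG@N with gain 2^relevance - 1 and 1/log2(position + 1) discount."""
    true_scores = np.asarray(true_scores, dtype=float)
    top_pred = np.argsort(pred_scores)[::-1][:n]       # items ranked by prediction
    top_true = np.argsort(true_scores)[::-1][:n]       # ideal ranking by true score
    discounts = 1.0 / np.log2(np.arange(2, len(top_pred) + 2))
    dcg = np.sum((2.0 ** true_scores[top_pred] - 1.0) * discounts)
    idcg = np.sum((2.0 ** true_scores[top_true] - 1.0) * discounts)
    return dcg / idcg if idcg > 0 else 0.0

# toy example: held-out ratings and predicted scores for one user
true_scores = np.array([5, 3, 4, 1, 2])
pred_scores = np.array([4.2, 2.5, 3.7, 1.1, 2.9])
print(ndcg_at_n(true_scores, pred_scores, n=5))
print(spearmanr(true_scores, pred_scores).correlation,
      kendalltau(true_scores, pred_scores).correlation)
```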
Neurosynth dataset (almost total order): Neurosynth [37] is a publicly available database consisting of data automatically extracted from a large collection of functional magnetic resonance imaging (fMRI) publications (11,362 publications in the current version). For each publication, the database contains the abstract text and all reported 3-dimensional peak activation coordinates in the study. The text is pre-processed to remove common stop words and any terms with less than 0.1% frequency, leaving a total of 3169 terms. We applied the standard brain map to the activations, removing voxels outside of the grey matter. Next, the activations were downsampled from 2mm^3 voxels to 10mm^3 voxels using the nilearn python package, resulting in a total of 1231 dense voxels. The affinity measure between the 3169 terms and 1231 consolidated voxels is obtained by multiplying the term-publication and publication-voxel matrices. The resulting data is a dense, high rank preference matrix. With very few tied preference values, this setting best fits the case of totally ordered observations (case A in Section 4.1). Using this data, we consider the reverse inference task of ranking cognitive concepts (terms) for each brain region (voxel) [37].

Train-val-test: We used 10% of randomly sampled entries of the matrix as test data and another 10% for validation. We created training datasets at various sample sizes by subsampling from the remaining 80% of the data. This random split is replicated multiple times to obtain 3 bootstrapped data splits (note that, unlike cross validation, the test datasets here can have some overlapping entries); a small illustrative sketch of this construction follows below. The results in Fig. 1 show that the proposed estimate from (2) outperforms standard matrix completion in terms of popular ranking metrics.

Figure 1: Ranking performance for reverse inference on Neurosynth data. The x-axis denotes the fraction of the affinity matrix entries used as observations in training. Plots show the mean, with error bars for the standard deviation, over 3 bootstrapped train/test splits. For all the reported ranking metrics, higher values are better [25].
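The affinity matrix construction and the entry-wise splits described above can be sketched as follows (illustrative only; the placeholder arrays stand in for the preprocessed Neurosynth data, whose actual sizes, thresholds, and masking are as described in the text and are not reproduced here).

```python
import numpy as np

rng = np.random.default_rng(0)

# placeholder factor matrices standing in for the preprocessed Neurosynth data
# (real data: 3169 terms, 11362 publications, 1231 grey-matter voxels)
n_terms, n_pubs, n_voxels = 300, 1000, 120
term_pub = rng.random((n_terms, n_pubs))
pub_voxel = rng.random((n_pubs, n_voxels))

# dense term-voxel affinity matrix used as the (high rank) preference matrix
affinity = term_pub @ pub_voxel

# random 10% test / 10% validation split over entries; training is subsampled
# at various sizes from the remaining 80%
n_entries = affinity.size
perm = rng.permutation(n_entries)
n_test = n_val = n_entries // 10
test_idx = np.unravel_index(perm[:n_test], affinity.shape)
val_idx = np.unravel_index(perm[n_test:n_test + n_val], affinity.shape)
train_pool = perm[n_test + n_val:]

train_fraction = 0.3
n_train = int(train_fraction * len(train_pool))
train_idx = np.unravel_index(train_pool[:n_train], affinity.shape)
```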
7 Conclusion

Our work addresses the problem of collaborative ranking, a task of growing importance in modern recommender systems, large scale meta-analysis, and related areas. We proposed a novel convex estimator for collaborative LETOR from sparsely observed preferences, where the observations can be either score vectors representing total orders or, more generally, directed acyclic graphs representing partial orders. Remarkably, in the case of complete orders, the complexity of our algorithm is within a log factor of the state of the art algorithms for standard matrix completion. Our estimator was empirically evaluated on real data experiments.

Acknowledgments

SG and JG acknowledge funding from NSF grants IIS-1421729 and SCH-1418511.

References

[1] S. Acharyya and J. Ghosh. MEMR: A margin equipped monotone retargeting framework for ranking. In UAI, 2013.
[2] S. Acharyya, O. Koyejo, and J. Ghosh. Learning to rank with Bregman divergences and monotone retargeting. In UAI, 2012.
[3] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 2003.
[4] M. J. Best and N. Chakravarti. Active set algorithms for isotonic regression; a unifying framework. Math. Program., 1990.
[5] J. F. Cai, E. J. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM J. Optim., 2010.
[6] E. J. Candès and Y. Plan. Matrix completion with noise. Proc. IEEE, 2010.
[7] E. J. Candès and B. Recht. Exact matrix completion via convex optimization. FoCM, 2009.
[8] E. J. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inf. Theory, 2006.
[9] Z. Cao, T. Qin, T. Y. Liu, M. F. Tsai, and H. Li. Learning to rank: from pairwise approach to listwise approach. In ICML, 2007.
[10] E. Chi, H. Zhou, G. Chen, D. O. Del Vecchyo, and K. Lange. Genotype imputation via matrix completion. Genome Res., 2013.
[11] P. Cremonesi, Y. Koren, and R. Turrin. Performance of recommender algorithms on top-N recommendation tasks. In RecSys. ACM, 2010.
[12] J. C. Duchi, L. W. Mackey, and M. I. Jordan. On the consistency of ranking algorithms. In ICML, 2010.
[13] R. Ganti, L. Balzano, and R. Willett. Matrix completion under monotonic single index models. In NIPS, 2015.
[14] D. Goldberg, D. Nichols, B. Oki, and D. Terry. Using collaborative filtering to weave an information tapestry. Commun. ACM, 1992.
[15] S. J. Grotzinger and C. Witzgall. Projections onto order simplexes. Appl. Math. Optim., 1984.
[16] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. In NIPS, 1999.
[17] J. L. Horowitz. Semiparametric and nonparametric methods in econometrics, volume 12. Springer, 2009.
[18] T. Joachims. Optimizing search engines using clickthrough data. In SIGKDD, 2002.
[19] S. M. Kakade, V. Kanade, O. Shamir, and A. Kalai. Efficient learning of generalized linear and single index models with isotonic regression. In NIPS, 2011.
[20] A. T. Kalai and R. Sastry. The isotron algorithm: High-dimensional isotonic regression. In COLT, 2009.
[21] R. H. Keshavan, A. Montanari, and S. Oh. Matrix completion from a few entries. IEEE Trans. Inf. Theory, 2010.
[22] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. IEEE Computer, 2009.
[23] O. Koyejo, S. Acharyya, and J. Ghosh. Retargeted matrix factorization for collaborative filtering. In RecSys, 2013.
[24] P. Li, Q. Wu, and C. J. Burges. McRank: Learning to rank using multiple classification and gradient boosting. In NIPS, 2007.
[25] T. Y. Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 2009.
[26] Y. Lu and S. N. Negahban. Individualized rank aggregation using nuclear norm regularization. In Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2015.
[27] S. Ma, D. Goldfarb, and L. Chen. Fixed point and Bregman iterative methods for matrix rank minimization. Math. Program., 2011.
[28] A. Mnih and R. Salakhutdinov. Probabilistic matrix factorization. In NIPS, 2007.
[29] S. Oh, K. K. Thekumparampil, and J. Xu. Collaboratively learning preferences from ordinal data. In NIPS, 2015.
[30] N. Parikh and S. P. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 2014.
[31] D. Park, J. Neeman, J. Zhang, S. Sanghavi, and I. Dhillon. Preference completion: Large-scale collaborative ranking from pairwise comparisons. In ICML, 2015.
[32] B. Recht. A simpler approach to matrix completion. JMLR, 2011.
[33] H. Steck. Training and testing of recommender systems on data missing not at random. In KDD. ACM, 2010.
[34] Q. F. Stout. Isotonic regression via partitioning. Algorithmica, 2013.
[35] M. Weimer, A. Karatzoglou, Q. V. Le, and A. J. Smola. COFIRANK: Maximum margin matrix factorization for collaborative ranking. In NIPS, 2008.
[36] T. Yarkoni, R. A. Poldrack, T. E. Nichols, D. C. Van Essen, and T. D. Wager. Large-scale automated synthesis of human functional neuroimaging data. Nat. Methods, 2011.
[37] T. Yarkoni. Neurosynth. http://neurosynth.org/, 2011.
[38] Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan. Large-scale parallel collaborative filtering for the Netflix prize. In Algorithmic Aspects in Information and Management, LNCS 5034, 2008.