# learning_from_comparisons_and_choices__5f9d3c3e.pdf Journal of Machine Learning Research 19 (2018) 1-95 Submitted 10/17; Revised 7/18; Published 8/18 Learning from Comparisons and Choices Sahand Negahban sahand.negahban@yale.edu Statistics Department Yale University Sewoong Oh swoh@illinois.edu Department of Industrial and Enterprise Systems Engineering University of Illinois at Urbana-Champaign Kiran K. Thekumparampil thekump2@illinois.edu Department of Electrical and Computer Engineering University of Illinois at Urbana-Champaign Jiaming Xu xu972@purdue.edu Krannert School of Management Purdue University Editor: Qiang Liu When tracking user-specific online activities, each user s preference is revealed in the form of choices and comparisons. For example, a user s purchase history is a record of her choices, i.e. which item was chosen among a subset of offerings. A user s preferences can be observed either explicitly as in movie ratings or implicitly as in viewing times of news articles. Given such individualized ordinal data in the form of comparisons and choices, we address the problem of collaboratively learning representations of the users and the items. The learned features can be used to predict a user s preference of an unseen item to be used in recommendation systems. This also allows one to compute similarities among users and items to be used for categorization and search. Motivated by the empirical successes of the Multi Nomial Logit (MNL) model in marketing and transportation, and also more recent successes in word embedding and crowdsourced image embedding, we pose this problem as learning the MNL model parameters that best explain the data. We propose a convex relaxation for learning the MNL model, and show that it is minimax optimal up to a logarithmic factor by comparing its performance to a fundamental lower bound. This characterizes the minimax sample complexity of the problem, and proves that the proposed estimator cannot be improved upon other than by a logarithmic factor. Further, the analysis identifies how the accuracy depends on the topology of sampling via the spectrum of the sampling graph. This provides a guideline for designing surveys when one can choose which items are to be compared. This is accompanied by numerical simulations on synthetic and real data sets, confirming our theoretical predictions. Keywords: Collaborative Ranking, Nuclear Norm Minimization, Multi-Nomial Logit Model 1. Introduction Given data on how users compared subsets of items, we address the fundamental problem of learning a representation of users and items. Such data can be observed in the form c 2018 Sahand Negahban, Sewoong Oh, Kiran K. Thekumparampil, and Jiaming Xu. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v19/17-607.html. Negahban, Oh, Thekumparampil, and Xu of choices (e.g. which item was bought) or in the form of comparisons (e.g. which items are rated higher). From such ordinal data on the items, we want to find low dimensional representations, which we call (latent) features, that explain crucial aspects of the users choices. Once learned, these features can be used to predict each user s preference over items that the user has not seen yet, which can be used in recommendation systems and revenue management. These learned features also provide an embedding of the users and items on the same Euclidean space that allows us to directly quantify similarities via distances, that can be used to categorize and cluster. These embeddings can reveal the underlying structure of data such as images. Such an embedding of a discrete set of objects based on ordinal data has recently gained tremendous attraction mainly due to word embeddings based on co-occurrence data and their successes in numerous downstream natural language processing tasks Mikolov et al. (2013b). The fundamental question in such a representation learning is: what makes one representation better than the others? Our guiding principle is that a good representation is the one that defines a generative model that best explains the given data in the maximum likelihood sense. To this end, we focus on a parametric generative model known as Multi Nomial Logit (MNL) model, widely used and studied in revenue management. The MNL model has a natural interpretation of human choices as an outcome of maximizing a utility by agents with noisy perception of the utility, also known as random utility model in Walker and Ben-Akiva (2002); Azari Soufiani et al. (2012), defined as follows. Each user and item has a latent low-dimensional feature ui Rr and vj Rr respectively. The true utility of an item is the inner product of these two features Θij ui, vj = P k uikvjk. The inherent low-rank structure of Θ = [Θij] captures the collaborative nature of the problem, where users with similar preferences in the past are likely to prefer similar items in the future. When presented with a set of items, a user reveals a noisy ordering of the items sorted according to her perceived utilities of the items, each of which is perturbed by an i.i.d. noise added to the true utility Θij. The MNL model is a special case where the noise follows the standard Gumbel distribution, and is one of the most popular models in choice theory for its simplicity and empirical success Mc Fadden (1973); Mc Fadden and Train (2000). The MNL model has several important properties, making this model realistic in various domains, including marketing Guadagni and Little (1983), transportation Mc Fadden (1980); Ben-Akiva and Lerman (1985), biology Sham and Curtis (1995), sports games Tsokos et al. (2018) and natural language processing Mikolov et al. (2013a). The MNL model (i) satisfies the independence of irrelevant alternatives in social choice theory Ray (1973); (ii) has a maximum likelihood estimator (MLE) which is a convex program in Θ; and (iii) has a simple characterization of sequential (random) choices as follows. Let P {a > {b, c, d}} denote the probability a was chosen as the best alternative among the set {a, b, c, d}. Then, the probability that user i reveals a linear order (a > b > c > d) is P {a > {b, c, d}}P {b > {c, d}}P {c > d}, where P {a > {b, c, d}} = eΘia/(eΘia +eΘib +eΘic +eΘid). Essentially the user is modeled as making a sequence of choices, choosing the best alternative first and then making choices on the remaining ones. We give the precise definition of the MNL model in Section 2 for pairwise comparisons and in Section 4 for higher order comparisons and choices. Beyond its success in classical applications such as transportation and marketing, the MNL model and its variants are being rediscovered and successfully applied in more modern applications Learning from Comparisons and Choices such as embedding images using crowdsourcing Tamuz et al. (2011) and word embedding Mikolov et al. (2013b), whose connections we make precise in Section 6. Motivated by recent advances in learning low-rank models, e.g. Negahban et al. (2009); Davenport et al. (2014), we ask the fundamental question of learning the MNL model from data on comparisons and choices. We provide a general framework using convex relaxations for learning the model. As data is collected in various forms on modern social computing systems, we consider the following four canonical scenarios: Pairwise comparisons. The most simple and canonical piece of ordinal data one can collect from a user at a time is a pairwise comparison; given two options, we ask the user which one is better. Such data is prevalent in the real world and is the most popular scenario studied in ranking literature, e.g. Shah et al. (2014). However, one significant aspect of the real data that has not been addressed in the literature is irregularities in the sampling. Consider an online seller with various products, say cars and watches. It does not make sense to ask a user to compare a car and a watch; one cannot sample an outcome of a comparison between a watch and a car. However, knowing a user s preference on cars can help in learning her preference on watches. We want to propose a model and design an inference algorithm that can take into account such restrictions in sampling. We further want to quantify the gain in using all such data together in inference, as opposed to running inference in each category separately. To this end, we propose a new model for sampling that we call graph sampling. This model explains such irregularities in the real world data. We propose a novel inference algorithm tailored for the given sampling pattern. Our analysis captures precisely how the accuracy depends on the different topologies of the sampling. Higher order comparisons. Consider an online market that collects each of its user s preference as a ranking over a subset of items that is seen by the user. Such data can be obtained by directly asking to compare some items, or by indirectly tracking online activities on which items are viewed, how much time is spent on the page or how the user rated the items. However, collecting such comparisons over multiple items might come at a cost. We, therefore, want to quantify the gain in the accuracy of the inference when higher order comparison outcomes are collected. We characterize the optimal trade-offbetween accuracy and the number of items compared, and show that our proposed algorithm seamlessly generalizes to this setting and also achieves the optimal trade-off. Customer choices. One of the most widely applicable data collection scenarios is customer purchase history. Online and offline service providers can track each customer on which subset of items is offered and which item is chosen. Given historical data on such choices on best-out-of-a-subset, we extract features on the users and items that best explains the collected data. Bundled choices. Another data collection scenario that is gaining interest recently is bundled choices Chu et al. (2011); Benson et al. (2018). Typical choice models assume that the willingness to buy an item is independent of what else the user bought. In many cases, however, we make bundled purchases: we buy particular ingredients Negahban, Oh, Thekumparampil, and Xu together for one recipe or we buy two connecting flights. One choice (the first flight) has a significant impact on the other (the connecting flight). In order to optimize the assortment (which flight schedules to offer) for maximum expected revenue, it is crucial to accurately predict the willingness of the consumers to purchase bundled items, based on past history. We propose a model that can capture such interacting preferences for bundled items (e.g. jeans and shirts), and use this model to extract the features of the items in each category from historical bundled purchase data. Both our inference algorithm and the analyses extend to this setting, achieving the optimal trade-offbetween sample size and accuracy. Contribution. We first study the canonical scenario of pairwise comparisons from the MNL model in Section 3. Our contribution in the modeling is a new sampling scenario we call graph sampling that captures how different pairs of items have varying likelihood of being compared together. Our algorithmic contribution is a convex relaxation with a new regularizer using a variation of the standard nuclear norm tailored for the graph sampling topology. Our theoretical contribution is in the analysis of the proposed estimator and a matching fundamental lower bound (up to a poly-logarithmic factor). This (a) characterizes the minimax sample complexity of the problem; (b) proves that the proposed estimator cannot be improved upon; and (c) identifies how the accuracy depends on the topology of sampling. This in turn provides a guideline for designing surveys when one has a choice on which pairs are to be compared. This is accompanied by experiments on synthetic and real data sets confirming our theoretical predictions. This framework is extended to higher order comparisons in Section 4. We establish minimax optimality (up to a poly-logarithmic factor) of our estimator and identify the fundamental trade-offbetween accuracy and sample size. When each user provides a total linear ordering among k items, we show that the required sample size effectively is reduced by a factor of k. When the user provides her best choice (as in purchase history) instead of the total linear ordering, we extend our framework and establish minimax optimality in Section 5.2. We also consider a bundled purchase scenario in Section 5, where customers buy pairs of items from each of the two categories. We extend our framework and establish minimax optimality under the bundled purchase setting. We present experimental results on both synthetic and real-world data sets confirming our theoretical predictions and showing the improvement of the proposed approach in predicting users choices1. Technically, we borrow analysis tools from 1-bit matrix completion Davenport et al. (2014), matrix completion Negahban and Wainwright (2012), and restricted strong convexity Negahban et al. (2009), and crucially utilize the Random Utility Model (RUM) Thurstone (1927); Marschak (1960); Luce (1959) interpretation (outlined in Section 2.1) of the MNL model to prove both the upper bound and the fundamental limit. This could be of interest to analyzing more general class of RUMs. Notations. We use |||A|||F and |||A||| to denote the Frobenius norm and the ℓ norm, |||A|||nuc = P i σi(A) to denote the nuclear norm where σi(A) denotes the i-th singular value, and |||A|||2 = σ1(A) for the spectral norm. We use u, v = P i uivi and u to 1. Code for our experiments are available at https://github.com/POLane16/Nucnorm-Ranking. Learning from Comparisons and Choices denote the inner product and the Euclidean norm. All ones vector is denoted by 1, I denotes the identity matrix and I(A) is the indicator function of the event A. The set of the first N integers are denoted by [N] = {1, . . . , N}. 1.1 Related Work Bradley-Terry and Plackett-Luce models. The simplest form of the MNL model is when all users are sharing the same feature vector such that each item is parametrized by a scalar value. This is known as Bradley-Terry (BT) model when pairwise comparisons are concerned and Plackett-Luce (PL) model when higher order comparisons are concerned. This has been proposed and rediscovered several times in the last century Zermelo (1929); Thurstone (1927); Bradley and Terry (1955); Luce (1959); Plackett (1975); Mc Fadden (1973, 1980) in the context of ranking teams in sports games, ranking items based on surveys, and ranking routes in transportation systems. Unlike the general MNL model, maximum likelihood estimator for the BT and PL models are naturally convex programs. However, learning the BT model has first been addressed in Jr. (1957) where the convergence of the iterative algorithm is analyzed, without explicitly relying on the convexity of the problem. A new algorithm based on Majorize-Minimize framework was proposed in Hunter (2004). First sample complexity of learning BT model was provided in Negahban et al. (2012) where a novel estimator, called Rank Centrality, of the BT parameters was proposed. The authors construct a random walk over a graph where the nodes are the items and the transition probability is constructed from the comparisons outcomes. This spectral approach is proven to achieve a minimax optimal sample complexity. This has been a building block for several ranking algorithms, which further process the Rank Centrality to get better accuracy on top of it Chen and Suh (2015); Jang et al. (2016, 2017); Chen et al. (2017). For higher order comparisons, the sample complexity of learning PL model was provided in Hajek et al. (2014); Shah et al. (2014), where the Maximum Likelihood (ML) estimator is shown to achieve the minimax optimality. Later, Maystre and Grossglauser (2015) made the connection between the spectral approach of Rank Centrality and the ML estimator precise by providing a unifying random walk view to the problem. This led to a novel Accelerated Spectral Ranking algorithm introduced in Agarwal et al. (2018), which not only finds the parameters of the PL model more efficiently in computation, but also achieves optimal sample complexity under general sampling graphs. Recently, Borkar et al. (2016) treat the learning problem as solving a noisy linear system, and propose an algorithm that is amenable to on-line, distributed and asynchronous variants. Vojnovic and Yun (2016) analyzes a more general class of random utility models known as Thurstone models, and provide the minimax sample complexity by analyzing the ML estimator. Note that the ML estimators for Thurstone models in general are computationally intractable. Generalized BT and PL models. As studied in Rajkumar and Agarwal (2014), the BT model covers a subset of probabilistic models over comparisons. There is a hierarchy of models with increasing complexity and descriptive power. One popular extension is the mixture of BT or PL models. It is known that any choice model can be approximated arbitrarily close with a mixed PL model with sufficient number of mixture components Mc Fadden and Train (2000). The sample complexity of learning a mixed PL model was analyzed in Oh and Shah (2014) where a tensor decomposition for learning a mixture model Negahban, Oh, Thekumparampil, and Xu was proposed and analyzed under some separation conditions between the weights of the mixtures. For a mixture of two PL model, Chierichetti et al. (2018) shows identifiability and uniqueness of the mixture weights, when all marginal probability over all possible rankings among two items and three items are known. In a crowdsourced setting, Chen et al. (2013) models pairwise comparisons using a mixture of PL models consisting of hammer distribution, which reports the true output of a comparison, and spammer distribution, which reports the exact opposite of a comparison. A different approach that tackles the problem by learning to cluster the users based on the pairwise comparisons is proposed in Rui et al. (2015). The MNL model we study in this paper can be thought of as a generalization of the mixed PL models, where each user has her own preference. To make learning feasible, we inherently impose similarities among users via a low-rank condition. Note that a mixed PL model with r mixture is a special case of the MNL model with rank r, where each user s membership is encoded as a r-dimensional feature in standard basis. In the context of collaborative ranking, algorithms for learning the MNL model from pairwise comparisons have been proposed in Park et al. (2015). Instead of nuclear norm regularization as we propose in this paper, Park et al. (2015) proposes solving a convex relaxation of maximizing the likelihood over matrices with bounded nuclear norm. Under the standard assumption of uniformly chosen pairs, it is shown that this approach achieves statistically optimal generalization error rate, instead of Frobenius norm error that we analyze. Beyond BT and PL models. Modeling choice is an important problem where the ultimate goal is to find the right parametric model to capture human choices. Ragain and Ugander (2016); Blanchet et al. (2013) use Markov chains to model choices with the parameters in the transition matrix defining the probability model. Ideal point model Massimino and Davenport (2018); Kazemi et al. (2018) assumes that the pairwise comparisons of two items by a user depends on their distance from an ideal item (ideal point) for the user in some metric embedding space of the items. Novel nonparametric models have also been proposed to model human choices, for example Shah et al. (2016b); Pananjady et al. (2017); Falahatgar et al. (2018) uses strong stochastic transitivity to model pairwise choices and Farias et al. (2009) uses distribution over all permutations with sparse support to model higher order choices. We also note that in the context of (non-collaborative) ranking, Gleich and Lim (2011) has proposed nuclear norm minimization based algorithm when comparisons between all pairs items are modeled as a low-rank skew-symmetric matrix. Other nonparametric approaches to solving ranking include empirical risk minimization. Cl emen con et al. (2005) analyses risk minimization of U-statistics and a more feasible surrogate convex loss minimization to estimate ranking. Katz-Samuels and Scott (2017) assumes that rating of an item by a user is a Lipschitz function of the user-item pair and analyses a nonparametric collaborative ranking algorithm from partial observation of such ratings. While we are interested in the (parameters of) full ranking over all items, there have been several recent works which aim to only approximately rank the items, such as retrieving only the top-m items Chen and Suh (2015); Jang et al. (2016, 2017) or partitioning the items into ordered buckets of fixed size Katariya et al. (2018); Heckel et al. (2018). Learning from Comparisons and Choices 2. Model and Approach for Pairwise Comparisons The Multi Nomial Logit (MNL) model is one of the most popular models that explains how people make choices when given multiple options and is widely used in behavioral psychology and revenue management. For brevity, we focus our discussion on data collected in the form of pairwise comparisons in Sections 2 and 3, and defer the discussion of the MNL model in its full generality to Sections 4 and 5 . We give a precise definition of the model for paired comparisons and provide a novel algorithmic solution to learn this model from samples. 2.1 Multi Nomial Logit (MNL) Model for Pairwise Comparisons Let Θ be a d1 d2 dimensional matrix capturing preferences of d1 users on d2 items. The probability with which a user, i [d1], when presented with two items j1, j2 [d2], prefers item j1 over item j2 is, P {j1 > j2} = eΘ ij1 eΘ ij1 + eΘ ij2 . (1) This implies that, more preferred items (as per the ordering of Θ ij) are more likely to be ranked higher, with the randomness in choices captured by the probabilistic model. If we do not impose any further constraints on Θ , one entry of Θ is not related in any way to any other entries. This implies that one user s preference is completely independent of others and no efficient learning is possible. Each user s preference has to be learned separately. On the other hand, in real applications, it is reasonable to say that preferences of users depend only on a handful of factors for example, quality, price, and aesthetics. We do not know which features affect users choices, but we assume that there are r-dimensional latent features for each of the users and items that govern such choices, and that r d1, d2. This assumption mathematically captures the conventional belief that when two people have similar preferences over a subset of items, they tend to have similar tastes on other items as well. Formally, MNL model assumes that Θ is a rank r matrix with r d1, d2. In this paper, we do not impose a hard constraint on the rank and provide general results for matrices of any rank. In this case, we identify how the accuracy depends on the rate of decay of the singular values. This MNL model has many roots. In revenue management, this has been proposed as a special case of Random Utility Model (RUM). RUM explains choices that a person makes as the result of maximizing perceived random utilities associated with the set of alternatives presented. In the case of MNL, each decision maker and each alternative are associated with an r-dimensional vector, ui and vj, resulting in a low-rank Θ if Θ ij = ui, vj . The perceived utility of the item j for decision maker i is, Uij = ui, vj + ξij , (2) where ξij s are i.i.d. random variables following the standard Gumbel distribution. Different choices of distributions give different variants of RUMs. In our analyses, we utilize this RUM interpretation of the MNL model to prove a particular concentration in Section C.4, for example. The model in Equation. (1) has also been re-discoverd several times in the literature Zermelo (1929); Thurstone (1927); Luce (1959); Bradley and Terry (1955) in several domains. Negahban, Oh, Thekumparampil, and Xu 2.2 Low-rank Regularization using Nuclear Norm Minimization Given the low-rank structure of the model, a natural but inefficient approach is to minimize the negative of the log likelihood, L( ), regularized by the rank: bΘ arg min Θ Rd1 d2 L(Θ) + λ rank(Θ), (3) for some parameter λ > 0. As this rank minimization is a notoriously challenging problem, we instead solve a convex relaxation of it. Note that the nuclear norm ball is the convex hull of rank-1 matrices Recht et al. (2010). Analogous to l1-norm in the case of sparse vectors, nuclear norm is a tight convex surrogate for low-rank solutions. We propose the following nuclear norm regularized optimization problem, bΘ arg min Θ Ω L(Θ) + λ |||Θ|||nuc, (4) where Ωis a convex constraint which takes care of identifiability and Lipschitz smoothness conditions. Nuclear norm regularization has been widely used Recht et al. (2010) for rank minimization; however, provable guarantees exist only for quadratic loss functions L(Θ) Cand es and Recht (2009); Negahban and Wainwright (2012). Our analyses extend such results to a convex loss, by first proving that L( ) satisfies restricted strong convexity property with high probability. Similar to how (non-collaborative) rank aggregation has been generalized to any strongly log-concave distribution in Shah et al. (2014), our analysis can naturally be extended to a general class of strongly log-concave distributions. We give the expression for the log likelihood in Equation. (8) for pairwise comparisons. 3. Learning MNL Model from Pairwise Comparisons under Graph Sampling Probabilistic model for sampling. In order to provide performance guarantees on the proposed approach, we need to specify how we sample the pairs that are to be compared. We provide a novel sampling model, which we call graph sampling with respect to a weighted graph G. This naturally generalizes Bernoulli sampling typically studied under matrix completion literature Cand es and Recht (2009); Keshavan et al. (2010a); Negahban and Wainwright (2012); Jain et al. (2013), and the resulting analysis captures how the performance depends on the topology of the samples. Note that the proposed graph sampling is different from deterministic sampling graphs studied in Hajek et al. (2014); Shah et al. (2016a). This is analytically tractable only in the simpler case of estimating the weight vector of the PL model where there is only one user and the ML estimator is a convex program. However, such deterministic sampling is notoriously hard to handle for matrix estimation, even in the simpler case of matrix completion Bhojanapalli and Jain (2014). Hence, we introduce a probabilistic model that allows enough flexibility to capture the interesting aspects of sampling biases, i.e. grouping. Precisely, we have a weighted undirected graph G = ([d2], E, {Pj1,j2}(j1,j2) E) with d2 nodes, which represent items, a set of edges E and the edge weight Pj1,j2 between nodes j1 and j2. The weights can be written in a symmetric matrix P Rd2 d2, and Pj1,j2 +Pj2,j1 = 2Pj1,j2 represent the probability with which the pair (j1, j2) is chosen for comparison. Note Learning from Comparisons and Choices that Pj,j = 0 , j [d2], Pj1,j2 = Pj2,j1 and P j1,j2 [d2] Pj1,j2 = 1. We assume we get i.i.d. samples from first choosing a random user among [d1] users, and then choosing a pair (j1, j2) of items at random from P, and finally getting a random comparison from the MNL model, i.e. the probability with which user i prefers item j1 over item j2 is exp Θ ij1/ exp Θ ij1 + exp Θ ij2 . One of the most important aspects of real-world data that is captured by this graph sampling model is grouping. Consider two groups of items, say, cars and phones. It does not make sense to ask an individual to compare a phone with a brand of a car (i.e. direct comparison is not feasible), but knowing an individual s preference on cars can help in learning her preference on phones. In graph sampling terms, we are sampling from a graph G consisting of two disjoint cliques: one for cars and another for phones. By analyzing such a sampling scenario, we want to characterize the gain in using the data from both groups of items together, although there are no inter-group comparisons. In the preference matrix Θ , the values in the set of columns corresponding to each connected component in the sampling graph can be arbitrarily shifted together, without changing the pairwise comparisons outcome distributions. This is because adding the same constant to those items that are compared does not change the probability (for those items within the same group), i.e. P {j1 > j2} = eΘ ij1 eΘ ij1 + eΘ ij2 = eΘ ij1+c eΘ ij1+c + eΘ ij2+c , and adding different constants to those items that are not in the same group does not change the probability of the outcome as those items are never compared. Hence, to handle this unidentifiability, we let a centered version of Θ represent all those shifted versions defining the same probability distribution. Formally, let a zero-one vector gk {0, 1}d2 denote the group membership such that gi,k = 1 if item j is in group k, else gi,k = 0. Note that, by definition, no item can be present in more than one group, that is, PG k=1 gk = 1, where G is the number of groups. We define an equivalence class of Θ which represent the same probabilistic model as [Θ ] = n Θ + k=1 ukg T k for all uk Rd1o . (5) To overcome the identifiability issue, we represent each equivalence class with the centered matrix satisfying Θ gk = 0, k {1, 2, . . . , G} (6) As matrices with large spikiness are known to be hard to estimate Negahban and Wainwright (2012), we capture the dependence of the sample complexity on the spikiness as measured by α := |||Θ ||| . This captures the dynamic range of the underlying preference matrix. For a related problem of matrix completion, where the loss L(θ) is quadratic, either a similar condition on ℓ norm is required or another condition on incoherence is required. Graph Laplacian. The performance of our approach depends on the sampling graph P via its graph Laplacian defined as L = diag(P1) P (7) Negahban, Oh, Thekumparampil, and Xu where diag(P1) is a diagonal matrix with P v Pu,v in the diagonals. Notice that, L is singular and the nullspace is spanned by vectors {gk}G k=1. Let σmax(L) = L 2 and σmin(L) be the smallest eigenvalue of L discounting the G zero-valued eigenvalues. Since the graph has G disconnected maximal components and L is real symmetric, by spectral theorem, L = UΣUT , where U is a matrix of size d2 (d2 G) and its d2 G columns form an orthonormal set, and Σ is a diagonal matrix such that its diagonal elements are the singular values of L. Let L := UΣ 1UT and Lx := UΣx UT for all x R. We also define the Laplacian induced norms of matrices as, |||Θ|||L := ΘL1/2 F, and, |||Θ|||L-nuc := ΘL1/2 nuc . These Laplacian induced norms are more appropriate to analyze and quantify the distance between the estimated matrix bΘ and Θ . When items k(i), l(i) are chosen for comparison by user j(i) as the i-th pair of items, we capture this choice with the matrix X(i) = ej(i)(ek(i) el(i))T . The outcome of the comparison is represented by yi, with yi = 1 when item k(i) wins over item l(i) and yi = 0 if otherwise. The log-likelihood of the comparison outcomes with respect to a parameter matrix Θ is, h yi Θ, X(i) log 1 + exp We propose and analyze the following convex optimization problem, bΘ argmin Θ Ωα L(Θ) + λ|||Θ|||L-nuc, (9) Ωα = n Θ Rd1 d2 | |||Θ||| α, Θgk = 0, k [G] o , (10) with an appropriately chosen λ = 8 n , σmin(L) 1/2 log(2d) with σ = max{ (d2 G)/d1, 1}, where d = (d1 +d2)/2. In practice, the sampling probability distribution P and the corresponding Laplacian L might not be known. In those cases, we propose using the empirical sampling probability distribution ˆP and corresponding empirical Laplacian ˆL instead. We describe this version of the algorithm formally in Section 3.4.4, where we empirically demonstrate the robustness of this approach. Further, in experiments with real data sets, we use the empirical Laplacian Section 3.4.5. 3.1 Performance Guarantee We consider the graph sampling scenario where each sample is i.i.d., the ℓ-th sample consists of user iℓchosen uniformly at random, pair of items (j1,ℓ, j2,ℓ) chosen according to the sampling graph G = ([d2], E, P), and the resulting outcome yℓdistributed as the MNL model with parameter Θ . Learning from Comparisons and Choices Theorem 1 Under the graph sampling with respect to G = ([d2], E, P) with a graph Laplacian L, and under the MNL preference model with preference matrix Θ , solving the optimization problem in (9) with n i.i.d. samples achieves, with probability greater than 1 1/4d3, Θ bΘ L1/2 F 2 36λ α + 1 ψ(2α) 2r Θ bΘ L1/2 F+ min{d1,d2 G} X j=r+1 σj(Θ L1/2) , (11) for any r {1, 2, . . . , min{d1, d2 G}}, any λ 8 n , σmin(L) 1/2 log(2d) where σ = max{(d2 G)/d1, 1} and d = (d1 + d2)/2, ψ(x) ex/(1 + ex)2, and for n min{26d2 1σ2, 22 d1σmin(L) 1 2/3} log(2d). We provide a proof in Appendix A. The above bound holds for any r, where r allows us to trade offthe two types of errors: the estimation error and the approximation error. Concretely, the above bound shows a natural splitting of the error into two terms; the first term corresponding to the estimation error for the top rank-r component of Θ and the second term corresponding to the approximation error for how well one can approximate Θ with a rank-r matrix. If we know the singular values of Θ , we can optimize over r to get the tightest bound. If Θ is exactly low-rank then applying a matching rank in the bound gives the following guarantee. Corollary 2 (Exact rank-r matrix) Under the same hypothesis as in Theorem 1 with a choice of λ = c0 max q n , σmin(L) 1/2 log(2d) for some c0 > 0, if Θ is exactly rank r, there exists a positive constant c1 such that the proposed estimator achieves, Θ bΘ L1/2 F c1 α + 1 ψ(2α) σ d1 log(2d) (σmin(L) 1d1) log(2d) with probability at least 1 2/(d1 + d2)3 and σ = max{(d2 G)/d1, 1}. The second term in the maximization is an artifact of the weakness of current analysis technique and does not reflect the actual error. This is confirmed in our simulation results on graphs with very small spectral gap in Figures 1.(b), (d), and (f), where the error in Laplacian-induced norm error does not decrease with spectral gap of L as the line graph has a much smaller spectral gap compared to a complete graph, for example. In fact, for a special Θ in Figure 1.(d) it is the other way, for which we do not have a theoretical explanation. The number of entries in Θ is d1d2 and we want to rescale the Frobenius norm error appropriately by 1/ d1d2. As a typical scaling of L1/2 is 1/ d2 in spectral norm, we Negahban, Oh, Thekumparampil, and Xu only need to rescale the Laplacian-induced norm error by 1/ d1 in the left-hand side of the above bound. For a rank-r Θ , the number of degrees of freedom in describing it is r(d1 +d2) r2 = O(r(d1 +d2)). The above theorem shows that the total number of samples n needs to scale as O(r(d1 + d2) log d) in order to achieve an arbitrarily small error. This is only a poly-logarithmic factor larger than the degrees of freedom. In Section 3.2 we make this comparison precise by providing a lower bound that matches the upper bound up to a logarithmic factor. The upper-bound constraint in Theorem 1 on the number of samples n can be met for large enough d1 and d2. For simplicity, assume that d1 = d2 = d and r is a constant. Since σmin(L) = O(1/d2), the upper-bound on n becomes O(max{d2, d4/3}). For large enough d, the upper bound on the RHS of Eq. (12) can be made arbitrarily small with n only scaling as O(r d). This is significantly smaller than the upper-bound of O(d2) on n. Further, in the experiments in Section 3.4, we show that n has no practical upper-bound constraint since the error decreases at the same rate as predicted, for arbitrarily large values of n. This constraint may not be necessary and might be a by-product of the proof techniques. The dependence on the dynamic range α, however, is sub-optimal. It is expected that the error increases with α, since the Θ scales as α, but the exponential dependence in the bound seems to be a weakness of the analysis (for example as seen from numerical experiments in the right panel of Figure 6). Although the error increase with α, numerical experiments suggest that it only increases at most linearly. However, tightening the scaling with respect to α is a challenging problem, and such sub-optimal dependence is also present in existing literature for learning even simpler models, such as the Bradley-Terry model Negahban et al. (2012) or the Plackett-Luce model Hajek et al. (2014), which are special cases of the MNL model studied in this paper. Another issue is that the underlying matrix might not be exactly low rank. It is more realistic to assume that it is approximately low rank. Following Negahban and Wainwright (2012) we formalize this notion with ℓq-ball of matrices defined as Bq(ρq) {Θ Rd1 d2 | X j [min{d1,d2}] |σj(Θ )|q ρq} . (13) When q = 0, this is a set of rank-ρ0 matrices. For q (0, 1], this is set of matrices whose singular values decay relatively fast. By optimizing the choice of r in Theorem 1, we get the following result. Corollary 3 (Approximately low-rank matrices) Suppose Θ Bq(ρq) for some q (0, 1] and ρq > 0. Under the hypotheses of Theorem 1, with a choice of λ = c0 max n , σmin(L) 1/2 log(2d) for some constant c0 > 0 there exists a constant c1 > 0 such that solving the optimization (9) achieves with probability at least 1 2/(d1 + d2)3, Θ bΘ L1/2 F c1 ρq d1 α + 1 ψ(2α) d2 1 σ log(2d)) provided n σ log(2d)/σmin(L). Learning from Comparisons and Choices This is a strict generalization of Corollary 2. For q = 0 and ρ0 = r, this recovers the exact low-rank estimation bound up to a factor of two. For approximate low-rank matrices in an ℓq-ball, we lose in the error exponent, which reduces from one to (2 q)/2. 3.2 Information-theoretic Lower Bound For a polynomial-time algorithm of convex relaxation, we gave in the previous section a bound on the achievable error. We next compare this to the fundamental limit of this problem, by giving a lower bound on the achievable error by any algorithm (efficient or not). A simple parameter counting argument indicates that it requires the number of samples to scale as the number of degrees of freedom i.e., n r(d1 + d2), to estimate a d1 d2 dimensional matrix of rank r. We construct an appropriate packing over the set of low-rank matrices with bounded entries in Ωα defined as (10), and show that no algorithm can accurately estimate the true matrix with high probability using the generalized Fano s inequality. This provides a constructive argument to lower bound the minimax error rate, which in turn establishes that the bounds in Theorem 1 is sharp up to a logarithmic factor, and proves no other algorithm can significantly improve over the nuclear norm minimization. Theorem 4 Suppose Θ has a rank r. Under the previously described graph based sampling model, there exists a constant c > 0 such that inf bΘ sup Θ Ωα E h 1 d1 Θ bΘ L1/2 F i c min e α r tr L r , d2 d1 log d where the infimum is taken over all measurable functions over the observed comparison results and L r is the pseudo inverse of the rank r approximation of the graph Laplacian. A proof of this theorem is provided in Appendix B. The term of primary interest in this bound is the first one, which shows the scaling of the (rescaled) minimax rate as p rd1/n and matches the upper bound in (12) up to a logarithmic factor. It is the dominant term in the bound whenever the number of samples is larger than n d1 max{tr L r , d1 log d/d2 2}. As suggested in numerical simulations on graphs with very small spectral gap in Figures 1.(b), (d), and (f), the dependence in tr L r is an artifact of the weakness of the current analysis technique. Here we note that, while the lower bound in Theorem 4 is in expectation, the upper bound in Theorem 1 is a high-probability result. The upper bound can immediately be translated into a bound in expected error with an additional term scaling as ασmax(L)1/2 d2d 3, which is smaller than other terms in the bound. 3.3 Performance Guarantee and Lower Bound for Complete Graph It follows from a simple relation |||(Θ bΘ)L1/2|||F σ1/2 min|||Θ bΘ|||F , which is true since Θ , bΘ are in the range space of L, that the above upper bounds automatically give the error bound in the Frobenius norm. When the sampling graph is uniform, i.e. a complete graph Negahban, Oh, Thekumparampil, and Xu with equal weights Pj1,j2 = 1/d2(d2 1), j1 = j2, Frobenius norm is the right metric and we show matching upper and lower bounds. Corollary 5 (Complete graph upper-bound) Under the same hypothesis as in Corol- lary 2, if G is a complete graph, with a choice of λ = c0 max q (d2 1) log(2d) for some c0 > 0, if Θ is exactly rank r, there exists a positive constant c1 such that the proposed estimator achieves, Θ bΘ F p d1(d2 1) c1 α + 1 ψ(2α) σ d1 log(2d) (d2 1)d1 log(2d) with probability at least 1 2/(d1 + d2)3 and σ = max{(d2 1)/d1, 1}. Corollary 6 (Complete graph lower-bound) Suppose Θ has rank r. Under the previously described graph based sampling model with graph being a complete graph, there is a universal numerical constant c > 0 such that inf bΘ sup Θ Ωα E h 1 p i c min e α r (d2 1) , d2 d1 log d where the infimum is taken over all measurable functions over the observed comparison results. 3.4 Experiments We provide a first-order method to solve the proposed convex optimization, and provide numerical experiments using this algorithm. We present two simulation results followed by an experiment on real data. For the synthetic experiments, we generate random rank-r matrices of dimension d d, of the form Θ = UV T with U Rd r and V Rd r entries generated i.i.d from uniform distribution over [0, 1]. Then the connected-component-mean is subtracted form each connected component, and then the whole matrix is scaled such that the largest entry is α = 5. Note that this operation does not increase the rank of the matrix Θ. This is because this de-meaning can be written as Θ P k Θgg T /(g T 1) and both terms in the operation are of the same column space as Θ which is of rank r. 3.4.1 Algorithm Let Θ ΘL1/2. As the nuclear norm regularizer in (9) is non differentiable, we use the proximal gradient descent Agarwal et al. (2010); Cai et al. (2010). At each iteration, we apply the following two operations on the current estimate, Θ t, of Θ L1/2, f Θ t+1 = Θ t ηt ΘL(Θ t L 1/2)L 1/2 (gradient descent) (18) Θ t+1 = Mt(Γt ηtλI)+NT t (singular value shrinkage and thresholding) (19) Learning from Comparisons and Choices where MtΓt NT t := f Θ t is the singular value decomposition of f Θ t, such that Γt is a diagonal matrix with positive entries, ( )+ is the entry-wise thresholding operation max(0, x), and ηt is an appropriate step-size. Constraint of zero row sum, is taken care of by initializing the descent algorithm with Θ 0 = 0, since rows of gradients sum to zero. In practice we do not know the value of α, and hence in experiments we do not enforce the Θ α constraint. Another issue in the implementation is that the convergence rate can be significantly slower for some graph topologies. We accelerate the proximal gradient descent with the following (modified) Barzilai-Borwein (BB) rule Barzilai and Borwein (1988) for choosing the step-size ηt, |||Θ t Θ t 1|||2 2 Θ t Θ t 1, Θ L (Θ t) Θ L (Θ t 1) , when t is odd Θ t Θ t 1, Θ L (Θ t) Θ L (Θ t 1) ||| Θ L (Θ t) Θ L (Θ t 1)|||2 2 , when t is even , (20) where Θ L (Θ ) := ΘL(Θ L 1/2)L 1/2. We stop the descent algorithm whenever an upper bound of the KKT error is smaller than 10 5. 3.4.2 The Role of the Topology of the Sampling Pattern In figure 1, we plot the error of our nuclear norm minimization based algorithm versus number of samples (in log-scale), n for d1 = d2 = 300, r = 4, α = 5.0, G = 1. We consider two errors here; root mean squared error (RMSE) = Θ bΘ F/ d1d2 and Laplacian induced RMSE (L-RMSE)= (Θ bΘ)L1/2 F/ d1. We plot these errors for four topologies of varying spectral gaps. As discussed Section 3.1, we do not expect the L-RMSE error to change much as we change the topology of sampling. However, as seen from the simple relation |||(Θ bΘ)L1/2|||F σ1/2 min|||Θ bΘ|||F Frobenius norm error is more sensitive to the topology of the sampling pattern, captured via the spectral gap, i.e. σmin(L). Specifically we use the following graph topologies. Complete graph. We first consider a uniform sampling over a complete graph where Pj1,j2 = 1/d2(d2 1) for all j1, j2 [d2]. The resulting spectral gap is 1/(d2 1), which is the maximum possible value. Hence, complete graphs are optimal for learning MNL models, compared in the error metric of the Frobenius norm for fairness. Star graph. Here we choose one item to be the center, and every other items can only be compared to this center item uniformly at random. Let item 1 be the center one, then Pj1,1 = P1,j2 = 1/2(d2 1). Standard spectral analysis shows that the spectral gap is Θ(1/d2), and thus the graph is near-optimal for learning MNL models. Line graph. Next, we consider a line graph with d2 1 edges where Pj,j+1 = Pj+1,j = 1/2(d2 1). It has a spectral gap of Θ(1/d2 2), and is strictly sub-optimal for learning MNL models. Barbell graph. Consider two equal sized groups of items. Within each group the sub-graph is complete, and between the groups there is a single edge connecting one of the node from group one and one of the node from group two. Each edge is chosen Negahban, Oh, Thekumparampil, and Xu (a) RMSE for i.i.d. Θ ij (b) L-RMSE for i.i.d. Θ ij (c) RMSE for barbell bias Θ ij (d) L-RMSE for barbell bias Θ ij (e) RMSE for line bias Θ ij (f) L-RMSE for line bias Θ ij Figure 1: Graphs with small spectral gap achieve significantly larger Frobenius norm error (RMSE) Θ bΘ F/ d1d2, whereas the Laplacian-induced norm error (L-RMSE) (Θ bΘ)L1/2 F/ d1 is not sensitive to the spectral gap. Learning from Comparisons and Choices uniformly at random for comparisons. The resulting spectral gap is Θ(1/d2 2), and this graph too is strictly sub-optimal for learning MNL models. First in sub-figures 1a, 1b, we plot RMSE and L-RMSE errors for different graphs using randomly generated Θ ij. We see that L-RMSE curves for different graphs are the same (and slopes in log-scale are as expected approaches 1/2 with more samples). Further, we do not see any significant difference w.r.t the graph topology even when error is measured in Frobenius norm. The reason is that since Θ ij s are generated i.i.d., the empirical distributions of any large sub-group of items would be similar. Thus, the means of the two cliques of the barbell graph or the means of the items on the two far ends of the line graph are similar. Thus although barbell and line graphs have small spectral gap (high mixing-time), its effect is minimized because these sub-groups can individually be solved without having them to mix since the empirical distributions of the Θ ij in the two sub-groups are similar. To illustrate the role of the topology of the graph, we choose specific Θ which depends on the topology of the graph as guided by our analysis on the lower bound (Theorem 4) in sub-figures 1c, 1d. The items are divided into two sets (corresponding to each side of barbell graph), such that corresponding Θ ij are i.i.d. inside a set but have similar but shifted means across the sets. We call this type of preference data as barbell biased. As expected from theoretical analyses, L-RMSE behave similar to the i.i.d. case. However, we see the Frobenius norm error significantly worse in the case of line and barbell shaped graphs, as expected from the Frobenius error bound. In sub-figures 1e, 1f, we simulate line biased preference data Θ . Items are ordered (in the order of the line graph), such that Θ ij s have similar distributions but their means get shifted in an arithmetic progression as you go down the ordering. Again, Frobenius norm error is significantly larger for line and barbell graphs as spectral gaps are small. 3.4.3 The Gain in Inference over Multiple Groups of Items Consider G groups of items such that, within each group, every pair of items is uniformly likely to get compared, but items from different group are never compared with each each other. As a baseline, one can run inference on each group separately. On the other hand, we propose running inference on all the G groups jointly. Let bΘ be the estimate of Θ when solving the groups together, and let Θ be the estimate when groups are estimated separately. Let L, L(k) be the graph Laplacians of the whole graph and k-th connected component (group) respectively. suppose, for simplicity, that d1 = d2 and the groups are equally sized complete sub-graph components, L = 1 (d2 G) k=1 gkg T k , and, (21) L(k) = 1 (d2/G 1) d2 11T . (22) According to Theorems 1 and 4, the L-RMSE error of bΘ satisfies, Θ bΘ L1/2 F = 1 p Negahban, Oh, Thekumparampil, and Xu Figure 2: As the number of groups increase, the gain in joint inference increases. Similarly L-RMSE error (with respect to the full Laplacian L) of Θ satisfies, 2 = 1 d1(d2 G) Θ Θ F 2 (24) (a) = (d2/G 1) Θ k Θk (L(k))1/2 2 F d2 (25) where (a) follows from Eq. (22) and assuming Θ k, Θk are sub-matrices restricted to the columns in group k. Thus the estimation errors when running joint inference and separate inference for each group are of the order of OG(1) and OG( G) respectively. That is, a user s preference in one group of items will be useful in inferring the same user s preference in another group of items. We illustrate this gain of joint inference in Figure 2. Concretely, the sampling graph G has G groups where each component is a complete graph and d1 = d2 = 360, r = 4, α = 5.0, n = 214. Figure 2 plots the L-RMSE (RMSE) = (Θ bΘ)L1/2 F/ d1 ( (Θ bΘ) F/ d1 d2) errors vs. G, when all the groups are solved together (labelled as LMSE and MSE) or when the groups are solved separately (labelled as LMSE Alone and MSE Alone) using our algorithm. We see that solving the components together keeps the error relatively similar as the number of groups increase, but if we solve Learning from Comparisons and Choices 103 104 105 106 number of comparisons, n line: ˆL barbell: L barbell: ˆL Figure 3: L-RMSE for various sampling graphs when true Laplacian L is known (solid) and when empirical Laplacian ˆL is used (dashed) the groups separately the error increases with number of groups, although it is at a lower rate than predicted by the upper bound. 3.4.4 Robustness to the Mismatched L-nuc Norm Regularizer In practical scenarios, one might not have access to the sampling graph Laplacian L. We propose using empirical Laplacian ˆL defined as ˆL diag ˆP1 ˆP , (28) where ˆP Rd2 d2 is the empirical distribution of sampled pairs in the given data. Under the experimental setting from Figure 1b, we run additional experiments with this empirical Laplacian ˆL in the optimization: minimize L(Θ) + λ|||Θ|||ˆL-nuc . Figure 3 illustrates that the effect on the performance of not knowing the true L is marginal. Both approaches achieve the same error. 3.4.5 Real data: Food100 To showcase the practicality of our nuclear norm based algorithm (9) we apply our algorithm to the Food100 Data set2 Wilber et al. (2014). In the data set, n = 250320 triplets, denoted by {(ai, bi, ci)}i, of 3 distinct food dishes from a selection of d = 100 were sampled. Then in 2. Data set is from https://vision.cornell.edu/se3/projects/cost-effective-hits. Negahban, Oh, Thekumparampil, and Xu a crowdsourcing setting, users were asked if, ai is more similar to bi than to ci. The goal is to learn an low-dimensional embedding of the 100 food items where the above similarities are captured. We model the problem as learning an MNL model, parameterized by Θ , which gives the following probability distribution for i-th user s answer, P {ai is more similar to bi than to ci} = eΘ ai,bi eΘ ai,bi + eΘ ai,ci . This is the same model as the pairwise comparisons from Section 2, except for the fact that instead of a user (row) comparing two items (columns), here we compare a food item (row) to two other food items (columns). We implement three different algorithms: our nuclear norm based algorithm ( nucnorm ), unregualrized (λ = 0) likelihood maximization ( fullrank ) and maximum likelihood based algorithm to learn rank-1 Plackett-Luce model Luce (1959); Plackett (1975) ( plackett ). In Figure 4, we plot the mean log-likelihood of the learned model versus fraction x of the data used for training for the various algorithms for testing (a) and training (b) data. If x fraction of the data is used for training, we use the rest (1 x) of the data for testing. For the nuclear norm minimization, we estimate the Laplacian L using the empirical distribution of the triplets and λ is chosen to be 0.1 p log(d)/2 d xn. In the Fig. 4(b) (to the left) on the testing data set, we see that our MNL model based nuclear norm regularized algorithm clearly outperforms both unregularized algorithm and the Placket-Luce model estimator, especially when there is less training data. In fact, the mean likelihood (log(Pmodel(test data))) on the testing data remains relatively the same when we decrease the size of the training data, which supports our claim that real data has low-rank structure. In the Fig. 4(b) (right) on the training data sets, the non-regularized approach of fullrank achieves higher likelihood on the training data, indicating that it overfits to training data. (a) Mean log-likelihood on testing data set (b) Mean likelihood on training data set Figure 4: Mean log-likelihood vs fraction of the total data used for training. Our nuclear norm regularized algorithm ( nucnorm ) fits the test data better than both unregularized algorithm ( fullrank ) and Plackett-Luce model based estimation, especially when training data is small in size. Learning from Comparisons and Choices Figure 5: Food100: t-SNE embedding of the columns of the learned MNL parameter bΘ. Desserts (bottom left) are seperated from other dishes. Meat dishes are also separated from vegetable dishes. In Fig. 5 we plot the t-SNE embedding Maaten and Hinton (2008) of the columns of the estimated MNL parameter matrix bΘ when all the data is used for training. The desserts (left bottom) are separated from other dishes, and meat dishes and salad dishes form two clusters (top right). 4. Learning the MNL Model under Higher Order Comparisons Higher order comparisons, where a subset of k items are offered to a user who then provides a complete ranking (total linear ordering) of those item, is a natural generalization of pairwise comparisons, that captures some aspect of heterogeneous and complex modern data sets. We refer to such scenarios as k-wise comparisons or k-wise rankings. The MNL model generalizes to such comparisons. Let Θ be the d1 d2 dimensional matrix capturing the preference of d1 users on d2 items, where the rows and columns correspond to users and items, respectively. In this k-wise ranking setting, when a user i is presented with a set, Negahban, Oh, Thekumparampil, and Xu Si [d2], of k alternatives she reveals her preferences as a ranked list over those items. To simplify the notations, we assume that all the users compare the same number k of items, but the analysis naturally generalizes to the case when the size might differ from a user to a user and when each user provides more than one k-wise ranking. Let vi,ℓ Si denote the (random) ℓ-th best choice of user i. Each user gives a ranking, independent of other users rankings, from P {vi,1, . . . , vi,k|Si is presented to user i} = e Θ i,vi,ℓ P j Si,ℓeΘ i,j , (29) where with Si,ℓ Si \ {vi,1, . . . , vi,ℓ 1} and Si,1 Si. For a user i, the i-th row of Θ represents the underlying preference vector of the user, and the more preferred items are more likely to be ranked higher. Similar to the pairwise comparisons, the distribution (29) is independent of shifting each row of Θ by a constant. Since we can only estimate Θ up to this equivalent class, we search for the one whose rows sum to zero, i.e. P j [d2] Θ i,j = 0 for all i [d1]. For capturing the spikiness Negahban and Wainwright (2012) of Θ , we define α maxi,j1,j2 |Θ ij1 Θ ij2| to denote the dynamic range of the underlying Θ , such that when k items are compared, we always have 1 ke α 1 1 + (k 1)eα P {vi,1 = j} 1 1 + (k 1)e α 1 for all j Si, all Si [d2] satisfying |Si| = k and all i [d1]. We do not make any assumptions on α other than that α = O(1) with respect to d1 and d2. Given this definition, we solve the following optimization bΘ arg min Θ Ωα L(Θ) + λ|||Θ|||nuc, (31) L(Θ) = 1 k d1 Θ, eie T vi,ℓ log Ωα = n A Rd1 d2 |||A||| α, and i [d1] we have X j [d2] Aij = 0 o . (33) Note that unlike graph sampling for pairwise comparisons, we assume that each user is presented a subset of k items and provides a complete ranking over those k items. This choice of sampling scenario, together with independent choices of the items in subset Si s, is crucial for getting a bound that is tight in its scaling with respect to not only d1, d2, and r, but also k, as a certain independence is required to apply the symmetrization technique (in Lemma 29) which gives us the desired tight bound on the error. It trivially follows from our analysis that one can relax the assumptions in the sampling scenario significantly (e.g. sampling without replacement, heterogeneous sampling probabilities for each item-user pair, etc.), and the only change in the upper bound of Eq. (34) will be a weaker dependence k. Learning from Comparisons and Choices 4.1 Performance Guarantee We provide an upper bound on the resulting error of our convex relaxation, when a multi-set of items Si presented to user i is drawn uniformly at random with replacement. Precisely, for a given k, Si = {ji,1, . . . , ji,k} where ji,ℓ s are independently drawn uniformly at random over the d2 items. Further, if an item is sampled more than once, i.e. if there exists ji,ℓ1 = ji,ℓ2 for some i and ℓ1 = ℓ2, then we assume that the user treats these two items as if they are two distinct items with the same MNL weights Θ i,ji,ℓ1 = Θ i,ji,ℓ2. The resulting preference is therefore always over k items (with possibly multiple copies of the same item), and distributed according to (29). For example, if k = 3, it is possible to have Si = {ji,1 = 1, ji,2 = 1, ji,3 = 2}, in which case the resulting ranking can be (vi,1 = ji,1, vi,2 = ji,3, vi,3 = ji,2) with probability (eΘ i,1)/(2 eΘ i,1 + eΘ i,2) (eΘ i,2)/(eΘ i,1 + eΘ i,2). Such a sampling with replacement is necessary for the analysis, where we require independence in the choice of the items in Si in order to apply the symmetrization technique (e.g. Boucheron et al. (2013)) to bound the expectation of the deviation (cf. Appendix C.4). Similar sampling assumptions have been made in existing analyses on learning low-rank models from noisy observations, e.g. Negahban and Wainwright (2012). Let d (d1 + d2)/2, and let σj(Θ ) denote the j-th singular value of the matrix Θ . Define d1 log d + d2 (log d)2(log 2d)4 k d2 1 d2 . (34) Theorem 7 Under the described sampling model, assume 24 k min{d2 1 log d, (d2 1 + d2 2)/(2d1) log d, (1/e) d2(4 log d2 + 2 log d1)}, and λ [480λ0, c0λ0] with any constant c0 = O(1) larger than 480. Then, solving the optimization (31) achieves 2 e4αc0λ0 r bΘ Θ F + 288e4αc0λ0 min{d1,d2} X j=r+1 σj(Θ ) , (35) for any r {1, . . . , min{d1, d2}} with probability at least 1 2d 3 d 3 2 where d = (d1+d2)/2. A proof is provided in Appendix C. This bound holds for all values of r and one could potentially optimize over r. We show such results in the following corollaries. Corollary 8 (Exact low-rank matrices) Suppose Θ has rank at most r. Under the hypotheses of Theorem 7, solving the optimization (31) with the choice of the regularization parameter λ [480λ0, c0λ0] achieves with probability at least 1 2d 3 d 3 2 , r(d1 log d + d2 (log d)2(log 2d)4) k d1 . (36) The number of entries is d1d2 and we rescale the Frobenius norm error appropriately by 1/ d1d2. For a rank-r matrix Θ with r(d1 + d2) r2 = O(r(d1 + d2)) degrees of freedom, the above theorem shows that the total number of samples, which is (k d1), needs to scale Negahban, Oh, Thekumparampil, and Xu as O(rd1(log d) +rd2 (log d)2(log 2d)4) in order to achieve an arbitrarily small error. This is only poly-logarithmic factor larger than the degrees of freedom. In Section 4.2, we provide a lower bound on the error directly, that matches the upper bound up to a logarithmic factor. The dependence on the dynamic range α is sub-optimal. The exponential dependence in the bound seems to be a weakness of the analysis, as seen from numerical experiments in the right panel of Figure 6. Although the error increase with α, numerical experiments suggests that it only increases at most linearly. A practical issue in achieving the above rate is the choice of λ, since the dynamic range α is not known in advance. Figure 6 illustrates that the error is not sensitive to the choice of λ for a wide range. For approximately low-rank matrices in ℓq-ball defined in (13), optimizing the choice of r in Theorem 7, we get the following result. This is a strict generalization of Corollary 8 and a proof of this Corollary is provided in Appendix D. Corollary 9 (Approximately low-rank matrices) Suppose Θ Bq(ρq) for some q (0, 1] and ρq > 0. Under the hypotheses of Theorem 7, solving the optimization (31) with the choice of the regularization parameter λ [480λ0, c0λ0] achieves with probability at least 1 2d 3, bΘ Θ F 2 ρq d1d2 d1d2(d1 log d + d2 (log d)2(log 2d)4) 4.2 Information-theoretic Lower Bound for Low-rank Matrices A simple parameter counting argument indicates that it requires the number of samples to scale as the degrees of freedom i.e., kd1 r(d1 + d2), to estimate a d1 d2 dimensional matrix of rank r. By applying Fano s inequality with appropriately chosen hypotheses, the following lower bound establishes that the bound in Theorem 7 is sharp up to a logarithmic factor. Theorem 10 Suppose Θ has rank r. Under the described sampling model, for large enough d1 and d2 d1, there is a universal numerical constant c > 0 such that inf bΘ sup Θ Ωα E h 1 d1d2 r d2 k d1 , αd2 d1d2 log d where the infimum is taken over all measurable functions over the observed ranked lists {(vi,1, . . . , vi,k)}i [d1]. A proof of this theorem is provided in Appendix E. The term of primary interest in this bound is the first one, which shows the scaling of the (rescaled) minimax rate as p r(d1 + d2)/(kd1) (when d2 d1), and matches the upper bound in (35). It is the dominant term in the bound whenever the number of samples is larger than the degrees of freedom by a logarithmic factor, i.e., kd1 > r(d1 + d2) log d, ignoring the dependence on α. This is a typical regime of interest, where the sample size is comparable to the latent dimension of the problem. In this regime, Theorem 10 establishes that the upper bound in Theorem 7 is minimax-optimal up to a logarithmic factor in the dimension d. Learning from Comparisons and Choices 4.3 Rank Breaking for Higher Order Comparisons A common approach in practice to handle higher order comparisons is rank breaking, which refers to the practice of breaking the higher order comparisons into a set of pairwise comparisons and applying an estimator tailored for pairwise comparisons treating each pair as independent Azari Soufiani et al. (2013, 2014). When the higher order comparison is given as partial rankings (as opposed to total linear ordering as we assume) then rank breaking can be inconsistent, and special algorithms are needed for weighted rank breaking Khetan and Oh (2016a,b). However, when k-wise rankings (also called total linear orderings) are observed as we assume, simple and standard rank breaking achieves a similar performance as the higher order estimator in (31). Assume that ui,m, i [d1], m [k], denotes the m-th element observed by the i-th user. Concretely, in rank breaking, we convert the kwise ranking data into pairwise ranking data and then we solve the following optimization problem: Θi, hi(m1,m2) log exp Θi, ui,m1 + exp Θi, ui,m2 where P0 = {(i, j) : 1 i < j k}, and hi (m1, m2) and li (m1, m2) is defined as the higher and lower ranked index among ui,m1 and ui,m2 respectively. Then modified optimization problem becomes, bΘ arg min Θ Ωα L(Θ) + λ|||Θ|||nuc (40) Let d (d1 + d2)/2, and let σj(Θ ) denote the j-th singular value of the matrix Θ . Define d log d k d2 1 d2 . (41) With this choice of regularization coefficient, we get the following upper bounds on the rank breaking estimator (40) that are comparable to the upper bounds of k-wise ranking estimator in Theorem 7 and Corollary 8. Theorem 11 Under the described sampling model, assume 2(c + 4) log d k max{d1, d2 2/d1} log d, d1 4, and λ [2 p 32(c + 4)λ0, cpλ0] with any constant c = O(1) larger than 2 p 32(c + 4). Then, solving the optimization (40) achieves 2 e2αcλ r bΘ Θ F + 144e2αcλ min{d1,d2} X j=r+1 σj(Θ ) , (42) for any r {1, . . . , min{d1, d2}} with probability at least 1 2d c 2d 213 where d = (d1 + d2)/2. A proof of this theorem is provided in Appendix F, and the following corollary follows for rank-r matrices. Negahban, Oh, Thekumparampil, and Xu Corollary 12 (Exact low-rank matrices) Suppose Θ has rank at most r. Under the hypotheses of Theorem 11, there exists a constant c1 > 0 such that solving the optimization (40) with the choice of the regularization parameter λ [2 p 32(c + 4)λ0, cλ0] achieves with probability at least 1 2d c 2d 213, k d1 . (43) 4.4 Experiments We provide results from numerical experiments on both synthetic and real data sets. 4.4.1 Algorithm Similar to the case of pairwise comparisons in Section 3.4.1, we use proximal gradient descent Agarwal et al. (2010); Cai et al. (2010) along with modified Barzilai-Borwein (BB) step-size selection rule Barzilai and Borwein (1988) with the initial point Θ0 = 0. Each iteration of the algorithm applies the following two operations on the current estimate, Θt, of Θ , eΘt+1 = Θt ηt ΘL(Θt) (gradient descent) (44) Θt+1 = Mt(Γt ηtλI)+NT t (singular value shrinkage and thresholding) (45) where MtΓt NT t := eΘt is the singular value decomposition of eΘt, such that Γt is a diagonal matrix with positive entries, ( )+ is the entry-wise thresholding operation max(0, x), and ηt is an BB step-size calculated as, ( |||Θt Θt 1|||2 2/ Θt Θt 1, ΘL(Θt) ΘL(Θt 1) , when t is odd Θt Θt 1, ΘL(Θt) ΘL(Θt 1) /||| ΘL(Θt) ΘL(Θt 1)|||2 2, when t is even . 4.4.2 Simulation: Higher Order Comparisons The left panel of Figure 6 confirms the scaling of the error rate as predicted by Corollary 8. The lines merge to a single line when the sample size is rescaled appropriately (inset). We make a choice of λ = p (log d)/(kd2). This choice is independent of α and is smaller than proposed in Theorem 7. We generate the random rank-r true MNL parameters matrices of dimension d d using the process mentioned in Section 3.4.1. The root mean squared error (RMSE) is plotted where RMSE = (1/ d1 d2)|||Θ bΘ|||F. We implement and solve the convex optimization (31) using proximal gradient descent method as analyzed in Agarwal et al. (2010). The right panel in Figure 6 illustrates that the actual error is insensitive to the choice of λ for a broad range of λ [ p (log d)/(kd2), 28p (log d)/(kd2)], after which it increases with λ. 4.4.3 Simulation: Rank Breaking In this section we compare the higher order k-wise comparison algorithm (31) ( kwise ) with the pairwise rank breaking algorithm (40) ( kbreak ). We use the same setting as Learning from Comparisons and Choices 100 1000 10000 100000 0.01 d=50, r=3 d=50, r=6 d=50, r=12 d=50, r=24 sample size k 1 10 100 1000 10000 100000 (log d)/(kd2) α = 15 α = 10 α = 5 Figure 6: The (rescaled) RMSE scales as p r(log d)/k as expected from Corollary 8 for fixed d = 50 (left). In the inset, the same data is plotted versus rescaled sample size k/(r log d). The (rescaled) RMSE is stable for a broad range of λ and α for fixed d = 50 and r = 3 (right). in Section 4.4.2, where we observe samples from k-wise ranking from an underlying true MNL model and the aim is to recover the true parameter Θ of the model. We use λ = 0.45 p (log d)/(kd2) and λ = 0.1 p (log d)/(kd2) for k-wise and pairwise rank breaking algorithms respectively. In Fig. 7 we plot the RMSE for both the algorithm for d = 50 and r = 3, 12. We note that the even though the RMSE decreases in the rate as predicted by the theorem, we see that pairwise rank breaking is worser than the higher order k-wise algorithm which directly uses the k-wise rankings. This is consistent with the experimental observation made previously in Hajek et al. (2014). Further we note that rank breaking is much slower than the other algorithm, since gradient computation of the former takes O(k2) time whereas for the latter it can be computed in O(k) time. 4.4.4 Real data: Jester Jester data set3 Goldberg et al. (2001) has 24, 982 users, each rating a subset of 100 jokes on continuous scale of [ 10, 10]. As the scale is continuous, we derive ordinal data from the scores (ties broken uniformly at random). We use only the 7200 users who rated all the jokes for our experiments. For each user, k = 100x jokes were randomly selected uniformly at random for training, rest of the 100 k = 100(1 x) jokes where used for testing, where x is the fraction of jokes selected for training. We implment four algorithms: nuclear norm minimization ( nucnorm ) (31), unregularized (λ = 0) log-likelihood maximization ( fullrank ), rank-1 Plackett-Luce model estimation ( plackett ), and rank breaking algorithm ( rankbreak ) (40). We use λ = 0.7 p (0.5 log(d1d2))/(kd1 d1d2) and λ = 0.16 p (0.5 log(d1d2))/(kd1 d1d2) for k-wise and pairwise rank breaking algorithms respectively. In Fig. 8 (a) we plot the multiplicative bias in the mean log-likelihood on the testing data versus the fraction x of training data used. For each model in { nucnorm , 3. Data set is from http://eigentaste.berkeley.edu/dataset/. Negahban, Oh, Thekumparampil, and Xu sample size k Figure 7: Rank Breaking: RMSE error versus number of samples per user k. k-wise ( kwise ) algorithm performs better than the rank breaking ( kbreak ) approach. fullrank , plackett , rankbreak }, we plot in the y-axis log(Pmodel(test data)) log(Pfullrank(test data)) |log(Pfullrank(test data))| , using fullrank model as a baseline as it has the least test likelihood. Plackett-Luce model achieves the best performance when sample size is small, as this simplest model avoids overfitting. However, for most regimes of sample size, both the nuclear norm minimization and rank breaking achieve similar performance improving upon the others. The same trend holds when we measure the perfomrance in the normalized Spearman s footule distance Diaconis and Graham (1977) F(π1, π2) [0, 1] between two rank-lists π1, π2 of length k: F(π1, π2) = 2 i=1 |π1(i) π2(i)| In Fig. 8 (b) we plot the average normalized Spearman s footrule distance between the ground truths and the most likely ranking on the testing data under the estimated model parameters. We see that k-wise nuclear norm minimization and rank breaking algorithms perform the best in recovering the true ranking, except when the fraction of training data used is very small so that the rank-1 Plackett-Luce recovers better ranking. 4.4.5 Real data: Irish Election The Irish Election data set4 is an opinion poll conducted among 1083 participants during the 1997 Irish presidential election campaign Gormley and Murphy (2009). Each participant responded with a ranking the of their top 1, 2, 3, 4, or 5 choices from the 5 candidates: 4. Data set is from https://projecteuclid.org/euclid.aoas/1231424218#supplemental. Learning from Comparisons and Choices (a) Multiplicative bias in the mean log-likelihood (b) Spearman s footrule distance Figure 8: Jester data set: Performance on test data vs. fraction of the total data used for training. The proposed nuclear norm regularized algorithm ( nucnorm ) and rank breaking ( rankbreak ) improves upon both the unregularized algorithm ( fullrank ) and the Plackett Luce model estimation ( plackett ) for most regimes of the sample size. Banotti, Mc Aleese, Nally, Roche, and Scallon. For our experiments we use only the 807 participants who gave their top-5 choices, i.e. full-rankings of all the candidates. Next we divide these participants into 60 (2x3x5x2) group according to a Cartesian product of four categorizations: sex (male/female), marital status (single/married/widowed+divorced), social class (F/AB/C1/C2/DE)5, location (rural/city+town). We assume that within each group the responses of its member follow the same distriubtion and these distributions of all all the groups are captured by an MNL model with parameter Θ R60 5. We implement three algorithms: nuclear norm minimization ( nucnorm ) (31), unregularized (λ = 0) log-likelihood maximization ( fullrank ), and rank-1 Plackett-Luce model estimation ( plackett ). We use λ = 0.8 p (0.5 log(d1d2))/(kd1 d1d2). If x randomly sampled fraction of the data is used for training, then rest of the data is used for testing. In Fig. 9 we plot the mean log-likelihood (log(Pmodel(test data))) on the testing data versus the fraction of training data used. We see that nuclear norm minimization and Plackett-Luce model estimation tie for the first place and both improves significantly upon the un-regularized full-rank MNL model estimation. In Fig. 11 we plot the t-SNE Maaten and Hinton (2008) embedding of the rows of the estimated parameter matrix bΘ when all the data is used for training. In Fig. 11a the markers represent the marital status of the group: single/married/divorced+widowed. In Fig. 11b the markers represent the social class of the groups. We see that married (left) and divorced+widowed (right) groups are clearly separated in the embedding, indicating that marital status influences the preference of candidates. However, we see that the social classes are less influential. In Fig. 10 we represent the voting characteristics of the top 4 right singular vectors {ˆvj}4 j=1 of bΘ, which has a rank of 4 when all the data is used for training. The each stacked bar corresponds to the singular value σj marked on the x-axis. Partition of a bar represents the choice model distribution of the corresponding singular vector: exp(vj)/(1T exp(vj)), 5. Social classes are F: farmer, AB: middle class. C1: lower middle class, C2: skilled working class, and DE: other working class. Negahban, Oh, Thekumparampil, and Xu Figure 9: Irish Election: Mean loglikelihood on the test data versus fraction of the data used for training. Nuclear norm minimization and Plackett-Luce model estimator tie for the best performance. 6.49 1.2 0.55 0.36 Singular Values Rank Distribution Figure 10: Irish Election: Each bar corresponds to rank distribution of one of the singular 4 values (x-axis) of the bΘ. Heights of the partitions represent the probability with which the distribution ranks the corresponding candidate as first (Section 4.4.5). where exp(vj) is the element-wise exponentiation operator. We see that there are 2 majors voting basis distributions; one favoring Mc Aleese and another favoring Roche. Similar voting blocs have been observed earlier Gormley and Murphy (2009). Even though Placket-Luce model estimator achieves the similar likelihood as our nuclear norm regularized algorithm, the latter helps us in identifying voter basis , solely from rank data without using the side-information on the voters as in Gormley and Murphy (2009). single married divorced (a) Martial status F AB C1 C2 DE (b) Social class Figure 11: Irish Election: t-SNE embedding of rows of the estimated parameter matrix bΘ. The markers correspond to the marital or social status (F: farmer, AB: middle class. C1: lower middle class, C2: skilled working class, DE: other working class) of the rows. Learning from Comparisons and Choices 5. Learning the MNL Model from Choices Choice modeling has had widespread success in numerous application domains such as transportation and marketing Train (1986); Guadagni and Little (1983). Choice models stem from revenue management to tackle the fundamental problem of maximizing expected revenue where the expectation is taken over a probabilistic choice model that is learned from historical purchase data. Revenue management has focused on designing efficient solvers for the optimization problem with exact or approximation guarantees, and has less to do with learning the parameters of probabilistic choice model of interest. In this section, we tackle this unexplored domain of learning choice models from samples with provable guarantees on the sample complexity. In particular, we study learning the MNL model from choices. We study two types of choices under the MNL model that together include all practical scenarios of interest: bundled choice and consumer choice. Bundled choice. We consider a novel scenario of significant practical interest: choice modeling from bundled purchase history. In this setting, we assume that we have bundled purchase history data from n users. Precisely, there are two categories of interest with d1 and d2 alternatives in each category respectively. For example, there are d1 tooth pastes to choose from and d2 tooth brushes to choose from. For the i-th user, a subset Si [d1] of alternatives from the first category is presented along with a subset Ti [d2] of alternatives from the second category. We use k1 and k2 to denote the number of alternatives presented to a single user, i.e. k1 = |Si| and k2 = |Ti|, and we assume that the number of alternatives presented to each user is fixed, to simplify notations. However, the analysis naturally generalizes if the number differs from a user to another user. Given these sets of alternatives, each user makes a bundled purchase, of an item from Si and another item from Ti together, and we use (ui, vi) to denote these bundled pair of alternatives (e.g. a tooth brush and a tooth paste) purchased by the i-th user. Each user makes a choice of the best alternative, independent of other users s choices, according to the MNL model as P {(ui, vi) = (j1, j2)} = eΘ j1,j2 P j 1 Si,j 2 Ti e Θ j 1,j 2 , (47) for all j1 Si and j2 Ti. We emphasize here that the preference matrix is indexed by items of type one (in the rows) and items of type two (in the columns). We are taking the existing standard MNL model over user-item pairs to propose a novel choice model for bundled purchases over two types of items. One could go beyond paired bundled choices and include the user identity as another dimension, or add other types of items and consider higher order bundled purchases. This would require MNL model over higher order tensors, which is outside the scope of this paper, but are interesting generalizations. The main challenge in learning such tensor MNL models is that nuclear norm of a higher order tensor is not a computable quantity and hence minimizing the nuclear norm is not algorithmically feasible Yuan and Zhang (2014). Efficient methods exist based on alternating minimizations, but existing analysis tools can handle only quadratic losses Jain and Oh (2014). The distribution (47) is independent of shifting all the values of Θ by a constant. Hence, there is an equivalent class of Θ that gives the same distribution for the choices: [Θ ] {A Rd1 d2 | A = Θ + c11T for some c R} . Since we can only estimate Θ up to Negahban, Oh, Thekumparampil, and Xu this equivalent class, we search for the one that sums to zero, i.e. P j1 [d1],j2 [d2] Θ j1,j2 = 0. Let α = maxj1,j 1 [d1],j2,j 2 [d2] |Θ j1,j2 Θ j 1,j 2|, denote the dynamic range of the underlying Θ , such that when k1 k2 alternatives are presented, we always have 1 k1k2 e α P {(ui, vi) = (j1, j2)} 1 k1k2 eα , (48) for all (j1, j2) Si Ti and for all Si [d1] and Ti [d2] such that |Si| = k1 and |Ti| = k2. We do not make any assumptions on α other than that α = O(1) with respect to d1 and d2. Assuming Θ is well approximated by a low-rank matrix, we solve the following convex relaxation, given the observed bundled purchase history {(ui, vi, Si, Ti)}i [n]: bΘ arg min Θ Ωα L(Θ) + λ|||Θ|||nuc , (49) where the negative log likelihood function according to (47) is Θ, euie T vi log j1 Si,j2 Ti exp Θ, ej1e T j2 Ωα n A Rd1 d2 |||A||| α, and X j1 [d1],j2 [d2] Aj1,j2 = 0 o . (51) Compared to collaborative ranking, (a) rows and columns of Θ correspond to an alternative from the first and second category, respectively; (b) each sample corresponds to the purchase choice of a user which follow the MNL model with Θ ; (c) each person is presented subsets Si and Ti of items from each category; (d) each sampled data represents the most preferred bundled pair of alternatives. Customer choice. The standard customer choice can be thought of as either a special case of bundled choice or as a special case of higher order comparisons. We consider the standard customer choice data from purchase history. In this setting, we assume that we have purchase history data from d1 users over d2 alternatives. The i-th sample is i.i.d. with user ui chosen uniformly at random and a subset Si [d2] of alternatives of size k. We fix k in order to be efficient in the notations and any variable size offerings can be handled seamlessly. We assume Si is chosen uniformly at random with replacement, in a similar way as bundled choice and higher order comparisons. Given these sets of alternatives, the user ui makes a choice and we use vi to denote the purchased alternative by the i-th (sampled) user. Each user makes a choice of the best alternative, independent of other users s choices, according to the MNL model as P {vi = j2|ui = j1} = eΘ j1,j2 P j 2 Si e Θ j1,j 2 , (52) for all j2 Si. Up to the fact that we index rows by users and not items of one category, this is a special case of the bundled choice model where we fix k1 = 1. Mathematically, all of our results under consumer choices are derived as corollaries from our results under bundled choices, but given the prevalent interest in customer choice models, we emphasize the implications of our framework under customer choice models in a separate section (see Section 5.2). Learning from Comparisons and Choices 5.1 Learning the MNL Model from Bundled Choices We provide an upper bound on the error achieved by our convex relaxation, when the multiset of alternatives Si from the first category and Ti from the second category are drawn uniformly at random with replacement from [d1] and [d2] respectively. Precisely, for given k1 and k2, we let Si = {j(i) 1,1, . . . , j(i) 1,k1} and Ti = {j(i) 2,1, . . . , j(i) 2,k2}, where j(i) 1,ℓ s and j(i) 2,ℓ s are independently drawn uniformly at random over the d1 and d2 alternatives, respectively. Similar to the previous section, this sampling with replacement is necessary for the analysis. Define e2α max{d1, d2} log d n d1 d2 . (53) Theorem 13 Under the described sampling model, assume 16e2α min{d1, d2} log d n and n min{d5, k1k2 max{d2 1, d2 2}} log d, and λ [8λ0, c1λ0] with any constant c1 = O(1) larger than max{8, 128/ p min{k1, k2}}. Then, solving the optimization (49) achieves 2 e2αc1λ r bΘ Θ F + 48e2αc1λ min{d1,d2} X j=r+1 σj(Θ ) , (54) for any r {1, . . . , min{d1, d2}} with probability at least 1 2d 3 where d = (d1 + d2)/2. A proof is provided in Appendix G. Optimizing over r gives the following corollaries. Corollary 14 (Exact low-rank matrices) Suppose Θ has rank at most r. Under the hypotheses of Theorem 13, solving the optimization (49) with the choice of the regularization parameter λ [8λ0, c1λ0] achieves with probability at least 1 2d 3, r(d1 + d2) log d This corollary shows that the number of samples n needs to scale as O(r(d1 + d2) log d) in order to achieve an arbitrarily small error. This is only a logarithmic factor larger than the number of degrees of freedom. For approximately low-rank matrices in an ℓ1-ball as defined in (13), we show an upper bound on the error, whose error exponent reduces from one to (2 q)/2. Corollary 15 (Approximately low-rank matrices) Suppose Θ Bq(ρq) for some q (0, 1] and ρq > 0. Under the hypotheses of Theorem 13, solving the optimization (49) with the choice of the regularization parameter λ [8λ0, c1λ0] achieves with probability at least 1 2d 3, bΘ Θ F 2 ρq d1d2 d1d2(d1 + d2) log d Negahban, Oh, Thekumparampil, and Xu This follows from the same line of proof as in the proof of Corollary 9 in Appendix D. We next, provide a fundamental lower bound on the error, that matches the upper bound up to a logarithmic factor. Theorem 16 Suppose Θ has rank r. Under the described sampling model, there is a universal constant c > 0 such that that the minimax rate where the infimum is taken over all measurable functions over the observed purchase history {(ui, vi, Si, Ti)}i [n] is lower bounded by inf bΘ sup Θ Ωα E h 1 d1d2 e 5α r (d1 + d2) n , α(d1 + d2) d1d2 log d We provide a proof in Appendix H. The first term is dominant, and when the sample size is comparable to the latent dimension of the problem, Theorem 13 is minimax optimal up to a logarithmic factor. We emphasize here that the bound in (55) and the matching lower bound in (57) do not depend on the size of the offerings k1 and k2. It is independent of how large k1 and k2 are because, we only observe one choice, and intuitively the information we get scales at best by a factor of log(k1k2). The theorems prove that there is no essential gain in learning from large offerings. One might be tempted to stop at proving an upper bound that scales as O( p k1k2r(d1 + d2) log d/n), which is larger than (55) by a factor of k1k2. Such a loose bound follows if one ignores the tight concentration analysis that we do using the symmetrization technique (e.g. in Lemma 29). Getting the tight dependency in k1 and k2 is one of the crucial technical challenges we overcome in this paper. 5.2 Learning the MNL Model from Customer Choices The results for the customer choice model follow immediately from the results in bundled choice model by simply setting k1 = 1, and we explicitly write those corollaries in this section for completeness. The proposed estimator is minimax optimal up to a logarithmic factor under the standard customer choice model of sampling. Corollary 17 Under the described sampling model, assume 16e2α min{d1, d2} log d n min{d5, k max{d2 1, d2 2}} log d, and λ [8λ0, c1λ0] with any constant c1 = O(1) larger than 128. Then, solving the optimization (49) achieves 2 e2αc1λ r bΘ Θ F + 48e2αc1λ min{d1,d2} X j=r+1 σj(Θ ) , (58) for any r {1, . . . , min{d1, d2}} with probability at least 1 2d 3 where d = (d1 + d2)/2. Corollary 18 (Exact low-rank matrices) Suppose Θ has rank at most r. Under the hypotheses of Theorem 17, solving the optimization (49) with the choice of the regularization parameter λ [8λ0, c1λ0] achieves with probability at least 1 2d 3, r(d1 + d2) log d Learning from Comparisons and Choices Corollary 19 (Approximately low-rank matrices) Suppose Θ Bq(ρq) for some q (0, 1] and ρq > 0. Under the hypotheses of Theorem 17, solving the optimization (49) with the choice of the regularization parameter λ [8λ, c1λ] achieves with probability at least 1 2d 3, bΘ Θ F 2 ρq d1d2 d1d2(d1 + d2) log d We emphasize again that the bound in (59) does not depend on the size of the offerings k. It is significantly easier to stop at proving an upper bounds that scale as O( p kr(d1 + d2) log d/n), which are larger than (59) by a factor of k. Such a loose bound follows if one ignores the tight concentration analysis that we do using the symmetrization technique (e.g. in Lemma 29). Getting the tight dependency in k is one of the crucial technical challenges we overcome in this paper. 5.3 Experiments We applied our algorithm to a real world choice data set. The implementation is similar to that of higher order comparisons (see Section 4.4.1). 5.3.1 Real data: Extended Bakery Extended Bakery data set 6 Benson et al. (2018) consists of details of 75,000 purchases at a bakery from a selection of 50 items, specifically each purchase is recorded by the set of items bought together. We use only the 13,579 purchases where a pair of items where bought. We divide the items into five categories: cakes (1-10), tarts (11-21), cookies (22-30), pastries (31-40) and drinks (41-50). We study four cases of bundled pairs, cakes and drinks, tarts and drinks, cookies and drinks, and pastries and drinks. These cases have 1503, 910, 500 and 1791 purchases in them respectively. We model the data with an MNL model parameterized by a matrix Θ , such that rows and columns corresponds to first and second categories respectively. For every purchase we assume that the subset of alternatives presented is the universal choice set, that is k1 = d1 and k2 = d2, so that the purchase of item j1 from category 1 and j1 from category 2 has a probability of exp(Θ j1,j2)/Pd1,d2 j 1,j 2=1 exp(Θ j 1,j 2), where d1, d2 are number of items in category 1 and 2 respectively. We also fit the separable model proposed in Benson et al. (2018), which is a simpler model with Θ j,j2 = a j1 + b j2, where a Rd1, b Rd2. In Fig. 12 we plot the mean log-likelihood on the test data versus the fraction of the data used for our nuclear norm minimization based algorithm ( nucnorm ), un-regularized log-likelihood maximization algorithm ( fullrank ), and maximum likelihood estimator for the separable model ( separable ) over 10 trials. We see that nuclear norm minimization outperforms the fullrank and separable algorithms. This is consistent with the marketing practice, of providing different prices for bundled combinations of products, which uses the rationale that, worth of a bundle of products might be different from the sum of the worths 6. Data set is from https://github.com/arbenson/discrete-subset-choice/tree/master/data. Negahban, Oh, Thekumparampil, and Xu (a) Cakes and Drinks (b) Tarts and Drinks (c) Cookies and Drinks (d) Pastries and Drinks Figure 12: Bakery: Mean log-likelihood on test data versus fraction of data used for training. Nuclear norm minimization improves upon the un-regularized likelihood maximization and separable model, for most of the regimes we consider. of the individual products constituting the bundle. We also note that the un-regularized algorithm ( fullrank ) has the worst performance and high variance and separable model fits the samples almost as good as the nuclear norm minimization in the case of cookies and drinks in the large training data regime. 6. Conclusion The sample complexity of learning one of the most popular choice models known as Multi Nomial Logit model has not been addressed in the literature. The main challenge is in the inherent low-rank structure of the parameter to be learned, which leads to a non-convex likelihood maximized problem. Thanks to recent advances in learning low-rank matrices, in particular in 1-bit matrix completion Davenport et al. (2014), matrix completion Negahban and Wainwright (2012), and restricted strong convexity Negahban et al. (2009), we have a polynomial time algorithm and the technical tools to characterize the fundamental sample complexity of learning MNL from samples. This provides a novel algorithm to learn a Learning from Comparisons and Choices low-dimensional representation of users and items from users historical comparisons and choices. We study three types of data, pairwise comparison, higher order comparison, and choices, and take the first principle approach of identifying the fundamental limits and also developing efficient algorithms matching those fundamental trade offs. We provide a unifying framework to learn the latent preferences by solving a convex program. For each of the data types, accompanied by natural sampling scenarios, we show that our framework achieves a minimax optimal performance, and hence cannot be improved upon other than a small logarithmic factor. This opens a new door to learn representations from comparisons and choices, and we propose new research directions and challenges below. Beyond the low-rank model studied in this paper, recent advances in modeling data in a matrix form such as low algebraic dimension by Ongie et al. (2018) and non-parametric approximation by Borgs et al. (2017) can provide new research directions for modeling choice. Efficient implementations via non-convex optimization. Nuclear norm minimization, while polynomial-time, is still slow. We want first-order methods that are efficient with provable guarantees. Two main challenges are providing a good initialization to start such non-convex approaches and analyzing gradient descent on the likelihood maximization which is non-convex. Recent advances in non-convex optimization with rank-constraints have developed via a sequence of innovations that can be summarized as follows, in a number of example problems including matrix completion, robust PCA, matrix sensing, phase retrieval. First, a convex relaxation of nuclear norm minimization is analyzed, e.g. Cand es and Recht (2009). Then, a more efficient two-step non-convex optimization approach is proposed with provable guarantees where a global initialization step is followed by a first-order method e.g. Keshavan et al. (2010a,b). Next, first-order methods starting at any initialization point is analyzed via understanding the geometry and checking the stationary points of the objective function e.g. Ge et al. (2016). This recipe, spurred by the advances in the matrix completion problem, has been repeated for several interesting problems involving low rank matrices, over the last decade and over numerous publications by collective effort of the machine learning community. For the problem of learning MNL, we are at the first stage of this progression where we propose a convex relaxation and provide minimax optimal guarantees. We currently do not have the analysis tools to follow up in analyzing an efficient non-convex optimization problem, although writing the algorithm and implementing is straight forward, and also has been proposed in Park et al. (2015). It is a promising research direction to overcome the challenges in analyzing non-convex optimization methods for the MNL likelihood objective function. Assumption on sampling with replacement. As mentioned earlier, we assume sampling with replacement, where we can ask a user to compare the same pair more than once, and also we can ask a user to compare two copies of the identical item. Although such sampling with replacement does not happen in practice, the number of such collisions is also very low with high probability under the proposed model. Further, such assumption is critical for getting an upper bound that is tight not only in r, d1 and d2, but also in k for higher order comparisons and choices. If, instead, one is interested in sampling without Negahban, Oh, Thekumparampil, and Xu replacement, then one either can resort to proving a loose bound that is weaker in its dependence in k (and follows trivially as a corollary of the proof of our results) or needs to invent new innovative concentration bounds that do not rely on the powerful symmetrization. The first option is trivial, so we do not provide such corollaries in this paper, and the second option provides an interesting but technically challenging question of resolving between sampling with replacement and sampling without replacement. This we believe is outside the scope of this paper. Modern data analysis applications. As learning representation from ordinal data is of fundamental interest, there are numerous exciting applications that both the algorithmic framework and also the analysis techniques we develop could be naturally extended to. We present two such examples. First is a recent application of embedding objects with crowdsourced similarity measures, first proposed in Tamuz et al. (2011). Consider a crowdsourcing setting where you have d images and want to learn similarities among those images such that one can embed those images in a lower dimensional Euclidean space. One can show to a person a triplet of images (i1, i2, i3) and ask whether the image in the middle is more similar to the one in the left or the right. A natural model proposed in Tamuz et al. (2011) is to assume that there exists a similarity parameter matrix Θ Rd d such that P {i1 is more similar to i2 than i3 is to i2} = eΘi2,i1 eΘi2,i1 + eΘi2,i3 . A heuristic algorithm is proposed to learn a low-rank Θ without guarantees. Given the similarity of this model to MNL in (1), both our algorithm and also the analysis will go through to provide a tight characterization of the sample complexity of this problem. The second application is in word embedding Mikolov et al. (2013b), where the goal is to find embeddings for English words in a lower dimensional Euclidean space. The most successful word embedding has been based on fitting a low-rank matrix Θ Rd d where d is the size of the vocabulary, over an MNL-type model: P {word i and word j appear within distance ten|word j appear in a sentence} = eΘij P i eΘi j . As the denominator involves summation over millions of words in the vocabulary, efficient heuristics are proposed to learn such a model from skip-grams; a skip-gram is the count matrix counting how many times words co-appear in the same sentence within a predefined distance. There are several challenges in applying our framework directly to such a setting mainly due to the size of the problem, but nevertheless our analysis can be applied directly to identify the fundamental minimax sample complexity of learning a word embedding from skip-grams. Acknowledgments SN acknowledges support from NSF Grant DMS 1723128. SO and KT acknowledge support from NSF grants CCF 1553452, CNS 1527754, CCF 1705007, and RI 1815535. The authors acknowledge Yu Lu and Ruoyu Sun for helpful and fruitful discussions. Learning from Comparisons and Choices Appendix A. Proof of the Upper Bound for Graph Sampling Theorem 1 The proof of the theorem relies on the following two lemmas. First lemma shows that the negative of the log-likelihood satisfies Restricted Strong Convexity with high probability. Lemma 20 (Restricted Strong Convexity) Let R = max q n , σmin(L) 1/2 log(2d) and the set A(α) = Θ Rd1 d2, |||Θ||| α, |||Θ|||L-nuc |||ΘL1/2||| 2 F 16αd1R . Then we have, i=1 ( Θ, Xi )2 1 3d1 |||Θ|||2 L, Θ A(α), (61) with probability at least 1 2(2d) 4, provided that n log (2d) min{22(d1σmin(L) 1)2/3, 26d2 1σ2}, . Here the upper bound on n may not be necessary, but it is present due to a technical difficulty in using the peeling argument. The intuition behind the above lemma is that the empirical average uniformly concentrates around its expectation. Proof is in Section A.1. The next lemma says that the gradient of the log-likelihood at the actual parameter matrix, Θ is controllably small. Lemma 21 (Bounded Gradient) Let R = max q n , σmin(L) 1/2 log(2d) spectral norm of gradient of the log-likelihood at the actual parameter matrix, L(Θ ), can be upper-bounded with high probability as follows, P n L(Θ )L 1/2 2 32R o 1 (d1 + d2)3 (62) Proof the above lemma is in Section A.4. Let = ˆΘ Θ . Case 1: / A(2α) Then, ||| |||2 L 32αd1R||| |||L-nuc Case 2: A(2α) We first write down the second order Taylor series expansion of L(ˆΘ) at around Θ = Θ . L(ˆΘ) = L(Θ ) + L(Θ ), + 1 Θ , X(i) + s , X(i) where ψ(x) = ex/(1+ex)2, x [ 2α, 2α] and s [0, 1]. Next using Lemma 20 and the fact that ψ(x) attains minimum at x = 2α we get, L(ˆΘ) + L(Θ ) + L(Θ ), 1 i=1 ψ(2α) , X(i) 2 ψ(2α) 6d1 ||| |||2 L, (64) Negahban, Oh, Thekumparampil, and Xu with probability at least 1 1/(d1+d2)3. Since ˆΘ is the minimizer for the objective function 9, we have, L(ˆΘ) + λ ˆΘ L-nuc L(Θ ) + λ|||Θ |||L-nuc, which in turn gives us, 6d1 ||| |||2 L L(ˆΘ) + L(Θ ) + L(Θ ), λ |||Θ |||L-nuc ˆΘ L-nuc λ (||| |||L-nuc) + L(Θ )L 1/2, L1/2 λ (||| |||L-nuc) + L(Θ )L 1/2 2 ||| |||L-nuc, (66) where last two inequalities follow from the triangle inequality for nuclear norm and generalized H older s inequality. Now we put λ = 2 32R and use Lemma 21 to get, ||| |||2 L 6d1 ψ(2α) ||| |||L-nuc 9d1λ ψ(2α)||| |||L-nuc, (67) with probability at least 1 1/(d1 + d2)3. Combining Case 1 and 2 we get, ||| |||2 L 9 α + 1 ψ(2α) d1λ||| |||L-nuc Lemma 22 If λ 2||| L(Θ )|||2, then we have ||| |||L-nuc 4 2r||| |||L + 4 min{d1,d2 G} X j=r+1 σj(Θ L1/2) , (68) for all r [min{d1, d2 G}]. (Proof in Section A.5) Finally, utilizing the above lemma, we get, 1 d1 ||| |||2 L 36λ α + 1 ψ(2α) 2r||| |||L + min{d1,d2 G} X j=r+1 σj(Θ L1/2) A.1 Proof of Lemma 20 2 1 3d1 |||Θ|||2 L, Θ A Θ A, such that 1 2 < 1 3d1 |||Θ|||2 L Learning from Comparisons and Choices |||Θ|||2 L 16αd1R|||Θ|||L-nuc 16αd1R|||Θ|||L = |||Θ|||L 16αd1R := µ , (70) where the second inequality follows from |||Θ|||L-nuc = Θ L1/2 nuc Θ L1/2 F = |||Θ|||L. Lemma 23 Let B(D) := n Θ Rd1 d2||||Θ||| α, |||Θ|||L D, |||Θ|||L-nuc D2 16αd1R o , and, ZD := sup Θ B(D) d1 |||Θ|||2 L , then, P ZD 3 2d1 D2 exp n D4 Above lemma is proved in Section A.2. Let β = q 9 , then the sets, Sℓ= Θ Rd1 d2||||Θ||| α, βℓ 1µ |||Θ|||L βℓµ, |||Θ|||L-nuc (βℓµ)2 , ℓ= 1, 2, 3, . . . , cover the set A, that is A ℓ=1Sℓand Sℓ B(βℓµ). This gives us, 2 < 1 3d1 |||Θ|||2 L 2 < 1 3d1 |||Θ|||2 L Θ B(βℓµ) s.t. 1 2 < 1 3d1 |||Θ|||2 L If there exists a Θ B(βℓµ) such that 1 2 < 1 3d1 |||Θ|||2 L then, d1 |||Θ|||2 L > 5 3d1 |||Θ|||2 L 5 3d1 β2ℓ 2µ2 = 3 2d1 (βℓµ)2, which gives us, 2 < 1 3d1 |||Θ|||2 L ℓ=1 P Zβℓµ > 3 2d1 (βℓµ)2 ℓ=1 exp n(βℓµ)4 ℓ=1 exp 4ℓ(β 1)nµ4 (c) 2 exp 4(β 1)nµ4 Negahban, Oh, Thekumparampil, and Xu where (a) is from Lemma 23, (b) is true since β4ℓ 4ℓ(β 1) when β 1 and (c) is obtained by summing the geometric series, with common ratio less than 1/2, in previous inequality. Finally if we assume that n 26d2 1σ2 log (2d) and n 22(d1σmin(L) 1)2/3 log (2d), then we have 22 log(2d) 4(β 1)nµ4/32α4d2 1 as follows 32α4d2 1 = 4(β 1)n(16αd1R)4 32α4d2 1 = 213(β 1)d2 1 max σ2 log2(2d) n , σmin(L) 2 log4(2d) 22 log(2d) (75) A.2 Proof of Lemma 23 Notice that the 2 d1 |||Θ|||2 L is the mean of 1 n Pn i=1 Θ, X(i) 2, Θ, X(i) 2 # k,l [d2] (Θj,k Θj,l)2Pk,l k,l Θj,kΘj,l Pk,l ΘjΘT j , diag(Pk) 2 ΘjΘT j , P ΘjΘT j , L = 2 where, in (a) Pk = P l [d2] Pk,l and Θj is the j-th row of Θ. Therefore we use the following standard technique to get a handle on supremum of deviation from mean. First, we use bounded differences property to prove that ZD concentrates around its mean. We write ZD(X(1), . . . , X(n)) to represent ZD as a function of n independent random variables. Now, let X(i) and X(i) be two realization of the i-th (1 i n) random parameter of ZD, then, ZD(X(1), . . . , X(i), . . . , X(n)) ZD(X(1), . . . , X(i), . . . , X(n)) Θ, X(i) 2 + 2 d1 |||Θ|||2 L Θ , X(i) 2 + Θ , X(i ) 2 Learning from Comparisons and Choices Now WLOG assume that ZD(X(1), . . . , X(i), . . . , X(n)) ZD(X(1), . . . , X(i), . . . , X(n)) and the first supremum is achieved at Θ, which gives us. = sup Θ B(D) Θ, X(i) 2 + 2 d1 |||Θ|||2 L Θ , X(i) 2 + Θ , X(i ) 2 Θ, X(i) 2 + 2 Θ, X(i) 2 + Θ, X(i ) 2 Θ, X(i) 2 Θ, X(i) 2 where the last inequality is true since, |||Θ||| α for any Θ B(D) Ωα. Now we upper bound E [ZD] as follows, E [ZD] (a) 2E i=1 εi Θ, X(i) 2 # i=1 εi ΘL1/2, X(i)L 1/2 sup Θ B(D) |||Θ|||L-nuc i=1 εi X(i)L 1/2 2 4α sup Θ B(D) |||Θ|||L-nuc E i=1 εi X(i)L 1/2 2 where (a) is standard symmetrization argument using i.i.d. Rademacher variables {εi}n i=1 and since | Θ, X(i) | 2α we use Ledoux-Talagrand contraction inequality Ledoux and Talagrand (2013) to obtain (b). Lemma 24 Let R = max q n , σmin(L) 1/2 log(2d) . For {X(i)}n i=1 as defined in the graph sampling and for a binary random variable εi such that E εi|X(i) = 0 and |εi| 1, we have, i=1 εi X(i)L 1/2 2 1 (d1 + d2)3 , and, E i=1 εi X(i)L 1/2 2 Negahban, Oh, Thekumparampil, and Xu Proof of the lemma is in Section A.3. Now using Lemma 24 we have E [ZD] 16Rα supΘ B(D) |||Θ|||L-nuc D2 d1 . Then using the bounded differences property and the upper bound on the mean, we get the Mc Diarmid s concentration, P ZD D2/d1 t P {ZD E [ZD] t} and putting t = D2/2d1 gives us the theorem. A.3 Proof of Lemma 24 Let Wi := 1 nεi X(i)L 1/2 = 1 nεiej(i) ek(i) el(i) T L 1/2 and pseudo-inverse of L be L , then, Wi 2 σmin(L) 1/2 E Wi W T i = E 1 n2 ej(i) ek(i) el(i) T L 1/2L 1/2 ek(i) el(i) e T j(i) n2 ej(i)e T j(i) E h ek(i) el(i) T L ek(i) el(i) i = 1 n2d1 Id1 d1 2 E h e T k(i)L ek(i) i E h e T k(i)L el(i) i u [d1] Pu L u,u X u,v [d1] Pu,v L u,v L, L Id1 d1 n2d1 Id1 d1, (80) E W T i Wi = L 1/2E 1 n2 ek(i) el(i) ek(i) el(i) T L 1/2 u,v=1 (eu ev) (eu ev)T Pu,v n2 L 1/2 (2L) L 1/2 n2 (Id2 d2 X k [G] gkg T k /|||gk|||2 2) , and, (81) i=1 Wi W T i i=1 W T i Wi i=1 max E Wi W T i 2 , E W T i Wi 2 Learning from Comparisons and Choices where σ = max n d2 G d1 , 1 o . Now by Matrix Bernstein concentration theorem Tropp (2015) we have, i=1 εi X(i)L 1/2 2 t exp nt2/2 2σ + 2σmin(L) 1/2t/3 , and, (84) i=1 εi X(i)L 1/2 2 4σ log(d1 + d2) 3n log(d1 + d2) . (85) Choosing t = max q 24σ log(d1+d2) 2σmin(L) 1 log(d1+d2) produces the desired result. A.4 Proof of Lemma 21 The gradient can be written down as, yi exp( Θ , X(i) ) 1 + exp( Θ , X(i) ) Then Lemma 24 directly gives the result because, yi exp( Θ , X(i) ) 1 + exp( Θ , X(i) ) yi exp( Θ , X(i) ) 1 + exp( Θ , X(i) ) A.5 Proof of Lemma 22 Denote the singular value decomposition of Θ L1/2 by Θ L1/2 = UΣV T , where U Rd1 d1 and V Rd2 d2 are orthogonal matrices. For a given r [min{d1, d2 G}], let Ur = [u1, . . . , ur] and Vr = [v1, . . . , vr], where ui Rd1 1 and vi Rd2 1 are the left and right singular vectors corresponding to the i-th largest singular value, respectively. Define T to be the subspace spanned by all matrices in Rd1 d2 of the form Ur AT or BV T r for any A Rd2 r or B Rd1 r, respectively. The orthogonal projection of any matrix M Rd1 d2 onto the space T is given by PT (M) = Ur UT r M + MVr V T r Ur UT r MVr V T r . The projection of M onto the complement space T is PT (M) = (I Ur UT r )M(I Vr V T r ). The subspace T and the respective projections onto T and T play crucial a role in the analysis of nuclear norm minimization, since they define the sub-gradient of the nuclear norm at Θ . We refer to Cand es and Recht (2009) for more detailed treatment of this topic. Let = PT ( L1/2) and = PT ( L1/2). Notice that PT (Θ L1/2) = UrΣr V T r , where Σr Rr r is the diagonal matrix formed by the top r singular values. Since PT (Θ L1/2) and have row and column spaces that are orthogonal, it follows from Lemma 2.3 in Recht et al. (2010) that PT (Θ L1/2) nuc = PT (Θ L1/2) nuc + nuc . Negahban, Oh, Thekumparampil, and Xu Hence, in view of the triangle inequality, bΘL1/2 nuc = PT (Θ L1/2) + PT (Θ L1/2) nuc PT (Θ L1/2) nuc PT (Θ L1/2) nuc = PT (Θ L1/2) nuc + nuc PT (Θ L1/2) nuc PT (Θ L1/2) nuc + nuc PT (Θ L1/2) nuc nuc = Θ L1/2 nuc + nuc 2 PT (Θ L1/2) nuc nuc. (87) Because bΘ is an optimal solution, we have λ bΘL1/2 nuc Θ L1/2 nuc L(Θ ) + L(bΘ) (a) L1/2, L(Θ )L 1/2 (b) ||| |||L-nuc L(Θ )L 1/2 2 λ 2 ||| |||L-nuc, (88) where (a) holds due to the convexity of L; (b) follows from the Cauchy-Schwarz inequality; the last inequality holds due to the assumption that λ 2||| L(Θ )|||2. Combining (87) and (88) yields 2 nuc 2 PT (Θ L1/2) nuc nuc ||| |||L-nuc nuc + nuc. Thus ||| |||nuc 3||| |||L-nuc + 4 PT (Θ L1/2) nuc. By triangle inequality, ||| |||L-nuc 4 nuc + 4 PT (Θ L1/2) nuc . Notice that = Ur UT r L1/2 + (I Ur UT r ) L1/2Vr V T r . Both Ur UT r L1/2 and (I Ur UT r ) L1/2Vr V T r have rank at most r. Thus has rank at most 2r. Hence, ||| |||nuc 2r||| |||L. Then the theorem follows because PT (Θ L1/2) nuc = Pmin{d1,d2} j=r+1 σj(Θ L1/2). Appendix B. Proof of the Information-theoretic Graph Sampling Lower Bound, Theorem 4 The proof uses Fano Inequality based packing set argument to get an lower bound on the error of any (measurable) estimator. We will construct a packing set in Ωα with a minimum distance of δ between any pair of elements in the packing. Let {Θ(1), Θ(2), . . . , Θ(M)} be a set of M matrices within the set Ωα, satisfying Θ(ℓ1) Θ(ℓ1) L δ for all ℓ1, ℓ2 [M]. Now, Θ(N) is uniformly drawn from this set and then the comparison results (according to MNL model) of n randomly chosen pairs of items, Learning from Comparisons and Choices each drawn according to the probability matrix P and each compared by uniformly chosen user. Let b N be the best estimator of N from the observations. Then we can show that, sup Θ Ωα P bΘ Θ 2 P n b N = N o , (89) Now we have converted the problem of finding the minimum estimation error, into finding the minimum probability error of a M-ary hypothesis testing problem. If we can prove that the above RHS is lower bounded by 1/2, we are done. The generalized Fanos inequality along with data processing inequality gives us, P n b N = N o 1 E[ I( b N; N) ] + log 2 M 2 1 P ℓ1,ℓ2 [M] DKL(Θ(ℓ1) Θ(ℓ2)) + log 2 log M , (91) where DKL(Θ(ℓ1) Θ(ℓ2)) denotes the expected Kullback-Leibler divergence between the probability distributions of the comparison results of the observed nd1 pairs, for N = ℓ1 and N = ℓ2. The expectation is taken over different choices for the selected pairs for comparison. DKL(Θ(ℓ1) Θ(ℓ2)) = n i [d1] {j,j } [d2] " eΘ(ℓ1) ij eΘ(ℓ1) ij + eΘ(ℓ1) ij log eΘ(ℓ1) ij , eΘ(ℓ1) ij + eΘ(ℓ1) ij eΘ(ℓ2) ij , eΘ(ℓ2) ij + eΘ(ℓ2) ij + eΘ(ℓ1) ij eΘ(ℓ1) ij + eΘ(ℓ1) ij log eΘ(ℓ1) ij , eΘ(ℓ1) ij + eΘ(ℓ1) ij eΘ(ℓ2) ij , eΘ(ℓ2) ij + eΘ(ℓ2) ij where n is the number of pairs of items selected and compared by one random user each, Pj,j is half the probability with which item pair {j, j } is selected and the observation probabilities come from the standard MNL model. Let xijj eΘ(ℓ1) ij /(eΘ(ℓ1) ij + eΘ(ℓ1) ij ) and yijj eΘ(ℓ2) ij /(eΘ(ℓ2) ij + eΘ(ℓ2) ij ). Negahban, Oh, Thekumparampil, and Xu DKL(Θ(ℓ1) Θ(ℓ2)) (a) = n X {j,j } [d2] 2Pj,j xijj log xijj yijj + (1 xijj ) log 1 xijj {j,j } [d2] 2Pj,j xijj xijj yijj yijj + (1 xijj )yijj xijj {j,j } [d2] (xijj yijj )Pj,j (xijj yijj ) yijj (1 yijj ) (96) (b) 8ne2α X {j,j } [d2] (xijj yijj )Pj,j (xijj yijj ), (97) where (a) is due to the fact that log(x/y) (x y)/y for x/y 0 and (b) is true because |Θ(ℓ2) ij | α implies, yijj = eΘ(ℓ2) ij /(eΘ(ℓ2) ij + eΘ(ℓ2) ij ) e 2α/2 which in turn implies, yijj (1 yijj ) e 2α(2 e 2α)/4 e 2α/4. Let f(z) = 1/(1 + e z), a 1-Lipschitz function, it can be seen that (xijj yijj )2 = (f(Θ(ℓ1) ij Θ(ℓ1) ij ) f(Θ(ℓ2) ij Θ(ℓ2) ij ))2 ((Θ(ℓ1) ij Θ(ℓ1) ij ) (Θ(ℓ2) ij Θ(ℓ2) ij ))2. This gives us, DKL(Θ(ℓ1) Θ(ℓ2)) 8ne2α {j,j } [d2] Pu,v((Θ(ℓ1) ij Θ(ℓ2) ij ) (Θ(ℓ1) ij Θ(ℓ2) ij ))2, (98) i [d1] (Θ(ℓ1) Θ(ℓ2))i L(Θ(ℓ1) Θ(ℓ2))i, (99) i [d1] (Θ(ℓ1) Θ(ℓ2))i L(Θ(ℓ1) Θ(ℓ2))i, (100) (Θ(ℓ1) Θ(ℓ2))L1/2 2 Θ(ℓ1) Θ(ℓ2) 2 where (a) is due to the fact that L = diag(Pu) P is the Laplacian of the probability matrix P, and Θi denotes the i-th row of matrix Θ. Combining the above with (91), we get, P n b N = N o 1 M 2 1 P ℓ1,ℓ2 [M](8ne2α/d1) Θ(ℓ1) Θ(ℓ2) 2 L + log 2 log M . (104) The remainder of the proof relies on the following probabilistic packing. Learning from Comparisons and Choices Lemma 25 For each r {1, . . . , d1}, and for any positive δ > 0 there exists a family of d1 d2 dimensional matrices {Θ(1), . . . , Θ(M(δ))} with cardinality M(δ) = exp(rd1/256) such that each matrix is rank r and the following bounds hold: Θ(ℓ) L δ , for all ℓ [M] (105) Θ(ℓ1) Θ(ℓ2) L δ , for all ℓ1, ℓ2 [M] (106) Θ(ℓ) Ω α , for all ℓ [M] , (107) with α = (8δ/d2) 2 log d for d = (d1 + d2)/2. Now if we assume δ αd2/8 2 log d, we get Θ(ℓ) Ωα for ℓ [M]. The above lemma also implies that Θ(ℓ1) Θ(ℓ2) 2 F 4δ2 which implies, P n b N = N o 1 32ne2αδ2/d1 + log 2 where the last inequality holds when δ (e α/128) p rd2 1/n. Along with (89), this proves that, inf bΘ sup Θ Ωα E h bΘ Θ L for all δ min{αd2/8 2 log d, (e α/128) p rd2 1/n}. Similarly using the following lemma, Lemma 26, we can prove that, inf bΘ sup Θ Ωα E h bΘ Θ L for all δ min{α rd1/tr p (Lr) , (e α/128) p Lemma 26 For each r {1, . . . , d1}, and for any positive δ > 0 there exists a family of d1 d2 dimensional matrices {Θ(1), . . . , Θ(M(δ))} with cardinality M(δ) = exp(rd1/256) such that each matrix is rank r and the following bounds hold: Θ(ℓ) L δ , for all ℓ [M] (111) Θ(ℓ1) Θ(ℓ2) L δ , for all ℓ1, ℓ2 [M] (112) Θ(ℓ) Ω α , for all ℓ [M] , (113) with α = δ p tr ((Lr) )/ rd1, where Lr is the (best) rank r approximation of L. Now combining (109) and (110), and maximizing the RHS proves the theorem. Negahban, Oh, Thekumparampil, and Xu B.1 Proof of Lemma 25 Following the construction in Negahban and Wainwright (2012), we use probabilistic method to prove the existence of the desired family. We will show that the following procedure succeeds in producing the desired family with probability at least half, which proves its existence. Let d = (d1 + d2)/2, and suppose d2 d1 without loss of generality. For the choice of M = erd2/576, and for each ℓ [M ], generate a rank-r matrix Θ(ℓ) Rd1 d2 as follows: Θ(ℓ) = δ rd2 U(V (ℓ))T Id2 d2 X gig T i g T i gi where U Rd1 r is a random orthogonal basis such that UT U = Ir r and V (ℓ) Rd2 r is a random matrix with each entry V (ℓ) ij { 1, +1} chosen independently and uniformly at random. By construction, notice that, Θ(ℓ) 2 (V (ℓ))T L1/2 2 i [d2] (V (ℓ) i )T LV (ℓ) i δ2 r d2 |||L|||2 X where V (ℓ) i is the i-th column of V (ℓ), since gi G i=1 span the null space of the Laplacian L, |||L|||2 1, and V (ℓ) F = rd2. Now, consider Θ(ℓ1) Θ(ℓ2) 2 F = (δ2/(rd2)) (V (ℓ1) V (ℓ2))T 2 L f(V (ℓ1), V (ℓ2)) which is a function over 2rd2 i.i.d. random Rademacher variables V (ℓ1) and V (ℓ2) which define Θ(ℓ1) and Θ(ℓ2) respectively. Since f is Lipschitz in the following sense, we can apply Mc Diarmid s concentration inequality. For all (V (ℓ1), V (ℓ2)) and (e V (ℓ1), e V (ℓ2)) that differ in only one variable, say e V (ℓ1) = V (ℓ1) + 2eij, for some standard basis matrix eij, we have f(V (ℓ1), V (ℓ2)) f(e V (ℓ1), e V (ℓ2)) = δ2 (V (ℓ1) V (ℓ2))T 2 (V (ℓ1) V (ℓ2) + 2eij)T 2 2e T i L(V (ℓ1) j V (ℓ2) j ) + 4e T i Lei (117) r d2 (2|||Li|||1 V (ℓ1) j V (ℓ2) j + 4Pi) (118) r d2 , (119) where we used the fact that |||Li|||1 = 2Pi 2 and V (ℓ1) V (ℓ2) is entry-wise bounded by 2. The expectation E[f(V (ℓ1), V (ℓ2))] is r d2 E (V (ℓ1) V (ℓ2))T 2 r d2 E (V (ℓ1))T 2 2δ2 , (120) where we use equation (115). Applying Mc Diarmid s inequality with bounded difference 12δ2/(rd2), we get that P n f(V (ℓ1), V (ℓ2)) 2δ2 t o exp n t2 r d2 Learning from Comparisons and Choices Since there are less than (M )2 pairs of (ℓ1, ℓ2), setting t = δ2 and applying the union bound gives P min ℓ1,ℓ2 [M ] Θ(ℓ1) Θ(ℓ2) 2 F δ2 1 exp n r d2 144 + 2 log M o 7 where we used M = exp{rd2/576} and d2 607. We are left to prove that Θ(ℓ) s are in Ω(8δ/d2) 2 log d2 as defined in (10). Since we removed the mean for each connected component, such that Θ(ℓ)gi = 0, i [G] by construction, we only need to show that the maximum entry is bounded by (8δ/d2) 2 log d2. We first prove an upper bound in (124) for a fixed ℓ [M ], and use this to show that there exists a large enough subset of matrices satisfying this bound. From (114), consider (UV T )ij = ui, vj , where ui Rr is the first r entries of a random vector drawn uniformly from the d2-dimensional sphere, and vj Rr is drawn uniformly at random from { 1, +1}r with vj = r. Using Levy s theorem for concentration on the sphere Ledoux (2005), we have P {| ui, vj | t} 2 exp n d2 t2 Notice that by the definition (114), maxi,j |Θ(ℓ) ij | (2δ/ rd2) maxi,j | ui, vj |. Setting t = p (32r/d2) log d2 and taking the union bound over all d1d2 indices, we get P max i,j |Θ(ℓ) ij | 2δ 32 log d2 1 2d1d2 exp n 4 log d2 o 1 for a fixed ℓ [M ]. Consider the event that there exists a subset S [M ] of cardinality M = (1/4)M with the same bound on maximum entry, then from (124) we get P S [M ] such that Θ(ℓ) 2δ 32 log d2 d2 for all ℓ S which is larger than half for our choice of M < M /2. B.2 Proof of Lemma 26 Inspired from the construction in Negahban and Wainwright (2012), we furnish the following probabilistic argument for the existence of the desired family. For the choice of M = erd1/256 , and for each ℓ [M], generate a rank-r matrix Θ(ℓ) Rd1 d2 as follows: Θ(ℓ) = δ rd1 V (ℓ) q Λ r UT r , (126) where the columns of Ur Rd2 r are the top r singular vectors of L = UΛUT , Λr is a diagonal matrix in Rr r and its diagonal elements are the top r singular values of L corresponding to columns of Ur, represents the Moore-Penrose pseudo inverse, and V (ℓ) is a random matrix with each entry V (ℓ) ij { 1, +1} chosen independently and uniformly at random. First by definition, Θ(ℓ) L = (δ/ rd1) V (ℓ) F δ, since V (ℓ) F = rd1. Negahban, Oh, Thekumparampil, and Xu Define f as f(V (ℓ1), V (ℓ2)) Θ(ℓ1) Θ(ℓ2) 2 L = (δ2/(rd1)) V (ℓ1) V (ℓ2) 2 F which is a function of 2rd1 i.i.d. random Rademacher variables. Now we can apply Mc Diarmid s concentration inequality since f is Lipschitz as folows. For all (V (ℓ1), V (ℓ2)) and (e V (ℓ1), e V (ℓ2)) that differ in only one variable, say e V (ℓ1) = V (ℓ1) + 2eij, for some standard basis matrix eij, we have f(V (ℓ1), V (ℓ2)) f(e V (ℓ1), e V (ℓ2)) V (ℓ1) V (ℓ2) 2 V (ℓ1) V (ℓ2) + 2eij 2 r d2 |||2eij|||2 F + δ2 (V (ℓ1) V (ℓ2)), 2eij V (ℓ1) V (ℓ2) |||2eij|||1 r d1 , (127) where the penultimate step is true since (V (ℓ1) V (ℓ2)) is entry-wise bounded by 2. The expectation E[f(V (ℓ1), V (ℓ2))] is r d1 E (V (ℓ1) V (ℓ2)) 2 r d1 E V (ℓ1) 2 = 2 δ2 . (128) Now applying Mc Diarmid s inequality on the function f, we get that P n f(V (ℓ1), V (ℓ2)) 2δ2 t o exp n t2 r d1 Setting t = δ2 and applying the union bound gives us, P min ℓ1,ℓ2 [M] Θ(ℓ1) Θ(ℓ2) 2 F δ2 1 exp n r d1 64 + 2 log M o > 0 . (130) In the last step, we used M = exp{rd1/ 256} . At last we prove that Θ(ℓ) s are in Ωδ q tr((Lr) )/rd1 as defined in (10). Since we know that gi belongs to the kernel of L for all i [G], Θ(ℓ)g = 0 by construction (6). From (126), consider (V q Λ r UT r )ij = vi, q Λ r(ur)j , where (ur)j Rr is the vector of i-th entries of the top r singular vectors of L, and vi Rr is drawn uniformly at random from { 1, +1}r. tr ((Lr) ) . (131) The above inequality proves that Θ(ℓ) is upper bounded as desired. Learning from Comparisons and Choices Appendix C. Proof of Theorem 7 We first introduce some additional notations used in the proof. Recall that L(Θ) is the log likelihood function. Let L(Θ) Rd1 d2 denote its gradient such that ij L(Θ) = L(Θ) Let 2L(Θ) Rd1d2 d1d2 denote its Hessian matrix such that 2 ij,i j L(Θ) = 2L(Θ) Θij Θi j . By the definition of L(Θ) in (32), we have ℓ=1 ei(evi,ℓ pi,ℓ)T , (132) where pi,ℓdenotes the conditional choice probability at ℓ-th position. Precisely, pi,ℓ= P j Si,ℓpj|(i,ℓ)ej where pj|(i,ℓ) is the probability that item j is chosen at ℓ-th position from the top by the user i conditioned on the top ℓ 1 choices such that pj|(i,ℓ) P {vi,ℓ= j|vi,1, . . . , vi,ℓ 1, Si} = eΘ ij/(P j Si,ℓeΘij ) and Si,ℓ Si \ {vi,1, . . . , vi,ℓ 1}, where Si is the set of alternatives presented to the i-th user and vi,ℓis the item ranked at the ℓ-th position by the user i. Notice that for i = i , 2L(Θ) Θij Θi j = 0 and the Hessian is 2L(Θ) Θij Θij = 1 k d1 ℓ=1 I j Si,ℓ pj|(i,ℓ) ℓ=1 I j, j Si,ℓ pj|(i,ℓ)I(j = j ) pj|(i,ℓ)pj |(i,ℓ) . (133) This Hessian matrix is a block-diagonal matrix 2L(Θ) = diag(H(1)(Θ), . . . , H(d1)(Θ)) with H(i)(Θ) = 1 k d1 diag(pi,ℓ) pi,ℓp T i,ℓ . (134) Let = Θ bΘ where bΘ is the optimal solution of the convex program in (31). We first introduce three key technical lemmas. The first lemma follows from Lemma 1 of Negahban and Wainwright (2012), and shows that is approximately low-rank. Lemma 27 If λ 2||| L(Θ )|||2, then we have ||| |||nuc 4 2r||| |||F + 4 min{d1,d2} X j=r+1 σj(Θ ) , (135) for all r [min{d1, d2}]. Proof of the above lemma is omitted because of its similarity to that of Lemma 22. The following lemma provides a bound on the gradient using the concentration in measure of sum of independent random matrices Tropp (2011). Negahban, Oh, Thekumparampil, and Xu Lemma 28 For any positive constant c 1 and k (1/e) d2(4 log d2 + log d1), with probability at least 1 2d c d 3 2 , ||| L(Θ )|||2 4(1 + c) log d k d2 1 max np d1/d2, e2αp 4(1 + c) log(d)(8 log d2 + 2 log d1) log k o . (136) Since we are typically interested in the regime where the number of samples is much smaller than the dimension d1 d2 of the problem, the Hessian is typically not positive definite. However, when we restrict our attention to the vectorized with relatively small nuclear norm, then we can prove restricted strong convexity, which gives the following bound. Lemma 29 (Restricted Strong Convexity for collaborative ranking) Fix any Θ Ωα and assume 24 k min{d2 1, (d2 1 +d2 2)/(2d1)} log d. Under the random sampling model of the alternatives {jiℓ}i [d1],ℓ [k] and the random outcome of the comparisons described in section 1, with probability larger than 1 2d 218, Vec( )T 2L(Θ) Vec( ) e 4α 24 d1d2 ||| |||2 F , (137) for all in A where A = n Rd1 d2 ||| ||| 2α , X j [d2] ij = 0 for all i [d1] and ||| |||2 F µ||| |||nuc o . µ 210 e2α α d2 d1 log d k min{d1, d2} . (139) Building on these lemmas, the proof of Theorem 7 is divided into the following two cases. In both cases, we will show that ||| |||2 F 72 e4αc0λ0 d1d2 ||| |||nuc , (140) with high probability. Applying Lemma 27 proves the desired theorem. We are left to show Eq. (140) holds. Case 1: Suppose ||| |||2 F µ ||| |||nuc. With = Θ bΘ, the Taylor expansion yields L(bΘ) = L(Θ ) L(Θ ), + 1 2Vec( ) 2L(Θ)Vec T ( ), (141) where Θ = abΘ+(1 a)Θ for some a [0, 1]. It follows from Lemma 29 that with probability at least 1 2d 218, L(bΘ) L(Θ ) L(Θ ), + e 4α 48 d1 d2 ||| |||2 F ||| L(Θ )|||2||| |||nuc + e 4α 48 d1 d2 ||| |||2 F . Learning from Comparisons and Choices From the definition of bΘ as an optimal solution of the minimization, we have L(bΘ) L(Θ ) λ |||Θ |||nuc bΘ nuc λ||| |||nuc . By the assumption, we choose λ 480λ0. In view of Lemma 28, this implies that λ 2||| L(Θ )|||2 with probability at least 1 2d 3. It follows that with probability at least 1 2d 3 2d 218, 48d1d2 ||| |||2 F λ + ||| L(Θ )|||2 ||| |||nuc 3λ 2 ||| |||nuc . By our assumption on λ c0λ0, this proves the desired bound in Eq. (140) Case 2: Suppose ||| |||2 F µ ||| |||nuc. By the definition of µ and the fact that c0 480, it follows that µ 72 e4αc0λ0 d1d2, and we get the same bound as in Eq. (140). C.1 Proof of Lemma 28 Define Xi = ei Pk ℓ=1(evi,ℓ pi,ℓ)T such that L(Θ ) = 1 k d1 Pd1 i=1 Xi, which is a sum of d1 independent random matrices. Although |||Xi|||2 can be as large as O(k), this occurs with very low probability. We make this precise in the following lemma and focus on the case where |||Xi|||2 = O( k) for all i [d1]. Lemma 30 For a fixed i [d1] and j [d2], if k (1/e) d2 (4 log d2 + log d1), then the number of times the item j is observed by the user i is at most 8(log d2) + 2(log d1) with probability larger than 1 1/(d4 2d1). Proof is given in the end of this Section. Applying union bound over the d1 items and d2 users, we have the multiplicity in sampling for any item for all users is bounded by 8(log d2) + 2(log d1) with probability at least 1 d 3 2 . We denote this event by A and let I (A) be the indicator function that all the multiplicities in sampling are bounded. We first upper bound |||(P i Xi) I (A)|||2 using the Matrix Bernstein inequality Tropp (2011). |||Xi I (A)|||2 = I (A) ℓ=1 evi,l + I (A) (b) (8(log d2) + 2(log d1)) p min{k, d2} 1 + k(8(log d2) + 2(log d1)) 1 + 2e2α log k k(8(log d2) + 2(log d1))e2α log k , (142) where (a) is by triangle inequality, (b) is because under the given event A each term in P ℓevi,ℓand P l pi,ℓare upper bounded by log d2 and Pk ℓ=1 e2α ℓ log d2 respectively and because there can be at most min{d2, k} non-zero entries in the two vectors P ℓevi,ℓand Negahban, Oh, Thekumparampil, and Xu P ℓpi,ℓand, (c) is due to the fact that k-th harmonic number Pk ℓ=1 1 ℓis upper bounded by log k. We also have, i E Xi XT i I (A) i E Xi XT i i=1 eie T i E evi,ℓ pi,ℓ T evi,ℓ pi,ℓ i=1 eie T i E evi,ℓ pi,ℓ T evi,ℓ pi,ℓ # i=1 eie T i E ℓ=1 e T vi,ℓevi,ℓ p T i,ℓpi,ℓ i=1 eie T i E ℓ=1 e T vi,ℓevi,ℓ 2 = k|||Id1 d1|||2 = k, (143) i=1 E XT i Xi I (A) i=1 E XT i Xi ℓ,ℓ =1 (evi,ℓ pi,ℓ)(evi,ℓ pi,ℓ )T ℓ=1 (evi,ℓ pi,ℓ)(evi,ℓ pi,ℓ)T # ℓ=1 evi,ℓe T vi,ℓ pi,ℓp T i,ℓ ℓ=1 evi,ℓe T vi,ℓ k d2 Id2 d2 By matrix Bernstein inequality Tropp (2011), P ||| L(Θ )I (A)|||2 > t (d1 + d2) exp k2 d2 1 t2/2 (d1k/ min{d2, d1}) + (3e2αk3/2d1(8(log d2) + 2(log d1)) log k t/3) Learning from Comparisons and Choices which gives the tail probability of 2d c for the choice of 4(1 + c) log d k d1 min{d2, d1} , 4(1 + c)e2α log(d) (8(log d2) + 2(log d1)) log k 4(1 + c) log d k1/2 d1 max np d1/d2 , e2αp 4(1 + c) log(d) (8(log d2) + 2(log d1)) log k o . Now with a high probability of 1 2 d3 2 the desired bound is true. C.2 Proof of Lemma 30 In a classical balls-in-bins setting, we consider k as the number of balls and d2 as the number of bins. We can consider the number of balls in a particular bin as the number of times the user i observes item j. Let the event that this number is at least δ be denoted by the event Aj δ. Then, P n Aj δ o k δ 1 dδ 2 ke d2δ δ . Using the fact that (1/x)x a for any x (2 log(1/a))/(log log(1/a)), we let x = d2δ/(ke) to get ke for δ (ke/d2)(2 log(1/a))/(log log(1/a)). Choosing a = (1/d4 2d1)d2/ke, we have P n Aj δ o 1/(d1d4 2), for a choice of δ = 2 log(d4 2d1) 2 log(d4 2d1)/(log((d2/ke) log(d4 2d1))). C.3 Proof of Lemma 29 Recall that the Hessian matrix is a block-diagonal matrix with the i-th block H(i)(Θ) given by (134). We use the following remark from Hajek et al. (2014) to bound the Hessian. Remark 31 (Hajek et al., 2014, Claim 1) Given θ Rr, let p be the column probability vector with pi = eθi/(eθ1 + + eθρ) for each i [ρ] and for any positive integer ρ. If |θi| α, for all i [ρ], then e2α diag(p) pp T 1 By letting 1Si,ℓ= P j Si,ℓej and applying the above claim, we have e2αH(i)(Θ) 1 k d1 1 k ℓ+ 1diag(1Si,ℓ) 1 (k ℓ+ 1)2 1Si,ℓ1T Si,ℓ 1 (k ℓ+ 1)2 X j,j Si,ℓ (ej ej )(ej ej )T j,j Si,ℓ (ej ej )(ej ej )T . Negahban, Oh, Thekumparampil, and Xu Vec( ) 2L(Θ)Vec T ( ) = i=1 ( T ei)T H(i)(Θ)( T ei) e T i (ej ej ) 2 2. By changing the order of the summation, we get that e T i (ej ej ) 2 2 = , ei,ji,ℓ ei,ji,ℓ 2 k X ℓ =1 I σi(ji,ℓ ) min{σi(ji,ℓ), σi(ji,ℓ )} . χi,ℓ,ℓ ,ℓ I σi(ji,ℓ ) min{σi(ji,ℓ), σi(ji,ℓ )} , (146) , ei,ji,ℓ ei,ji,ℓ 2 k X ℓ =1 χi,ℓ,ℓ ,ℓ . Then we have Vec T ( ) 2L(Θ)Vec( ) H( ). To prove the theorem, it suffices to bound H( ) from the below. First, we prove a lower bound on the expectation E[H( )]. Notice that for ℓ = ℓ , the conditional expectation of χi,ℓ,ℓ ,ℓ s, given the set of alternatives presented to user i is ℓ =1 χi,ℓ,ℓ ,ℓ ji,1, . . . , ji,k i = 1 + X exp(θi,ji,ℓ ) exp(θi,ji,ℓ ) + exp(θi,ji,ℓ ) + exp(θi,ji,ℓ) 1 + k 2 1 + 2e2α k 3e2α . E[H( )] = e 2α , ei,ji,ℓ ei,ji,ℓ 2E k X ℓ =1 χi,ℓ,ℓ ,ℓ ji,1, . . . , ji,k i ℓ,ℓ [k] E h , ei,ji,ℓ ei,ji,ℓ 2i j,j =1 ij ij = e 4α(k 1) 3 k d1 d2 ||| |||2 F , (147) Learning from Comparisons and Choices where the last equality holds because P j [d2] ij = 0 for Ω2α and for all i [d1]. We are left to prove that H( ) cannot deviate from its mean too much. Suppose there exists a A such that Eq. (137) is violated, i.e. H( ) < (e 4α/(24 d1d2))||| |||2 F. We will show this happens with a small probability. From Eq. (147), we get that for k 24, E[H( )] H( ) (7k 8) d1 d2 ||| |||2 F (20/3) e 4α 24 d1d2 ||| |||2 F . (148) We use a peeling argument as in (Negahban and Wainwright, 2012, Lemma 3), Van De Geer (2000) to upper bound the probability that Eq. (148) is true. We first construct the following family of subsets to cover A such that A S ℓ=1 Sℓ. Recall µ = 210e2ααd2 p (d1 log d)/(k min{d1, d2}), define in (139). Notice that since for any A, ||| |||2 F µ||| |||nuc µ||| |||F, it follows that ||| |||F µ. Then, we can cover A with the family of sets Sℓ= n Rd1 d2 ||| ||| 2α , βℓ 1µ ||| |||F βℓµ , X j [d2] ij = 0 for all i [d1], and ||| |||nuc β2ℓµ o , where β = p 10/9 and for ℓ {1, 2, 3, . . .}. This implies that when there exists a A such that (148) holds, then there exists an ℓ Z+ such that Sℓand E[H( )] H( ) (20/3) e 4α 24 d1d2 β2(ℓ 1)µ2 4 d1d2 β2ℓµ2 . (149) Applying the union bound over ℓ Z+, we get from (148) and (149) that P A , H( ) < e 4α 24 d1d2 ||| |||2 F E[H( )] H( ) > e 4α 4 d1d2 (βℓµ)2 ) E[H( )] H( ) > e 4α 4 d1d2 (βℓµ)2 ) where we define a new set B(D) such that Sℓ B(βℓµ): B(D) = Rd1 d2 2α, ||| |||F D, X j [d2] ij = 0 for all i [d1], µ||| |||nuc D2 o . (151) The following key lemma provides the upper bound on this probability. Negahban, Oh, Thekumparampil, and Xu Lemma 32 For (16 min{d1, d2} log d)/(3d1) k d2 1 log d, E[H( )] H( ) e 4α exp n e 4α k D4 219α4d1d2 2 Let η = exp e 4α4k(β 1.002)µ4 219α4d1d2 2 . Applying the tail bound to (150), we get P A , H( ) < e 4α 24 d1d2 ||| |||2 F ℓ=1 exp n e 4αk(βℓµ)4 219α4d1d2 2 ℓ=1 exp n e 4α4kℓ(β 1.002)µ4 219α4d1d2 2 where (a) holds because βx x log β x(β 1.002) for the choice of β = p 10/9. By the definition of µ, η = exp n 223 e4αd2 2d1(log d)2(β 1.002) k(min{d1, d2})2 o exp{ 218 log d} , where the last inequality follows from the assumption that k max{d1, d2 2/d1} log d = (d2 2d1 log d)/(min{d1, d2})2, and β 1.002 2 5. Since for d 2, exp{ 218 log d} 1/2 and thus η 1/2, the lemma follows by assembling the last two displayed inequalities. C.4 Proof of Lemma 32 Recall that H( ) = e 2α , ei,ji,ℓ ei,ji,ℓ 2 k X ℓ =1 χi,ℓ,ℓ ,ℓ , with χi,ℓ,ℓ ,ℓ = I σi(ji,ℓ ) min{σi(ji,ℓ), σi(ji,ℓ )} . Let Z = sup B(D) E[H( )] H( ) be the worst-case random deviation of H( ) form its mean. We prove an upper bound on Z by showing that Z E[Z] e 4αD2/(64d1d2) with high probability, and E[Z] 9e 4αD2/(40d1d2). This proves the desired claim in Lemma 32. To prove the concentration of Z, we utilize the random utility model (RUM) theoretic interpretation of the MNL model. The random variable Z depends on the random choice of alternatives {ji,ℓ}i [d1],ℓ [k] and the random k-wise ranking outcomes {σi}i [d1]. The random utility theory, pioneered by Thurstone (1927); Marschak (1960); Luce (1959), tells us that the k-wise ranking from the MNL model has the same distribution as first drawing independent (unobserved) utilities ui,ℓ s of the item ji,ℓfor user i according to the standard Gumbel Cumulative Distribution Function (CDF) F(c Θi,ji,ℓ) with F(c) = e e c, and then ranking the k items for user i according to their respective utilities. Given this definition of the MNL model, we have χi,ℓ,ℓ ,ℓ = I ui,ℓ max{ui,ℓ, ui,ℓ } . Thus Z is a function of independent Learning from Comparisons and Choices choices of the items and their (unobserved) utilities, i.e. Z = f({(ji,ℓ, ui,ℓ)}i [d1],ℓ [k]). Let xi,ℓ= (ji,ℓ, ui,ℓ) and write H( ) as H( , {xi,ℓ}i [d1],ℓ [k]). This allows us to bound the difference and apply Mc Diarmid s tail bound. Note that for any i [d1], ℓ [k], x1,1, . . . , xd1,k, and x i,ℓ, f x1,1, . . . , xi,ℓ, . . . , xd1,k f x1,1, . . . , x i,ℓ, . . . , xd1,k = sup B(D) (E [H( )] H( , x1,1, . . . , xi,ℓ, . . . , xd1,k)) E [H( )] H( , x1,1, . . . , x i,ℓ, . . . , xd1,k) H( , x1,1, . . . , xi,ℓ, . . . , xd1,k) H( , x1,1, . . . , x i,ℓ, . . . , xd1,k) 2 k3 d1 sup B(D) , ei,ji,ℓ ei,ji,ℓ 2 k X ℓ =1 χi,ℓ,ℓ ,ℓ + , ei,ji,ℓ ei,ji,ℓ 2χi,ℓ ,ℓ ,ℓ o (b) 8α2e 2α ℓ =1 χi,ℓ,ℓ ,ℓ + X ℓ ,ℓ [k],ℓ =ℓ , χi,ℓ ,ℓ ,ℓ o where (a) follows because for a fixed i and ℓ, the random variable xi,ℓ= (ji,ℓ, ui,ℓ) can appear in three terms, i.e. P ℓ ,ℓ , ei,ji,ℓ ei,ji,ℓ 2χi,ℓ,ℓ ,ℓ + P ℓ ,ℓ , ei,ji,ℓ ei,ji,ℓ 2χi,ℓ ,ℓ,ℓ + P ℓ ,ℓ , ei,ji,ℓ ei,ji,ℓ 2χi,ℓ ,ℓ ,ℓ, and (b) follows because | ij| 2α for all i, j since B(D). The last inequality follows because in the worst case, P ℓ [k]\{ℓ} Pk ℓ =1 χi,ℓ,ℓ ,ℓ k(k 1)/2 and P ℓ ,ℓ [k],ℓ =ℓ χi,ℓ ,ℓ ,ℓ k(k 1). This holds with equality if σi(ji,ℓ) = k and σi(ji,ℓ) = 1, respectively. By bounded differences inequality, we have P {Z E [Z] t} exp k2 d2 1 t2 27 α4e 4αd1k It follows that for the choice of t = e 4αD2/(64d1d2), P Z E [Z] e 4αD2 exp e 4αk D4 219α4d1d2 2 We are left to prove the upper bound on E[Z] using symmetrization and contraction. Define random variables Yi,ℓ,ℓ ,ℓ ( ) ( i,ji,ℓ i,ji,ℓ )2χi,ℓ,ℓ ,ℓ , (153) where the randomness is in the choice of alternatives ji,ℓ, ji,ℓ , and ji,ℓ , and the outcome of the comparisons of those three alternatives. Negahban, Oh, Thekumparampil, and Xu The main challenge in applying the symmetrization to P ℓ,ℓ ,ℓ [k] Yi,ℓ,ℓ ,ℓ ( ) is that we need to partition the summation over the set [k] [k] [k] into subsets of independent random variables, such that we can apply the standard symmetrization argument. To this end, we prove in the following lemma a generalization of the well-known problem of scheduling a round robin tournament to a tournament of matches involving three teams each. No teams are present in more than one triple in a single round, and we want to minimize the number of rounds to cover all combination of triples are matched. For example, when there are k = 6 teams, there is a simple construction of such a tournament: T1 = {(1, 2, 3), (4, 5, 6)}, T2 = {1, 2, 4), (3, 5, 6)}, T3 = {(1, 2, 5), (3, 4, 6)}, T4 = {(1, 2, 6), (3, 4, 5)}, T5 = {(1, 3, 4), (2, 5, 6)}, T6 = {(1, 3, 5), (2, 4, 6)}, T7 = {(1, 3, 6), (2, 4, 5)}, T8 = {(1, 4, 5), (2, 3, 6)}, T9 = {(1, 4, 6), (2, 3, 5)}, T10 = {(1, 5, 6), (2, 3, 4)}. This is a perfect scheduling of a tournament with three teams in each match. For a general k, the following lemma provides a construction with O(k2) rounds. Lemma 33 There exists a partition (T1, . . . , TN) of [k] [k] [k] for some N 24k2 such that Ta s are disjoint subsets of [k] [k] [k], S a [N] Ta = [k] [k] [k], |Ta| k/3 and for any a [N] the set of random variables in Ta satisfy {Yi,ℓ,ℓ ,ℓ }i [d1],(ℓ,ℓ ,ℓ ) Ta are mutually independent . Now, we are ready to partition the summation. 2 k3 d1 E h sup B(D) E[Yi,ℓ,ℓ ,ℓ ( )] Yi,ℓ,ℓ ,ℓ ( ) i 2 k3 d1 E h sup B(D) (ℓ,ℓ ,ℓ ) Ta E[Yi,ℓ,ℓ ,ℓ ( )] Yi,ℓ,ℓ ,ℓ ( ) i a [N] E h sup B(D) (ℓ,ℓ ,ℓ ) Ta E[Yi,ℓ,ℓ ,ℓ ( )] Yi,ℓ,ℓ ,ℓ ( ) i a [N] E h sup B(D) (ℓ,ℓ ,ℓ ) Ta ξi,ℓ,ℓ ,ℓ Yi,ℓ,ℓ ,ℓ ( ) i a [N] E h sup B(D) (ℓ,ℓ ,ℓ ) Ta ξi,ℓ,ℓ ,ℓ ( i,ji,ℓ i,ji,ℓ )2χi,ℓ,ℓ ,ℓ i ,(154) where the first inequality follows from the fact that sum of the supremum is no less than the supremum of the sum, and the second inequality follows from standard symmetrization argument applied to independent random variables {Yi,ℓ,ℓ ,ℓ ( )}i [d1],(ℓ,ℓ ,ℓ ) Ta with i.i.d. Rademacher random variables ξi,ℓ,ℓ ,ℓ s. Since ( i,ji,ℓ i,ji,ℓ )2χi,ℓ,ℓ ,ℓ 4α| i,ji,ℓ i,ji,ℓ |χi,ℓ,ℓ ,ℓ , we have by the Ledoux-Talagrand contraction inequality Ledoux and Tala- Learning from Comparisons and Choices grand (2013) that E h sup B(D) (ℓ,ℓ ,ℓ ) Ta ξi,ℓ,ℓ ,ℓ ( i,ji,ℓ i,ji,ℓ )2χi,ℓ,ℓ ,ℓ i 8αE h sup B(D) (ℓ,ℓ ,ℓ ) Ta ξi,ℓ,ℓ ,ℓ χi,ℓ,ℓ ,ℓ , ei(eji,ℓ eji,ℓ )T Applying H older s inequality, we get that X (ℓ,ℓ ,ℓ ) Ta ξi,ℓ,ℓ ,ℓ χi,ℓ,ℓ ,ℓ , ei(eji,ℓ eji,ℓ )T (ℓ,ℓ ,ℓ ) Ta ξi,ℓ,ℓ ,ℓ χi,ℓ,ℓ ,ℓ ei(eji,ℓ eji,ℓ )T We are left to prove that the expected value of the right-hand side of the above inequality is bounded by C||| |||nuc p kd1 log d/ min{d1, d2} for some numerical constant C. For i [d1] and (ℓ, ℓ , ℓ ) Ta, let Wi,ℓ,ℓ ,ℓ = ξi,ℓ,ℓ ,ℓ χi,ℓ,ℓ ,ℓ ei(eji,ℓ eji,ℓ )T be independent zeromean random matrices, such that Wi,ℓ,ℓ ,ℓ 2 = ξi,ℓ,ℓ ,ℓ χi,ℓ,ℓ ,ℓ ei(eji,ℓ eji,ℓ )T 2 almost surely, and E[Wi,ℓ,ℓ ,ℓ W T i,ℓ,ℓ ,ℓ ] = E[ ei(eji,ℓ eji,ℓ )T (eji,ℓ eji,ℓ )e T i χi,ℓ,ℓ ,ℓ ] = 2E χi,ℓ,ℓ ,ℓ eie T i 2eie T i , E[W T i,ℓ,ℓ ,ℓ Wi,ℓ,ℓ ,ℓ ] = E[ (eji,ℓ eji,ℓ )e T i ei(eji,ℓ eji,ℓ )T χi,ℓ,ℓ ,ℓ ] E[(eji,ℓ eji,ℓ )e T i ei(eji,ℓ eji,ℓ )T ] = 2 d2 Id2 d2 2 i [d1] (ℓ,ℓ ,ℓ ) Ta E[Wi,ℓ,ℓ ,ℓ W T i,ℓ,ℓ ,ℓ ] i [d1] (ℓ,ℓ ,ℓ ) Ta E[W T i,ℓ,ℓ ,ℓ Wi,ℓ,ℓ ,ℓ ] max 2|Ta| , 2d1|Ta| = 2d1|Ta| min{d1, d2} 2d1k 3 min{d1, d2} , Negahban, Oh, Thekumparampil, and Xu since we have designed Ta s such that |Ta| k/3. Applying matrix Bernstein inequality Tropp (2011) yields the tail bound (ℓ,ℓ ,ℓ ) Ta Wi,ℓ,ℓ ,ℓ (d1 + d2) exp t2/2 σ2 + Choosing t = max p 32kd1 log d/(3 min{d1, d2}), (16 2/3) log d , we obtain with probability at least 1 2d 3, (ℓ,ℓ ,ℓ ) Ta Wi,ℓ,ℓ ,ℓ 32kd1 log d 3 min{d1, d2} , 16 It follows from the fact P i [d1] P (ℓ,ℓ ,ℓ ) Ta Wi,ℓ,ℓ ,ℓ 2 P i,(ℓ,ℓ ,ℓ ) Wi,ℓ,ℓ ,ℓ 2 (ℓ,ℓ ,ℓ ) Ta Wi,ℓ,ℓ ,ℓ 32kd1 log d 3 min{d1, d2}, 16 32kd1 log d 3 min{d1, d2} , where the last inequality follows from the assumption that (16 min{d1, d2} log d)/(3d1) k d2 1 log d. Substituting this in the RHS of Eq. (156), and then together with Eqs. (155) and (154), this gives the following desired bound: a [N] sup B(D) 32kd1 log d 3 min{d1, d2}||| |||nuc d1 log d k min{d1, d2} where the last inequality holds because N 4k2 and µ||| |||nuc D2. C.5 Proof of Lemma 33 Recall that Yi,ℓ,ℓ ,ℓ ( ) = ( i,ji,ℓ i,ji,ℓ )2χi,ℓ,ℓ ,ℓ , as defined in (153). From the random utility model (RUM) interpretation of the MNL model presented in Section 1, it is not difficult to show that Yi,ℓ,ℓ ,ℓ and Yi, ℓ, ℓ , ℓ are mutually independent if the two triples (ℓ, ℓ , ℓ ) and ( ℓ, ℓ , ℓ ) do not overlap, i.e., no index is present in both triples. Now, borrowing the terminologies from round robin tournaments, we construct a schedule for a tournament with k teams where each match involve three teams. Let Ta,b denote Learning from Comparisons and Choices a set of triples playing at the same round, indexed by two integers a {3, . . . , 2k 3} and b {5, . . . , 2k 1}. Hence, there are total N = (2k 5)2 rounds. Each round (a, b) consists of disjoint triples and is defined as Ta,b (ℓ, ℓ , ℓ ) [k] [k] [k] | ℓ< ℓ < ℓ , ℓ+ ℓ = a, and ℓ + ℓ = b . We need to prove that (a) there is no missing triple; and (b) no team plays twice in a single round. First, for any ordered triple (ℓ, ℓ , ℓ ), there exists a {3, . . . , 2k 3} and b {5, . . . , 2k 1} such that ℓ+ ℓ = a and ℓ + ℓ = b. This proves that all ordered triples are covered by the above construction. Next, given a pair (a, b), no two triples in Ta,b can share the same team. Suppose there exists two distinct ordered triples (ℓ, ℓ , ℓ ) and ( ℓ, ℓ , ℓ ) both in Ta,b, and one of the triples are shared. Then, from the two equations ℓ+ ℓ = ℓ+ ℓ = a and ℓ + ℓ = ℓ + ℓ = b, it follows that all three indices must be the same, which is a contradiction. This proves the desired claim for ordered triples. One caveat is that we wanted to cover the whole [k] [k] [k], and not just the ordered triples. In the above construction, for example, a triple (3, 2, 1) does not appear. This can be resolved by simply taking all Ta,b s from the above construction, and make 6 copies of each round, and permuting all the triples in each copy according to the same permutation over {1, 2, 3}. This increases the total rounds to N = 6(2k 5)2 24k2. Note that |Ta,b| k/3 since no item can be in more than one triple. Appendix D. Proof of Estimating Approximate Low-rank Matrices in Corollary 9 We follow closely the proof of a similar corollary in Negahban and Wainwright (2012). First fix a threshold τ > 0, and set r = max{j|σj(Θ ) > τ}. With this choice of r, we have min{d1,d2} X j=r+1 σj(Θ ) = τ min{d1,d2} X min{d1,d2} X q τ 1 qρq . Also, since rτ q Pr j=1 σj(Θ )q ρq, it follows that r ρqτ q/2. Using these bounds, Eq. (35) is now bΘ Θ 2 2c0e4αd1d2λ0 | {z } =A ρqτ q/2 bΘ Θ F + τ 1 qρq . With the choice of τ = A and due to the fact that x2 bx + c implies x (b + b2 + 4c)/2 we have, bΘ Θ F 2 ρq A(2 q)/2 . Appendix E. Proof of the Information-theoretic Lower Bound in Theorem 10 The proof uses information-theoretic methods which reduces the estimation problem to a multiway hypothesis testing problem. To prove a lower bound on the expected error, it Negahban, Oh, Thekumparampil, and Xu suffices to prove that, sup Θ Ωα P bΘ Θ 2 1 2 . (157) To prove the above claim, we follow the standard recipe of constructing a packing in Ωα. Consider a family {Θ(1), . . . , Θ(M(δ)} of d1 d2 dimensional matrices contained in Ωα satisfying Θ(ℓ1) Θ(ℓ2) F δ for all ℓ1, ℓ2, [M(δ)]. We will use M to refer to M(δ) for simplify the notation. Suppose we draw an index L [M(δ)] uniformly at random, and we are given direct observations σi as per MNL model with Θ = Θ(L) on a randomly chosen set of k items Si for each user i [d1]. It follows from triangular inequality that sup Θ Ωα P bΘ Θ 2 P n b L = L o , (158) where b L is the resulting best estimate of the multiway hypothesis testing on L. The generalized Fano s inequality gives P n b L = L|S(1), . . . , S(d1) o 1 I(b L; L) + log 2 log M (159) M 2 1 P ℓ1,ℓ2 [M] DKL(Θ(ℓ1) Θ(ℓ2)) + log 2 log M ,(160) where DKL(Θ(ℓ1) Θ(ℓ2)) denotes the Kullback-Leibler divergence between the distributions of the partial rankings P σ1, . . . , σd1|Θ(ℓ1), S(1), . . . , S(d1) and P σ1, . . . , σd1|Θ(ℓ2), S(1), . . . , S(d1) . The second inequality follows from a standard technique, which we repeat here for completeness. Let Σ = {σ1, . . . , σd1} denote the observed outcome of comparisons. Since L Θ(L) Σ b L form a Markov chain, the data processing inequality gives I(b L; L) I(Σ; L). For simplicity, we drop the conditioning on the set of alternatives {S(1), . . . , S(d1)}, and and let p( ) denotes joint, marginal, and conditional distribution of respective random variables. It follows that I(Σ; L) = X ℓ [M],Σ p(Σ|ℓ) 1 M log p(ℓ, Σ) Σ p(Σ|ℓ) log p(Σ|ℓ) 1 M P ℓ p(Σ|ℓ ) Σ p(Σ|ℓ) log p(Σ|ℓ) ℓ,ℓ [M] DKL(Θ(ℓ1) Θ(ℓ2)) , (161) where the first inequality follows from Jensen s inequality. To compute the KL-divergence, recall that from the RUM interpretation of the MNL model (see Section 1), one can generate sample rankings Σ by drawing random variables with exponential distributions with mean Learning from Comparisons and Choices eΘ ij s. Precisely, let X(ℓ) = [X(ℓ) ij ]i [d1],j Si denote the set of random variables, where X(ℓ) ij is drawn from the exponential distribution with mean e Θ(ℓ) ij . The MNL ranking follows by ordering the alternatives in each Si according to this {X(ℓ) ij }j Si by ranking the smaller ones on the top. This forms a Markov chain L X(L) Σ, and the standard data processing inequality gives DKL(Θ(ℓ1) Θ(ℓ2)) DKL(X(ℓ1) X(ℓ2)) (162) n eΘ(ℓ1) ij Θ(ℓ2) ij (Θ(ℓ1) ij Θ(ℓ2) ij ) 1 o j Si (Θ(ℓ1) ij Θ(ℓ2) ij )2 , (164) where the last inequality follows from the fact that ex x 1 (e2α/(4α2))x2 for any x [ 2α, 2α]. Taking expectation over the randomly chosen set of alternatives, ES(1),...,S(d1)[DKL(Θ(ℓ1) Θ(ℓ2))] e2α k 4 α2 d2 Θ(ℓ1) Θ(ℓ2) 2 Combined with (160), we get that P n b L = L o = ES(1),...,S(d1)[P n b L = L|S(1), . . . , S(d1) o ] (166) M 2 1 P ℓ1,ℓ2 [M](e2αk/(4α2d2)) Θ(ℓ1) Θ(ℓ2) 2 F + log 2 log M , (167) The remainder of the proof relies on the following probabilistic packing. Lemma 34 Let d2 d1 607 be positive integers. Then for each r {1, . . . , d1}, and for any positive δ > 0 there exists a family of d1 d2 dimensional matrices {Θ(1), . . . , Θ(M(δ))} with cardinality M(δ) = (1/4) exp(rd2/576) such that each matrix is rank r and the following bounds hold: Θ(ℓ) F δ , for all ℓ [M] (168) Θ(ℓ1) Θ(ℓ2) F δ , for all ℓ1, ℓ2 [M] (169) Θ(ℓ) Ω α , for all ℓ [M] , (170) with α = (8δ/d2) 2 log d for d = (d1 + d2)/2. We omit the proof of the above lemma since it is similar to that of Lemma 25. Suppose δ αd2/(8 2 log d) such that the matrices in the packing set are entry-wise bounded by α, then the above lemma implies that Θ(ℓ1) Θ(ℓ2) 2 F 4δ2, which gives P n b L = L o 1 α2d2 + log 2 rd 576 2 log 2 1 Negahban, Oh, Thekumparampil, and Xu where the last inequality holds for δ2 (α2d2/(e2αk))((rd/1152) 2 log 2). If we assume rd 3195 for simplicity, this bound on δ can be simplified to δ αe αp r d2 d/(2304 k). Together with (157) and (158), this proves that for all δ min{αd2/(8 2 log d), αe αp r d2 d/(2304 k)}, inf bΘ sup Θ Ωα E h bΘ Θ F Choosing δ appropriately to maximize the right-hand side finishes the proof of the desired claim. Appendix F. Proof of Pairwise Rank Breaking in Theorem 11 Analogous to Section C, we define the gradient L(Θ) as ij L = L(Θ) Θij and ˆΘ Θ , and provide two main technical lemmas. Lemma 35 If λ 2||| L(Θ )|||2, then we have, ||| |||nuc 4 2r||| |||F + 4 min{d1,d2} X j=ρ+1 σj(Θ ) , (171) for all ρ [min{d1, d2}]. Proof This follows from the proof of Lemma 27, which only depends on the convexity of L(Θ). Lemma 36 For any positive constant c 1, if k max{d1, d2 2/d1} log d and d1 4 then with probability at least 1 2d c, ||| L(Θ )|||2 16(c + 4) log d 2(c + 4) log d The proof of this lemma is provided in Section F.1. We will simplify the above lemma by assuming, 2(c + 4) log d k which implies the last term in RHS is less than equal to first term, 2(4 + c) log d 1 4 . (173) (173) simplifies (172) as, ||| L(Θ )|||2 16(c + 4) log d k d2 1 max 1 32d (c + 4) log d 32(c + 4)λ , (174) Learning from Comparisons and Choices where (a) is due to (41) . For Lemma 35 and further proof of Theorem11 we want λ 2||| L(Θ)|||2, therefore we assume that, 32(c + 4)λ, cpλ], for some cp 2 p 32(c + 4) (175) Similar to the k-wise ranking,we will divide the proof into two cases and each part we will prove that ||| |||2 F 36e2α c λ d1d2 ||| |||nuc with probability at least 1 2/dc 2/d213. We define a new constant µ as, 48 d1d2 2 log d k min{d1, d2} . (176) Case 1: Assume µ||| |||nuc ||| |||2 F. Since L is a sum of a linear function of Θ and log-sum-exponential functions, which are convex, we know that L is a convex function of Θ. Therefore, by convexity and Taylor expansion we get, L(ˆΘ) = L(Θ ) L(Θ ), + (177) eΘi,ui,m1 eΘi,ui,m2 eΘi,ui,m1 + eΘi,ui,m2 2 i,ui,m1 i,ui,m2 where Θ = aΘ + (1 a)ˆΘ for some a [0, 1] and P0 = {(i, j)| 1 i < j k}. We lower bound the final term in (177) as, eΘi,ui,m1 eΘi,ui,m2 eΘi,ui,m1 + eΘi,ui,m2 2 i,ui,m1 i,ui,m2 (e α + eα)2 i,ui,m1 i,ui,m2 i,ui,m1 i,ui,m2 where (a) is due to the fact that ij s are upper and lower bounded by α and α respectively. We can bound this term further according to the following Lemma. Lemma 37 For (4 log d)/9 k max{d1, d2 2/d1} log d, with probability at least 1 2d 213, i,ui,m1 i,ui,m2 2 1 3d1d2 ||| |||2 F , (179) for all Ap where, Rd1 d2 ||| ||| 2α, X j [d2] ij = 0 , for all i [d2], and µ||| |||nuc ||| |||2 F Negahban, Oh, Thekumparampil, and Xu The proof is given in Section F.2. Now using Lemma 37 and (178) with high probability we get, eΘi,ui,m1 eΘi,ui,m2 eΘi,ui,m1 + eΘi,ui,m2 2 i,ui,m1 i,ui,m1 24 d1 d2 ||| |||2 F . Incorporating the above inequality in (177) we obtain, 24 d1 d2 ||| |||2 F L(ˆΘ) L(Θ ) + L(Θ ), . (182) From the definition of ˆΘ we have L(ˆΘ) L(Θ ) λ |||Θ |||nuc ˆΘ nuc λ||| |||nuc, and we assume that λ 2 p 32(c + 1) λ, so that λ 2||| L (Θ )|||2 is true with a probability of at least 1 2d c from Lemma 36 . These give us the following with at least probability 1 2d c 2d 213. 24 d1 d2 ||| |||2 F λ||| |||nuc + ||| L(Θ )|||2||| |||nuc 2 ||| |||nuc (183) which gives us, ||| |||2 F 36e2α λ d1d2 ||| |||nuc (a) 36e2α cp λ d1d2 ||| |||nuc , (184) where (a) is due to the fact that λ cpλ. Case 2: Assume ||| |||2 F µ||| |||nuc. Here we prove that µ 36 e2α cpλ d1d2. µ 36 e2α cpλ d1d2 (a) α e2α 16 d1d2 min{d1, d2}d max{d1, d2} max{d1, d2} 2d (d) 1 , (185) where (a) is by substituting µ, λ and cp from (176), (41) and (175) respectively, (b) is because x ex (c) is because d = (max{d1, d2} + min{d1, d2})/2. Now combining the above result with (171) we get with probability at least 1 2d c 2d 213, 1 d1d2 ||| |||2 F 144 2e2αcpλ r||| |||F + 144e2αcpλ min{d1,d2} X j=ρ+1 σj(Θ ) . (186) Learning from Comparisons and Choices F.1 Proof of Lemma 36 From definition of L(Θ) in (39) we get, ei eli(m1,m2) ehi(m1,m2) T exp Θ i,hi(m1,m2) Θ i,li(m1,m2) + 1 , (187) where P0 = {(i, j)| 1 i < j k}. We use the matrix Bernstein inequality Tropp (2011) for the sum of independent matrices. Similar to Lemma 42, we can partition the set of all pairs P0 into (k 1) sets {Pa}a [k 1] of k/2 disjoint pairs each. Define Ya Pd1 i=1 P (m1,m2) Pa Xi,m1,m2, and Xi,m1,m2 exp Θ i,li(m1,m2) exp Θ i,hi(m1,m2) + exp Θ i,li(m1,m2) ei eli(m1,m2) ehi(m1,m2) T , a=1 Ya . (188) For a fixed value of a, it is easy to see that Xi,m1,m2 s are independent. Further, we can easily show that E h Xi,m1,m2 i = 0, and Xi,m1,m2 2 2. We also have, E h Xi,m1,m2 XT i,m1,m2 i 2 eie T i E exp Θ i,li(m1,m2) 2 exp Θ i,ui,m1 + exp Θ i,ui,m2 ui,m1, ui,m1 (a) = 2 eie T i E exp Θ iui,m1 exp Θ iui,m2 exp Θ i,ui,m1 + exp Θ i,ui,m2 2eie T i , (189) where we get (a) from the MNL model for the random choice of li(m1, m2), (b) is due to the fact that xy/(x + y)2 1/4 for all x, y > 0. Negahban, Oh, Thekumparampil, and Xu Let pi,m1,m2 exp(Θ i,ui,m1 )eui,m1 +exp(Θ i,ui,m2 )eui,m2 exp(Θ i,ui,m1 )+exp(Θ i,ui,m2 ) , then we have, E h XT i,m1,m2 Xi,m1,m2 i = E (ehi(m1,m2) pi,m1,m2)(ehi(m1,m2) pi,m1,m2)T = E h ehi(m1,m2)e T hi(m1,m2) i E pi,m1,m2p T i,m1,m2 (a) E h eui,m1e T ui,m1 + eui,m2e T ui,m2 d2 Id2 d2 , (190) where (a) comes from the fact that pi,m1,m2p T i,m1,m2 is a positive semi-definite matrix. Therefore using (189) and (190), we get i [d1] (m1,m2) Pa E h Xi,m1,m2 XT i,m1,m2 i i [d1] (m1,m2) Pa E h XT i,m1,m2 Xi,m1,m2 i Define ρ max {1/4, d1/d2}, then by the matrix Bernstein inequality Tropp (2011), a [k 1], P Ya 2 > t (d1 + d2) exp which gives a tail probability of 2d c/(k 1) for the choice of 4kρ ((1 + c) log d + log(k 1)) , 4 2((1 + c) log d + log(k 1)) For this choice of t, using union bound we can get the probabilistic bound on the derivative of log likelihood as, Lp(Θ ) 2 k 1 Ya 2 (k 1)t max a [k 1] a=1 P Ya 2 t = 2 d c , (193) Learning from Comparisons and Choices where we obtain (a) by pigeon-hole principle which implies that among a set of numbers, there should be, at the very least one number greater or equal to the average of the set of numbers and (b) by union-bound. Assuming k max{d1, d2 2/d1} log d and d1 4, we have, (c + 1) log d + log(k 1) (c + 4) log d , (194) from log(k 1) log max{d1, d2 2/d1} log d log( (d2 1 + d2 2) log d)/d1 log (4 d2 log d)/d1 3 log d. This proves the desired lemma. F.2 Proof of Lemma 37 With a slight abuse of notation, we define H as i,ui,m1 i,ui,m2 and provide a lower bound. The mean is easily computed as E h H( ) i = 1 j [d2] 2 ij 2 j [d2] ij X = 2 d1d2 ||| |||2 F , (196) where we used the fact that P j ij = 0. We want to upper bound the probability that H( ) 1 3d1d2 ||| |||2 F for some A. As in the case of k-wise ranking we using the following peeling argument used in (Negahban and Wainwright, 2012, Lemma 3), Van De Geer (2000). The strategy is to split this above event as union of many event events as follows. We construct the following family of subsets { Sℓ} such that A ℓ=1 Sℓand, Rd1 d2 ||| ||| 2α, βℓ 1µ ||| |||F βℓµ, j [d2] ij = 0 for all i [d2], and ||| |||nuc β2ℓµ where β = p 10/9 and ℓ {1, 2, 3, . . .}. This is true since, for any A, ||| |||2 F µ||| |||nuc and this implies ||| |||2 F µ||| |||F (or, ||| |||F µ). Also note that, H( ) 1 3d1d2 ||| |||2 F = 2 d1d2 ||| |||2 F H( ) 5 3d1d2 ||| |||2 F = E h H( ) i H( ) 5 3d1d2 ||| |||2 F . (198) Negahban, Oh, Thekumparampil, and Xu Therefore using union bound we get, P A s.t. H( ) 1 3d1d2 ||| |||2 F sup Sℓ (E h H( ) i H( )) 5 3d1d2 ||| |||2 F sup Sℓ (E h H( ) i H( )) 3 2d1d2 (βℓµ)2 ! sup B(βℓµ) (E h H( ) i H( )) 3 2d1d2 (βℓµ)2 ! where B(D) is defined as, Rd1 d2 ||| ||| 2α, ||| |||F D, j [d2] ij = 0 for all i [d2], and µ||| |||nuc D2 ) and (a)is true because for Sl, 5 3d1d2 ||| |||2 F 5 3d1d2 (βℓ 1µ)2 = 3 2d1d2 (βℓµ)2 , (201) and (b) is true because Sℓ B(βℓµ). Now we use following lemma to upper bound (199). Lemma 38 For 4(log d)/3 k d2 log d, sup B(D) (E h H( ) i H( )) 3 2d1d2 D2 ! 2048 α4 d1d2 2 Learning from Comparisons and Choices Proof has been relegated to Section F.3. Now by (199) and Lemma 38 we get, P A s.t. H( ) 1 3d1d2 ||| |||2 F 2048 α4 d1d2 2 ℓ=1 exp 213 9 β4ℓd1d2 2 log2 d k min2{d1, d2} 36 d1d2 2 log2 d k min2{d1, d2} ℓ=1 exp 213 ℓlog d (d) = 1/d213 (e) 2 d213 , (203) where we get (a) by substituting µ from (176), (b) by the fact that for β = p 10/9 and x 1, βx x log β x(β 1) x/32, (c) is obtained by assuming k max{d1, d2 2/d1} log d, we get (d) because we are summing an infinite geometric sequence with common ratio of 1/d213 and (e) is because for d 2, 1/d213 is less than 1/2. F.3 Proof of Lemma 38 With a slight abuse of notations, let Z sup B(D) E h H( ) i H( ) . Notice that Z is a function of d1k random variables, {ui,ℓ}i [d1],ℓ [k]. We apply the Mc Diarmid s bounded differences inequality. Let Z1 and Z2 be two realizations of Z where value of only one random variable ui ,ℓ is changed to u i ,ℓ . Also with a little more abuse of notation the two realizations of H( ) are written as H( , u1,1, . . . , ui ,ℓ , . . . , ud1,k) and H( , u1,1, . . . , u i ,ℓ , . . . , ud1,k). We let be the maximizer of max{ Z1, Z2}. Maximum Negahban, Oh, Thekumparampil, and Xu absolute difference between them is upper bounded as follows, E h H( ) i H( , u1,1, . . . , ui ,ℓ , . . . , ud1,k) E h H( ) i H( , u1,1, . . . , u i ,ℓ , . . . , ud1,k) E h H( ) i H( , u1,1, . . . , ui ,ℓ , . . . , ud1,k) E h H( ) i H( , u1,1, . . . , u i ,ℓ , . . . , ud1,k) H( , u1,1, . . . , ui ,ℓ , . . . , ud1,k) H( , u1,1, . . . , u i ,ℓ , . . . , ud1,k) (b) sup B(D) i ,ui ,ℓ i ,ui ,ℓ 2 i ,ui ,ℓ i ,u i ,ℓ d1 k 2 (k 1) (4α)2 = 32α2 d1k . (204) where (a) follows from the fact that is maximizer of max{ Z1, Z2}, (b) is due to the fact that the terms which change because of u i ,ℓ are the k 1 difference square terms between iui ,ℓ =ℓ and i, ui ,ℓ and (c) is because maximum and minimum value of difference square terms are (4α)2 and 0 respectively. Using Mc Diarmid s bounded differences inequality we get, P{ Z E h Z i ϵ} exp Learning from Comparisons and Choices because of (204) and the fact that there are d1k random variables. We upper bound E h Z i as follows. E h Z i = E sup B(D) (m1,m2) P0 E i, ui,m1 i, ui,m2 2 i, ui,m1 i, ui,m2 (a) E sup B(D) (m1,m2) P0 2 ξi,m1,m2 i, ui,m1 i, ui,m2 (b) E sup B(D) (m1,m2) Pa 2 ξi,m1,m2 i, ui,m1 i, ui,m2 a=1 E sup B(D) (m1,m2) Pa 2 ξi,m1,m2 i, ui,m1 i, ui,m2 where (a) is by standard symmetrization technique as used in k-wise ranking and {ξi,m1,m2}i [d1], m1,m2 [k] are i.i.d. Rademacher variables, (b) is due to the fact that we can partition set of all pairs into k 1 independent sets as in (188) and (c) is because of fact that supremum of sum is less than or equal to sum of supremum and the linearity of expectation. Since | i, ui,m1 i, ui,m2| 4α, we can use Ledoux-Talagrand contraction inequality Ledoux and Talagrand (2013) on (206) to get, a=1 E sup B(D) (m1,m2) Pa 2 ξi,m1,m2 i, ui,m1 i, ui,m2 a=1 E sup B(D) (m1,m2) Pa 4α 2 ξi,m1,m2 i, ui,m1 i, ui,m2 d1 k 2 E sup B(D) sup B(D) ||| |||nuc , (207) where we get (a) by putting Wi,m1,m2 = ξi,m1,m2ei(eui,m1 eui,m2)T and (b) is due to H older s inequality ( x, y |||x|||2|||y|||nuc). Now we use Bernstein s inequality Tropp (2011) to upperbound the above expectation terms. First fix a to value in [k 1]. We can easily show that Wi,m1,m2 is zero mean and, Negahban, Oh, Thekumparampil, and Xu We also get, E h Wi,m1,m2 W T i,m1,m2 i = 2eie T i E h 1 e T ui,m1eui,m2 2eie T i , (209) E h W T i,m1,m2 Wi,m1,m2 i = E h 2eui,m1e T ui,m1 2eui,m1e T ui,m2 d2 Id2 d2 2 d2 2 11d2 d2 d2 Id2 d2 . (210) Therefore, using (209) and (210), the standard deviation of P (i,m1,m2) Zi,m2,m2 is, i [d1] (m1,m2) Pa E h Wi,m2,m2 W T i,m2,m2 i i [d1] (m1,m2) Pa E h W T i,m2,m2 Wi,m2,m2 i 2 2 d1 |||I|||2, d1k 2 2 d2 |||I|||2 = kd1 min{d1, d2} . (211) By matrix Bernstein inequality Tropp (2011), a [k 1], (d1 + d2) exp t2/2 2kd1/ min{d1, d2} + which gives a tail probability of 2d c1 for the choice of 8kd1 ((1 + c1) log d) min{d1, d2} , 4 2 ((1 + c1) log d) 8kd1 ((1 + c1) log d) min{d1, d2} , when k 4(c1 + 1) log d/9 . (212) Therefore a [k 1], 8kd1 ((1 + c1) log d) min{d1, d2} + 2 dc1 Learning from Comparisons and Choices because from (208) we get P i [d1] (m1,m2) Pa Wi,m2,m2 2 P i [d1] (m1,m2) Pa Wi,m2,m2 2 d1k 2( From (207) and (213), putting c1 = 2, we get, 24 kd1 log d min{d1, d2} + sup B(D) ||| |||nuc 24 log d k d1 min{d1, d2} + 2 48 log d k d1 min{d1, d2}D2 1 k min{d1, d2} 48d1d2 2 log d d1d2 , (214) where (a) is obtained because of (200) which gives sup D B(D) ||| |||nuc D2/µ and (b) can be got by assuming k d2 log d. Using the above bound in (205) we get, P{ Z D2/(d1d2) ϵ} P{ Z E h Z i ϵ} exp and using ϵ = D2/(2d1d2) will get us the required bound. Appendix G. Proof of Bundled Choices Theorem 13 We use similar notations and techniques as the proof of Theorem 7 in Appendix C. From the definition of L(Θ) in Eq. (51), we have for the true parameter Θ , the gradient evaluated at the true parameter is i=1 (euie T vi pi) , (216) where pi denotes the conditional probability of the MNL choice for the i-th sample. Precisely, pi = P j1 Si P j2 Ti pj1,j2|Si,Tiej1e T j2 where pj1,j2|Si,Ti is the probability that the pair of items (j1, j2) is chosen at the i-th sample such that pj1,j2|Si,Ti P {(ui, vi) = (j1, j2)|Si, Ti} = eΘ j1,j2/(P j 1 Si,j 2 Ti e Θ j 1,j 2), where (ui, vi) is the pair of items selected by the i-th user among the set of pairs of alternatives Si Ti. The Hessian can be computed as 2L(Θ) Θj1,j2 Θj 1,j 2 = 1 i=1 I (j1, j2) Si Ti pj1,j2|Si,Ti Θj 1,j 2 (217) i=1 I (j1, j2), (j 1, j 2) Si Ti pj1,j2|Si,Ti I((j1, j2) = (j 1, j 2)) pj1,j2|Si,Tipj 1,j 2|Si,Ti , Negahban, Oh, Thekumparampil, and Xu We use 2L(Θ) Rd1d2 d1d2 to denote this Hessian. Let = Θ bΘ where bΘ is an optimal solution to the convex optimization in (49). We introduce the following key technical lemmas. The following lemma provides a bound on the gradient using the concentration of measure for sum of independent random matrices Tropp (2011). Lemma 39 For any positive constant c 1 and n (4(1 + c)e2αd1d2 log d)/ max{d1, d2}, with probability at least 1 2d c, ||| L(Θ )|||2 4(1 + c)e2α max{d1, d2} log d d1 d2 n . (219) Since we are typically interested in the regime where the number of samples is much smaller than the dimension d1 d2 of the problem, the Hessian is typically not positive definite. However, when we restrict our attention to the vectorized with relatively small nuclear norm, then we can prove restricted strong convexity, which gives the following bound. Lemma 40 (Restricted Strong Convexity for bundled choice modeling) Fix any Θ Ωα and assume (min{d1, d2}/ min{k1, k2}) log d n min{d5 log d, k1k2 max{d2 1, d2 2} log d}. Under the random sampling model of the alternatives {jia}i [n],a [k1] from the first set of items [d1], {jib}i [n],b [k1] from the second set of items [d2] and the random outcome of the comparisons described in section 1, with probability larger than 1 2d 225, Vec( )T 2L(Θ) Vec( ) e 2α 8 d1 d2 ||| |||2 F , (220) for all in A where A = n Rd1 d2 ||| ||| 2α , X j1 [d1],j2 [d2] j1j2 = 0 and ||| |||2 F µ ||| |||nuc o . (221) µ 210 α d1d2 log d n min{d1, d2} min{k1, k2} . (222) Building on these lemmas, the proof of Theorem 13 is divided into the following two cases. In both cases, we will show that ||| |||2 F 12 e2αc1λ d1d2 ||| |||nuc , (223) with high probability. Finally, applying an omitted result similar to Lemma 27 proves the desired theorem. We are left to show Eq. (223) holds. Case 1: Suppose ||| |||2 F µ ||| |||nuc. With = Θ bΘ, the Taylor expansion yields L(bΘ) = L(Θ ) L(Θ ), + 1 2Vec( ) 2L(Θ)Vec T ( ), (224) Learning from Comparisons and Choices where Θ = abΘ+(1 a)Θ for some a [0, 1]. It follows from Lemma 40 that with probability at least 1 2d 225, L(bΘ) L(Θ ) ||| L(Θ )|||2||| |||nuc + e 2α 8 d1 d2 ||| |||2 F . From the definition of bΘ as an optimal solution of the minimization, we have L(bΘ) L(Θ ) λ |||Θ |||nuc bΘ nuc λ||| |||nuc . By the assumption, we choose λ 8λ0. In view of Lemma 39, this implies that λ 2||| L(Θ )|||2 with probability at least 1 2d 3. It follows that with probability at least 1 2d 3 2d 225, 8d1d2 ||| |||2 F λ + ||| L(Θ )|||2 ||| |||nuc 3λ 2 ||| |||nuc . By our assumption on λ c1λ0, this proves the desired bound in Eq. (223) Case 2: Suppose ||| |||2 F µ ||| |||nuc. By the definition of µ and the fact that c1 128/ p min{k1, k2}, it follows that µ 12 e2αc1λ d1d2, and we get the same bound as in Eq. (223). G.1 Proof of Lemma 39 Define Xi = (euie T vi pi) such that L(Θ ) = (1/n) Pn i=1 Xi, which is a sum of n independent random matrices. Note that since pi is entry-wise bounded by e2α/(k1k2), |||Xi|||2 1 + e2α k1k2 , i=1 E[Xi XT i ] = i=1 (E[euie T ui] pip T i ) (225) i=1 E[euie T ui] (226) d1 Id1 d1 , (227) where the last inequality follows from the fact that for any given Si, ui will be chosen with probability at most e2α/k1, if it is in the set Si which happens with probability k1/d1. Therefore, i=1 E[Xi XT i ] i=1 E[XT i Xi] Negahban, Oh, Thekumparampil, and Xu Applying matrix Bernstein inequality Tropp (2011), we get P {||| L(Θ )|||2 > t} (d1 + d2) exp n n2t2/2 (e2αn max{d1, d2}/(d1d2)) + ((1 + (e2α/ k1k2))nt/3) which gives the desired tail probability of 2d c for the choice of 4(1 + c)e2α max{d1, d2} log d d1d2n , 4(1 + c)(1 + e2α k1k2 ) log d 4(1 + c)e2α max{d1, d2} log d where the last equality follows from the assumption, n (4(1+c)e2αd1d2 log d)/ max{d1, d2}. G.2 Proof of Lemma 40 Thee quadratic form of the Hessian defined in (218) can be lower bounded by Vec( )T 2L(Θ) Vec( ) e 2α 2 k2 1 k2 2 n j1,j2 j 1,j 2 2 | {z } H( ) which follows from Remark 31. To lower bound H( ), we first compute the mean: E[H( )] = e 2α 2 k2 1 k2 2 n j1,j2 j 1,j 2 2 d1 d2 ||| |||2 F , (233) where we used the fact that E[P j1 Si,j2 Ti j1,j2] = (k1k2/(d1d2)) P j 1 [d1],j 2 [d2] j 1,j 2 = 0 for Ω2α in (51). We now prove that H( ) does not deviate from its mean too much. Suppose there exists a A defined in (221) such that Eq. (220) is violated, i.e. H( ) < (e 2α/(8d1d2))||| |||2 F. In this case, E[H( )] H( ) 7 e 2α 8d1d2 ||| |||2 F . (234) We will show that this happens with a small probability. We use the same peeling argument as in Appendix C with Sℓ= n Rd1 d2 | ||| ||| 2α, βℓ 1µ ||| |||F βℓµ , X j1 [d1],j2 [d2] j1,j2 = 0, and ||| |||nuc β2ℓµ o , Learning from Comparisons and Choices where β = p 10/9 and for ℓ {1, 2, 3, . . .}, and µ is defined in (222). By the peeling argument, there exists an ℓ Z+ such that Sℓand E[H( )] H( ) 7 e 2α 8d1d2 β2ℓ 2(µ )2 7 e 2α 9 d1d2 β2ℓ(µ )2 . (235) Applying the union bound over ℓ Z+, P A , H( ) < e 2α 8 d1 d2 ||| |||2 F E[H( )] H( ) > 7 e 2α 9d1d2 (βℓµ )2 ) sup B (βℓµ ) E[H( )] H( ) > 7e 2α 9d1d2 (βℓµ )2 ) where we define the set B (D) such that Sℓ B (βℓµ ): B (D) = Rd1 d2 2α, ||| |||F D, X j1 [d1],j2 [d2] j1j2 = 0, µ ||| |||nuc D2 o . The following key lemma provides the upper bound on this probability. Lemma 41 For (min{d1, d2}/ min{k1, k2}) log d n d5 log d, E[H( )] H( ) e 2αD2 exp n n min{k2 1, k2 2} k1k2 D4 210α4d2 1d2 2 Let η = exp nk1k2 min{k2 1,k2 2}(β 1.002)(µ )4 210α4d2 1d2 2 . Applying the tail bound to (236), we get P A , H( ) < e 2α 8 d1d2 ||| |||2 F ℓ=1 exp n n k1k2 min{k2 1, k2 2} (βℓµ )4 210α4d2 1d2 2 ℓ=1 exp n nk1k2 min{k2 1, k2 2}ℓ(β 1.002)(µ )4 210α4d2 1d2 2 where (a) holds because βx x log β x(β 1.002) for the choice of β = p 10/9. By the definition of µ , η = exp n 230 k1k2 max{d2 2, d2 1}(log d)2(β 1.002) o exp{ 225 log d} , where the last inequality follows from the assumption that n k1k2 max{d2 1, d2 2} log d, and β 1.002 2 5. Since for d 2, exp{ 225 log d} 1/2 and thus η 1/2, the lemma follows by assembling the last two displayed inequalities. Negahban, Oh, Thekumparampil, and Xu G.3 Proof of Lemma 41 Let Z sup B (D) E[H( )] H( ) and consider the tail bound using Mc Diarmid s inequality. Note that Z has a bounded difference of (8α2e 2α max{k1, k2})/(k2 1k2 2n) when one of the k1k2n independent random variables are changed, which gives P {Z E[Z] t} exp k4 1k4 2n2t2 64α4e 4α max{k2 1, k2 2}k1k2n With the choice of t = D2/(4e2α d1d2), this gives P Z E[Z] e 2α 4d1d2 D2 exp k3 1k3 2n D4 210α4d2 1d2 2 max{k2 1, k2 2} We first construct a partition of the space similar to Lemma 33. Let k min{k1, k2} . (241) Lemma 42 There exists a partition (T1, . . . , TN) of {[k1] [k2]} {[k1] [k2]} for some N 2k2 1k2 2/ k such that Tℓ s are disjoint subsets, S ℓ [N] Tℓ= {[k1] [k2]} {[k1] [k2]}, |Tℓ| k and for any ℓ [N] the set of random variables in Tℓsatisfy {( ji,a,ji,b ji,a ,ji,b )2}i [n],((a,b),(a ,b )) Tℓare mutually independent . where ji,a for i [n] and a [k1] denote the a-th chosen item to be included in the set Si. Now we prove an upper bound on E[Z] using the symmetrization technique. Recall that ji,a is independently and uniformly chosen from [d1] for i [n] and a [k1]. Similarly, ji,b is independently and uniformly chosen from [d1] for i [n] and b [k2]. 2 k2 1 k2 2 n E a,a [k1] b,b [k2] E ji,a,ji,b ji,a ,ji,b 2 ji,a,ji,b ji,a ,ji,b 2 2 k2 1 k2 2 n (j1,j2,j 1,j 2) Tℓ E j1,j2 j 1,j 2 2 j1,j2 j 1,j 2 2 k2 1 k2 2 n (j1,j2,j 1,j 2) Tℓ ξi,j1,j2,j 1,j 2 j1,j2 j 1,j 2 2 where the first inequality follows for the fact that the supremum of the sum is smaller than the sum of supremum, and the second inequality follows from standard symmetrization Learning from Comparisons and Choices with i.i.d. Rademacher random variables ξi,j1,j2,j 1,j 2 s. It follows from Ledoux-Talagrand contraction inequality Ledoux and Talagrand (2013) that (j1,j2,j 1,j 2) Tℓ ξi,j1,j2,j 1,j 2 j1,j2 j 1,j 2 2 (j1,j2,j 1,j 2) Tℓ ξi,j1,j2,j 1,j 2 j1,j2 j 1,j 2 sup B (D) ||| |||nuc (j1,j2,j 1,j 2) Tℓ ξi,j1,j2,j 1,j 2 ej1,j2 ej 1,j 2 (j1,j2,j 1,j 2) Tℓ ξi,j1,j2,j 1,j 2 ej1,j2 ej 1,j 2 where the second inequality follows for the H older s inequality and the last inequality follows from µ ||| |||nuc D2 for all B (D). To bound the expected spectral norm of the random matrix, we use matrix Bernstein s inequality. Note that ξi,j1,j2,j 1,j 2c 2 2 almost surely, E[(ej1,j2 ej 1,j 2)(ej1,j2 ej 1,j 2)T ] (2/d1)Id1 d1, and E[(ej1,j2 ej 1,j 2)T (ej1,j2 ej 1,j 2)] (2/d2)Id2 d2. It follows that σ2 = 2n|Tℓ|/ min{d1, d2}, where |Tℓ| min{k1, k2}. It follows that (j1,j2,j 1,j 2) Tℓ ξi,j1,j2,j 1,j 2 ej1,j2 ej 1,j 2 (d1 + d2) exp n t2/2 2n min{k1, k2}/min{d1, d2} + Choosing t = max{ p 64n(min{k1, k2}/ min{d1, d2}) log d, (16 2/3) log d}, we obtain a bound on the spectral norm of t with probability at least 1 2d 7. From the fact that Pn i=1 P (j1,j2,j 1,j 2) Tℓξi,j1,j2,j 1,j 2 ej1,j2 ej 1,j 2 2 (n/ 2) min{k1, k2}, it follows that (j1,j2,j 1,j 2) Tℓ ξi,j1,j2,j 1,j 2 ej1,j2 ej 1,j 2 64 n min{k1, k2} log d min{d1, d2} , (16 2/3) log d o + 2n min{k1, k2} 66 n min{k1, k2} log d min{d1, d2} (249) Negahban, Oh, Thekumparampil, and Xu which follows form the assumption that n min{k1, k2} min{d1, d2} log d and n d5 log d. Substituting this bound in (242), and (246), we get that E[Z] 16e 2ααD2 66 log d n min{k1, k2} min{d1, d2} (250) 4 d1d2 . (251) Appendix H. Proof of the Information-theoretic Lower Bound in Theorem 16 This proof follow closely the proof of Theorem 10 in Appendix E. We apply the generalized Fano s inequality in the same way to get Eq. (160) P n b L = L o 1 M 2 1 P ℓ1,ℓ2 [M] DKL(Θ(ℓ1) Θ(ℓ2)) + log 2 log M , (252) The main challenge in this case is that we can no longer directly apply the RUM interpretation to compete DKL(Θ(ℓ1) Θ(ℓ2)). This will result in over estimating the KL-divergence, because this approach does not take into account that we only take the top winner, out of those k1k2 alternatives. Instead, we compute the divergence directly, and provide an appropriate bound. Let the set of k1 rows and k2 columns chosen in one of the n samples Learning from Comparisons and Choices be S [d1] and T [d2] respectively. Then, DKL(Θ(ℓ1) Θ(ℓ2)) (a) = n d1 k1 d2 k2 X eΘ(ℓ1) ij P i S j T eΘ(ℓ1) i j log eΘ(ℓ1) ij P i S j T eΘ(ℓ2) i j eΘ(ℓ2) ij P i S j T eΘ(ℓ1) i j (b) n d1 k1 d2 k2 X e2Θ(ℓ1) ij P i ,j eΘ(ℓ2) i j eΘ(ℓ1) ij +Θ(ℓ2) ij P i ,j eΘ(ℓ1) i j eΘ(ℓ2) ij P i ,j eΘ(ℓ1) i j 2 k2 1k2 2 d1 k1 d2 k2 X e2Θ(ℓ1) ij Θ(ℓ2) ij X i ,j eΘ(ℓ2) i j eΘ(ℓ1) ij X i ,j eΘ(ℓ1) i j k2 1k2 2 d1 k1 d2 k2 X i ,j eΘ(ℓ2) i j X eΘ(ℓ1) ij eΘ(ℓ2) ij 2 eΘ(ℓ2) ij X i,j (eΘ(ℓ1) ij eΘ(ℓ2) ij ) 2 k1k2 d1 k1 d2 k2 X eΘ(ℓ1) ij eΘ(ℓ2) ij 2 k1k2 d1 k1 d2 k2 X Θ(ℓ1) ij Θ(ℓ2) ij 2 Θ(ℓ1) ij Θ(ℓ2) ij 2 Here (a) is by definition of KL-distance and the fact that S, T are chosen uniformly from all possible such sets and (b) is due to the fact that log(x) x 1 with x = (eΘ(ℓ1) ij P i S,j T eΘ(ℓ2) i j )/(eΘ(ℓ2) ij P i S,j T eΘ(ℓ1) i j ). The constants at (c) is due to the fact that each element of Θ(ℓ1) is upper bounded by α and lower bounded by α. We can get (d) by removing the second term which is always negative, and using the bond of α. (e) is obtained because ex where α x α is Lipschitz continuous with Lipschitz constant eα. At last (f) is obtained by simple counting of the occurrences of each ij. Thus we have, P n b L = L o 1 M 2 1 P ℓ1,ℓ2 [M] ne5α Θ(ℓ2) ij Θ(ℓ2) ij 2 log M , (253) The remainder of the proof relies on the following probabilistic packing. Lemma 43 Let d2 d1 be sufficiently large positive integers. Then for each r {1, . . . , d1}, and for any positive δ > 0 there exists a family of d1 d2 dimensional matrices {Θ(1), . . . , Θ(M(δ))} with cardinality M(δ) = (1/4) exp(rd2/576) such that each matrix is rank r and Negahban, Oh, Thekumparampil, and Xu the following bounds hold: Θ(ℓ) F δ , for all ℓ [M] (254) Θ(ℓ1) Θ(ℓ2) F 1 2δ , for all ℓ1, ℓ2 [M] (255) Θ(ℓ) Ω α , for all ℓ [M] , (256) with α = (8δ/d2) 2 log d for d = (d1 + d2)/2. Suppose δ αd2/(8 p 2 log d) such that the matrices in the packing set are entry-wise bounded by α, then the above lemma 43 implies that Θ(ℓ1) Θ(ℓ2) 2 F 4δ2, which gives P n b L = L o 1 d1d2 + log 2 rd2 576 2 log 2 1 where the last inequality holds for δ2 (rd1d2 2/(1152e5αn)) and assuming rd2 1600. Together with (257) and (255), this proves that for all δ min{αd2/(8 p 2 log d), p rd1d2 2/(9216 e5αn)}, inf bΘ sup Θ Ωα E h bΘ Θ F Choosing δ appropriately to maximize the right-hand side finishes the proof of the desired claim. Also by symmetry, we can apply the same argument to get similar bound with d1 and d2 interchanged. H.1 Proof of Lemma 43 We show that the following procedure succeeds in producing the desired family with probability at least half, which proves its existence. Let d = (d1 + d2)/2, and suppose d2 d1 without loss of generality. For the choice of M = erd2/576, and for each ℓ [M ], generate a rank-r matrix Θ(ℓ) Rd1 d2 as follows: Θ(ℓ) = δ rd2 U(V (ℓ))T δ rd2 1T U(V (ℓ))T 1 d1d2 11T , (258) where U Rd1 r is a random orthogonal basis such that UT U = Ir r and V (ℓ) Rd2 r is a random matrix with each entry V (ℓ) ij { 1, +1} chosen independently and uniformly at random. By construction, notice that Θ(ℓ) F (δ/ rd2) U(V (ℓ))T F = δ. Now, by triangular inequality, we have Θ(ℓ1) Θ(ℓ2) F U(V (ℓ1) V (ℓ2))T F δ |1T U(V (ℓ1) V (ℓ2))T 1| V (ℓ1) V (ℓ2) F | {z } A |1T U(V (ℓ1))T 1| | {z } B +|1T U(V (ℓ2))T 1| . Learning from Comparisons and Choices We will prove that the first term is bounded by A rd2 with probability at least 7/8 for all M matrices, and we will show that we can find M matrices such that the second term is bounded by B 8 p 2rd2 log(32r) log(32d) with probability at least 7/8. Together, this proves that with probability at least 3/4, there exists M matrices such that Θ(ℓ1) Θ(ℓ2) F δ 1 27 log(32r) log(32d) for all ℓ1, ℓ2 [M] and for sufficiently large d1 and d2. Applying similar Mc Diarmid s inequality as Eq. (122) in Appendix E, it follows that A2 rd2 with probability at least 7/8 for M = erd2/576 and a sufficiently large d2. To prove a bound on B, we will show that for a given ℓ, P n |1T U(V (ℓ))T 1| 8 p 2rd2 log(32r) log(32d) o 7 Then using the similar technique as in (125), it follows that we can find M = (1/4)M matrices all satisfying this bound and also the bound on the max-entry in (260). We are left to prove (259). We apply a series of concentration inequalities. Let H1 be the event that {| V (ℓ) i , 1 | p 2d2 log(32r) for all i [r]}. Then, applying the standard Hoeffding s inequality, we get that P {H1} 15/16, where V (ℓ) i is the i-th column of V (ℓ). We next change the variables and represent 1T U as d1u T U, where u is drawn uniformly at random from the unit sphere and U is a r dimensional subspace drawn uniformly at random. By symmetry, d1u T U have the same distribution as 1T U. Let H2 be the event that {| Ui, (V (ℓ))T 1 | p 16r(d2/d1) log(32r) log(32d) for all i [d1]}, where Ui is the i-th row of U. Then, applying Levy s theorem for concentration on the sphere Ledoux (2005), we have P {H2|H1} 15/16. Finally, let H3 be the event that {| d1 u, U(V (ℓ))T 1| 8 p 2rd2 log(32r) log(32d)}. Then, again applying Levy s concentration, we get P {H3|H1, H2} 15/16. Collecting all three concentration inequalities, we get that with probability at least 13/16, |1T U(V (ℓ))T 1| 8 p 2rd2 log(32r) log(32d), which proves Eq. (259). We are left to prove that Θ(ℓ) s are in Ω(8δ/d2) 2 log d2 as defined in (51). Similar to Eq. (124), applying Levy s concentration gives P max i,j |Θ(ℓ) ij | 2δ 32 log d2 1 2 exp n 2 log d2 o 1 for a fixed ℓ [M ]. Then using the similar technique as in (125), it follows that there exists M = (1/4)M matrices all satisfying this bound and also the bound on B in Eq. (259). A. Agarwal, S. Negahban, and M. Wainwright. Fast global convergence rates of gradient methods for high-dimensional statistical recovery. In In NIPS, pages 37 45, 2010. A. Agarwal, P. Patil, and S. Agarwal. Accelerated spectral ranking. In International Conference on Machine Learning, pages 70 79, 2018. Negahban, Oh, Thekumparampil, and Xu H. Azari Soufiani, D. C. Parkes, and L. Xia. Random utility theory for social choice. In NIPS, pages 126 134, 2012. H. Azari Soufiani, W. Chen, D. C. Parkes, and L. Xia. Generalized method-of-moments for rank aggregation. In Advances in Neural Information Processing Systems 26, pages 2706 2714, 2013. H. Azari Soufiani, D. Parkes, and L. Xia. Computing parametric ranking models via rankbreaking. In Proceedings of The 31st International Conference on Machine Learning, pages 360 368, 2014. J. Barzilai and J. M. Borwein. Two-point step size gradient methods. IMA journal of numerical analysis, 8(1):141 148, 1988. M. E. Ben-Akiva and S. R. Lerman. Discrete choice analysis: theory and application to travel demand, volume 9. MIT press, 1985. A. R. Benson, R. Kumar, and A. Tomkins. A discrete choice model for subset selection. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 37 45. ACM, 2018. S. Bhojanapalli and P. Jain. Universal matrix completion. ar Xiv preprint ar Xiv:1402.2324, 2014. J. Blanchet, G. Gallego, and V. Goyal. A markov chain approximation to choice modeling. In EC, pages 103 104, 2013. C. Borgs, J. Chayes, C. E. Lee, and D. Shah. Iterative collaborative filtering for sparse matrix estimation. ar Xiv preprint ar Xiv:1712.00710, 2017. V. S. Borkar, N. Karamchandani, and S. Mirani. Randomized Kaczmarz for rank aggregation from pairwise comparisons. In Information Theory Workshop (ITW), 2016 IEEE, pages 389 393. IEEE, 2016. S. Boucheron, G. Lugosi, and P. Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, 2013. R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324 345, 1955. J.-F. Cai, E. J. Cand es, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956 1982, 2010. E. J. Cand es and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717 772, 2009. X. Chen, P. N. Bennett, K. Collins-Thompson, and E. Horvitz. Pairwise ranking aggregation in a crowdsourced setting. In Proceedings of the sixth ACM international conference on Web search and data mining, pages 193 202. ACM, 2013. Learning from Comparisons and Choices Y. Chen and C. Suh. Spectral mle: Top-k rank aggregation from pairwise comparisons. In International Conference on Machine Learning, pages 371 380, 2015. Y. Chen, J. Fan, C. Ma, and K. Wang. Spectral method and regularized mle are both optimal for top-k ranking. ar Xiv preprint ar Xiv:1707.09971, 2017. F. Chierichetti, R. Kumar, and A. Tomkins. Learning a mixture of two multinomial logits. In International Conference on Machine Learning, pages 960 968, 2018. C. Chu, P. Leslie, and A. Sorensen. Bundle-size pricing as an approximation to mixed bundling. The American Economic Review, pages 263 303, 2011. S. Cl emen con, G. Lugosi, and N. Vayatis. Ranking and scoring using empirical risk minimization. In International Conference on Computational Learning Theory, pages 1 15. Springer, 2005. M. A. Davenport, Y. Plan, E. van den Berg, and M. Wootters. 1-bit matrix completion. Information and Inference, 3(3):189 223, 2014. P. Diaconis and R. L. Graham. Spearman s footrule as a measure of disarray. Journal of the Royal Statistical Society. Series B (Methodological), pages 262 268, 1977. M. Falahatgar, A. Jain, A. Orlitsky, V. Pichapati, and V. Ravindrakumar. The limits of maxing, ranking, and preference learning. In International Conference on Machine Learning, pages 1426 1435, 2018. V. F. Farias, S. Jagabathula, and D. Shah. A data-driven approach to modeling choice. In NIPS, pages 504 512, 2009. R. Ge, J. D. Lee, and T. Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pages 2973 2981, 2016. D. F. Gleich and L.-h. Lim. Rank aggregation via nuclear norm minimization. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 60 68. ACM, 2011. K. Goldberg, T. Roeder, D. Gupta, and C. Perkins. Eigentaste: A constant time collaborative filtering algorithm. Information Retrieval, 4(2):133 151, 2001. I. C. Gormley and T. B. Murphy. A grade of membership model for rank data. Bayesian Analysis, 4(2):265 295, 2009. P. M. Guadagni and J. D. Little. A logit model of brand choice calibrated on scanner data. Marketing science, 2(3):203 238, 1983. B. Hajek, S. Oh, and J. Xu. Minimax-optimal inference from partial rankings. In Advances in Neural Information Processing Systems, pages 1475 1483, 2014. R. Heckel, M. Simchowitz, K. Ramchandran, and M. J. Wainwright. Approximate ranking from pairwise comparisons. ar Xiv preprint ar Xiv:1801.01253, 2018. Negahban, Oh, Thekumparampil, and Xu D. R. Hunter. MM algorithms for generalized Bradley-Terry models. Annals of Statistics, pages 384 406, 2004. P. Jain and S. Oh. Provable tensor factorization with missing data. preprint ar Xiv:1406.2784, 2014. P. Jain, P. Netrapalli, and S. Sanghavi. Low-rank matrix completion using alternating minimization. In STOC, pages 665 674, 2013. M. Jang, S. Kim, C. Suh, and S. Oh. Top-k ranking from pairwise comparisons: When spectral ranking is optimal. ar Xiv preprint ar Xiv:1603.04153, 2016. M. Jang, S. Kim, C. Suh, and S. Oh. Optimal sample complexity of m-wise data for top-k ranking. In Advances in Neural Information Processing Systems, pages 1685 1695, 2017. L. R. F. Jr. Solution of a ranking problem from binary comparisons. The American Mathematical Monthly, 64(8):28 33, 1957. S. Katariya, L. Jain, N. Sengupta, J. Evans, and R. Nowak. Adaptive sampling for coarse ranking. ar Xiv preprint ar Xiv:1802.07176, 2018. J. Katz-Samuels and C. Scott. Nonparametric preference completion. ar Xiv preprint ar Xiv:1705.08621, 2017. E. Kazemi, L. Chen, S. Dasgupta, and A. Karbasi. Comparison based learning from weak oracles. ar Xiv preprint ar Xiv:1802.06942, 2018. R. H. Keshavan, A. Montanari, and S. Oh. Matrix completion from a few entries. Information Theory, IEEE Transactions on, 56(6):2980 2998, 2010a. R. H. Keshavan, A. Montanari, and S. Oh. Matrix completion from noisy entries. Journal of Machine Learning Research, 11(2057-2078):1, 2010b. A. Khetan and S. Oh. Data-driven rank breaking for efficient rank aggregation. In Proceedings of the 33rd International Conference on Machine Learning, volume 48, 2016a. A. Khetan and S. Oh. Computational and statistical tradeoffs in learning to rank. In Advances in Neural Information Processing Systems 29, pages 739 747, 2016b. M. Ledoux. The concentration of measure phenomenon. American Mathematical Soc., 2005. M. Ledoux and M. Talagrand. Probability in Banach Spaces: isoperimetry and processes. Springer Science & Business Media, 2013. D. R. Luce. Individual Choice Behavior. Wiley, New York, 1959. L. v. d. Maaten and G. Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579 2605, 2008. J. Marschak. Binary-choice constraints and random utility indicators. In Proceedings of a symposium on mathematical methods in the social sciences, volume 7, pages 19 38, 1960. Learning from Comparisons and Choices A. K. Massimino and M. A. Davenport. As you like it: Localization via paired comparisons. ar Xiv preprint ar Xiv:1802.10489, 2018. L. Maystre and M. Grossglauser. Fast and accurate inference of Plackett Luce models. In Advances in Neural Information Processing Systems, pages 172 180, 2015. D. Mc Fadden. Conditional logit analysis of qualitative choice behavior. Frontiers in Econometrics, pages 105 142, 1973. D. Mc Fadden. Econometric models for probabilistic choice among products. Journal of Business, 53(3):S13 S29, 1980. D. Mc Fadden and K. Train. Mixed mnl models for discrete response. Journal of applied Econometrics, 15(5):447 470, 2000. T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. ar Xiv preprint ar Xiv:1301.3781, 2013a. T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111 3119, 2013b. S. Negahban and M. J. Wainwright. Restricted strong convexity and (weighted) matrix completion: Optimal bounds with noise. Journal of Machine Learning Research, 2012. S. Negahban, B. Yu, M. J. Wainwright, and P. K. Ravikumar. A unified framework for high-dimensional analysis of m-estimators with decomposable regularizers. In Advances in Neural Information Processing Systems, pages 1348 1356, 2009. S. Negahban, S. Oh, and D. Shah. Iterative ranking from pair-wise comparisons. In NIPS, pages 2483 2491, 2012. S. Oh and D. Shah. Learning mixed multinomial logit model from ordinal data. In Advances in Neural Information Processing Systems, pages 595 603, 2014. G. Ongie, L. Balzano, D. Pimentel-Alarc on, R. Willett, and R. D. Nowak. Tensor methods for nonlinear matrix completion. ar Xiv preprint ar Xiv:1804.10266, 2018. A. Pananjady, C. Mao, V. Muthukumar, M. J. Wainwright, and T. A. Courtade. Worstcase vs average-case design for estimation from fixed pairwise comparisons. ar Xiv preprint ar Xiv:1707.06217, 2017. D. Park, J. Neeman, J. Zhang, S. Sanghavi, and I. S. Dhillon. Preference completion: Large-scale collaborative ranking from pairwise comparisons. Proceedings of The 32nd International Conference on Machine Learning, 2015. R. L. Plackett. The analysis of permutations. Applied Statistics, pages 193 202, 1975. S. Ragain and J. Ugander. Pairwise choice Markov chains. In Advances in Neural Information Processing Systems, pages 3198 3206, 2016. Negahban, Oh, Thekumparampil, and Xu A. Rajkumar and S. Agarwal. A statistical convergence perspective of algorithms for rank aggregation from pairwise data. In Proceedings of The 31st International Conference on Machine Learning, pages 118 126, 2014. P. Ray. Independence of irrelevant alternatives. Econometrica: Journal of the Econometric Society, pages 987 991, 1973. B. Recht, M. Fazel, and P. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM review, 52(3):471 501, 2010. W. Rui, J. Xu, S. Rayadurgam, M. Lelarge, L. Massouli e, and B. Hajek. Clustering and inference from pairwise comparisons. In SIGMETRICS 15 Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, volume 43, page 2, 2015. N. B. Shah, S. Balakrishnan, J. Bradley, A. Parekh, K. Ramchandran, and M. Wainwright. When is it better to compare than to score? ar Xiv preprint ar Xiv:1406.6618, 2014. N. B. Shah, S. Balakrishnan, J. Bradley, A. Parekh, K. Ramchandran, and M. J. Wainwright. Estimation from pairwise comparisons: Sharp minimax bounds with topology dependence. The Journal of Machine Learning Research, 17(1):2049 2095, 2016a. N. B. Shah, S. Balakrishnan, A. Guntuboyina, and M. J. Wainwright. Stochastically transitive models for pairwise comparisons: Statistical and computational issues. In International Conference on Machine Learning, 2016b. P. Sham and D. Curtis. An extended transmission/disequilibrium test (tdt) for multi-allele marker loci. Annals of human genetics, 59(3):323 336, 1995. O. Tamuz, C. Liu, S. Belongie, O. Shamir, and A. T. Kalai. Adaptively learning the crowd kernel. ar Xiv preprint ar Xiv:1105.1033, 2011. L. L. Thurstone. A law of comparative judgment. Psychological review, 34(4):273, 1927. K. Train. Qualitative choice analysis: Theory, econometrics, and an application to automobile demand, volume 10. MIT press, 1986. J. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Comput. Math., 2011. J. A. Tropp. An introduction to matrix concentration inequalities. ar Xiv preprint ar Xiv:1501.01571, 2015. A. Tsokos, S. Narayanan, I. Kosmidis, G. Baio, M. Cucuringu, G. Whitaker, and F. J. Kir aly. Modeling outcomes of soccer matches. ar Xiv preprint ar Xiv:1807.01623, 2018. S. Van De Geer. Empirical Processes in M-estimation, volume 6. Cambridge university press, 2000. M. Vojnovic and S. Yun. Parameter estimation for generalized thurstone choice models. In International Conference on Machine Learning, pages 498 506, 2016. Learning from Comparisons and Choices J. Walker and M. Ben-Akiva. Generalized random utility model. Mathematical Social Sciences, 43:303 343, 2002. M. Wilber, S. Kwak, and S. Belongie. Cost-effective HITs for relative similarity comparisons. In Human Computation and Crowdsourcing (HCOMP), Pittsburgh, 2014. URL http: //arxiv.org/abs/1404.3291. M. Yuan and C. Zhang. On tensor completion via nuclear norm minimization. ar Xiv preprint ar Xiv:1405.1773, 2014. E. Zermelo. Die berechnung der turnier-ergebnisse als ein maximumproblem der wahrscheinlichkeitsrechnung. Mathematische Zeitschrift, 29(1):436 460, 1929.