# Unfolding-Model-Based Visualization: Theory, Method and Applications

Journal of Machine Learning Research 22 (2021) 1-51. Submitted 12/18; Revised 10/20; Published 1/21.

Yunxiao Chen (y.chen186@lse.ac.uk), Department of Statistics, London School of Economics and Political Science, London, WC2A 2AE, UK

Zhiliang Ying (zying@stat.columbia.edu), Department of Statistics, Columbia University, New York, NY 10027, USA

Haoran Zhang (hrzhang16@fudan.edu.cn), Shanghai Center for Mathematical Sciences, Fudan University, Shanghai, 200433, China

Editor: Tina Eliassi-Rad

©2021 Yunxiao Chen, Zhiliang Ying and Haoran Zhang. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v22/18-846.html.

Abstract

Multidimensional unfolding methods are widely used for visualizing item response data. Such methods project respondents and items simultaneously onto a low-dimensional Euclidean space, in which respondents and items are represented by ideal points, with person-person, item-item, and person-item similarities captured by the Euclidean distances between the points. In this paper, we study the visualization of multidimensional unfolding from a statistical perspective. We cast multidimensional unfolding into an estimation problem, where the respondent and item ideal points are treated as parameters to be estimated. An estimator is then proposed for the simultaneous estimation of these parameters. Asymptotic theory is provided for the recovery of the ideal points, shedding light on the validity of model-based visualization. An alternating projected gradient descent algorithm is proposed for the parameter estimation. We provide two illustrative examples, one on users' movie ratings and the other on senate roll call voting.

Keywords: multidimensional unfolding, data visualization, distance matrix completion, item response data, embedding

1. Introduction

Multidimensional unfolding (MDU) methods are widely used as an important data visualization tool in social and behavioral sciences such as psychology (Van Deun et al., 2007; Papesh and Goldinger, 2010), political science (Poole, 2000, 2005; Clinton et al., 2004a; Bakker and Poole, 2013), and marketing (DeSarbo and Hoffman, 1987; DeSarbo et al., 1997; Ho et al., 2010). MDU is regarded as the dominant method for the scaling of both preferential choice and attitude (Mair et al., 2015). The basic idea of MDU is to place both respondents and items in a joint Euclidean space based on data, with the understanding that respondents tend to prefer items that are close to them in the space. This joint visualization may lead to better understanding and interpretation of both the respondents and the items, as compared with visualizing the respondents and the items separately.

MDU has its origin in psychology (Bennett, 1956; Bennett and Hays, 1960; Hays and Bennett, 1961; Coombs, 1964). It is closely related to multidimensional scaling (MDS) methods (Kruskal, 1964; Kruskal and Wish, 1978; Borg and Groenen, 2005) and several other recent approaches to nonlinear dimension reduction and manifold learning (Tenenbaum et al., 2000; Lu et al., 2005; Chen and Buja, 2009; Zhang et al., 2016). MDU methods can be categorized into two types, algorithm-based and model-based.
Algorithm-based methods (e.g., Takane et al., 1977; Greenacre and Browne, 1986; de Leeuw and Mair, 2009) estimate the ideal points by minimizing a certain objective function, known as the stress function in the MDU literature. The classical algorithm-based methods have been implemented in the R package smacof (de Leeuw and Mair, 2009), which is widely used for MDU and MDS analysis. Model-based methods (e.g., DeSarbo and Hoffman, 1987; Hinich, 2005; Bakker and Poole, 2013), in contrast, infer the locations of the ideal points by making use of a probabilistic model. Such a model typically assumes that, up to some measurement error, the similarity between a person and an item is a decreasing function of a suitably defined distance between the corresponding ideal points. The specification of MDU models is closely related to that of item response theory models in psychometrics (see e.g., Embretson and Reise, 2000; Rabe-Hesketh and Skrondal, 2004; Bartholomew et al., 2011).

The MDU problem is closely related to MDS. The key difference is that data for the former do not contain direct measurements of within-set (i.e., person-person and item-item) similarities, while data for MDS typically contain such information. Largely due to the missing information on the within-set similarities, the MDU problem tends to be more challenging. As a result, degenerate solutions are often encountered in applications of MDU methods, in which case the visualization and the corresponding interpretations convey no information (e.g., Busing et al., 2005; Borg and Groenen, 2005), while MDS results tend to be more stable. These empirical observations suggest that it is important to study the validity of MDU solutions, which motivates the research in this paper.

This paper studies the visualization of MDU from a statistical perspective. First, for binary choice data, we formulate the MDU problem as a parameter estimation problem under a general family of probabilistic MDU models, where the respondent and item ideal points are treated as parameters to be estimated. Second, an estimator is proposed for the ideal points, and an asymptotic theory is provided for this estimator, shedding light on the validity of model-based visualization. Finally, an efficient alternating projected gradient algorithm is proposed for the computation, which is scalable to large-scale problems.

We illustrate the proposed method through two applications, one on movie rating and the other on senate roll call voting. The movie dataset is a subset of the well-known MovieLens dataset (Harper and Konstan, 2016). We unfold the 943 users and 338 movies in the dataset. Specifically, we study the users' movie-watching decisions. Based on the ideal points of movies in a two-dimensional space, it is found that one dimension of the space corresponds to the popularity of the movies and the other dimension corresponds to their release dates. A good understanding of the user ideal points is further obtained based on their distances to the movie ideal points. The senate voting dataset is based on the senate roll call voting records from the 108th congress in 2003-2004. Based on the unfolding of the senators and roll calls, it is found that most of the ideal points lie around a one-dimensional line, with the two extremes of the line representing the most liberal and the most conservative political standings.

The rest of the paper is organized as follows.
In Section 2, we introduce a family of MDU models and formulate the problem of joint configuration recovery as an estimation problem. In Section 3, we propose an estimator, for which statistical theory is established that guarantees the consistency of configuration recovery under reasonable conditions. Simulation studies and real data examples are presented in Sections 4 and 5, respectively. We end with discussions on future directions in Section 6. An application to cluster analysis, proofs of the theoretical results, and a numerical comparison with classical MDU methods are provided as supplementary materials.

2. Distance-based MDU

2.1 Distance-based Unfolding Model for Binary Data

Consider $N$ respondents making choices on $J$ binary items (e.g., "agree"/"disagree"). Let $Y_{ij}$ be a random variable denoting the response from respondent $i$ to item $j$, taking value 0 or 1, and let $y_{ij}$ be its realization. For example, such data can come from senate roll call voting, where the respondents are senators and the items correspond to roll calls. Response $Y_{ij} = 1$ means that senator $i$ supports roll call $j$, and $Y_{ij} = 0$ otherwise.

We provide a simulated example in Figure 1 to illustrate MDU analysis. Panel (a) shows the heat map of an observed response matrix consisting of 20 respondents and 10 items, where 0 and 1 responses are represented by red and yellow colors, respectively. Given choice data as in panel (a), an MDU method aims at representing respondents and items by ideal points in the same low-dimensional Euclidean space $\mathbb{R}^K$, as in panel (b) of Figure 1, that can be easily visualized, where the respondent-respondent, respondent-item, and item-item relationships are captured by the distances between the points. The dimension $K$ of the Euclidean space is often set to 2 or 3 for the purpose of visualization.

One way to conduct MDU is via a statistical model. An MDU model typically assumes that each respondent/item is associated with a true ideal point in $\mathbb{R}^K$ that is represented by a $K$-dimensional parameter vector. Let $\theta_i = (\theta_{i1}, ..., \theta_{iK})$ and $a_j = (a_{j1}, ..., a_{jK})$ denote the parameter vectors of respondent $i$ and item $j$, respectively. It is assumed that response $Y_{ij}$ is determined by the Euclidean distance between $\theta_i$ and $a_j$ in $\mathbb{R}^K$. Finally, we use $\Theta_N = (\theta_{ik})_{N \times K}$ and $A_J = (a_{jk})_{J \times K}$ to denote the matrices containing all the person and item ideal points, respectively. Under such a statistical model, the goal of MDU becomes to estimate the person and item parameters based on data.

In this paper, we focus on MDU models taking the form

\[
P(Y_{ij} = 1 \mid \theta_i, a_j) = f(\|\theta_i - a_j\|^2), \tag{1}
\]

where $\|\cdot\|$ denotes the standard $L_2$ norm and $f : [0, \infty) \to [0, 1]$ is a pre-specified link function. It is assumed that the responses $Y_{ij}$ are conditionally independent, given the ideal points $\theta_i$ and $a_j$, $i = 1, ..., N$, $j = 1, ..., J$. This model falls under the general framework of the MDU threshold model for binary choice data (see DeSarbo and Hoffman, 1987). According to the form of (1), the distribution of the data only depends on the squared distance between every pair of person and item ideal points, $d_{ij} = \|\theta_i - a_j\|^2$, $i = 1, ..., N$, $j = 1, ..., J$.

Figure 1: An illustrative example. Panel (a): The heatmap of a response matrix, where 0 and 1 responses are represented by red and yellow colors, respectively. Panel (b): The respondent and item ideal points, where black circles represent respondents and red triangles represent items.
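To make the data-generating process in (1) concrete, here is a minimal simulation sketch (our own illustration in Python, not the paper's R implementation; all sizes and the sampling distribution of the ideal points are assumptions), using the logistic-type link $f(x) = 2/(1 + \exp(x))$ discussed below.

```python
import numpy as np

rng = np.random.default_rng(0)
N, J, K = 20, 10, 2                       # respondents, items, latent dimension

# Hypothetical true ideal points, drawn uniformly from [-1, 1]^K
theta = rng.uniform(-1, 1, size=(N, K))   # person ideal points
a = rng.uniform(-1, 1, size=(J, K))       # item ideal points

# Partial (squared) distance matrix d_ij = ||theta_i - a_j||^2
D = ((theta[:, None, :] - a[None, :, :]) ** 2).sum(axis=2)

# Link f(x) = 2 / (1 + exp(x)): monotone decreasing with f(0) = 1
P = 2.0 / (1.0 + np.exp(D))               # P(Y_ij = 1 | theta_i, a_j)
Y = rng.binomial(1, P)                    # conditionally independent responses
```

Since $D \ge 0$ entrywise, the probabilities stay in $(0, 1]$, matching the range of the link function.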
The matrix $D_{N,J} = (d_{ij})_{N \times J}$ is known as the corresponding partial distance matrix, where the subscripts of $D_{N,J}$ emphasize the dependence of this matrix on the numbers of respondents and items. In addition, the link function $f$ is often assumed to be monotone decreasing, so that a larger distance implies a lower probability of $Y_{ij} = 1$. An example of such a link function is $f(x) = 2/(1 + \exp(x))$. When $f$ takes this form, $P(Y_{ij} = 1 \mid \theta_i, a_j) = 1$ when the distance between $\theta_i$ and $a_j$ is 0, i.e., the two points are identical, and the probability $P(Y_{ij} = 1 \mid \theta_i, a_j)$ decays towards 0 as the distance increases. In what follows, we provide two remarks on this modeling framework.

Remark 1 We remark on the link function $f$, which plays a similar role as the dissimilarity transformation function in the classical MDS and MDU methods (e.g., Chapter 9, Borg and Groenen, 2005). Assuming a pre-specified $f$ is similar to assuming an identity transformation in classical MDU. In classical MDS and MDU, the dissimilarity transformation function can be unknown and estimated from data parametrically or non-parametrically. A similar treatment can be applied to the link function $f$. For example, one may assume $f(\|\theta_i - a_j\|^2) = g(\beta_0 + \beta_1 \|\theta_i - a_j\|^2)$, where $g : \mathbb{R} \to [0, 1]$ is a given monotone decreasing function and $\beta_0$ and $\beta_1$ are additional parameters to be estimated from data together with the person- and item-specific parameters. This form is similar in spirit to the interval transformation in classical MDU. When no constraint is imposed on the scales of the $\theta_i$s and $a_j$s, $\beta_1$ needs to be fixed at a constant (e.g., $\beta_1 = 1$) for model identifiability. One may also estimate $f$ non-parametrically, for example, by using monotone splines. Under suitable regularity conditions, our theoretical development in Section 3 can be extended to the case where $f$ also needs to be estimated from data.

Remark 2 Although we focus on binary data, the introduced modeling framework can be easily extended to other types of preference data, such as rating and ranking data. For example, consider rating data $Y_{ij} \in \{0, ..., T\}$, where $0, 1, ..., T$ are $T + 1$ ordered response categories. A higher category implies a higher level of agreement between the respondent and the item. Then one can assume the following unfolding model

\[
P(Y_{ij} \geq t \mid \theta_i, a_j) = g\bigl(d_t + \|\theta_i - a_j\|^2\bigr), \tag{2}
\]

for $t \in \{1, ..., T\}$, where $g : \mathbb{R} \to [0, 1]$ is a given monotone decreasing function and $d_1, ..., d_T$ are additional model parameters. It implies that the larger the distance, the smaller the probability of $Y_{ij}$ taking a large value. This model is closely related to the graded response model (Samejima, 1997) in item response theory. For another example, consider ranking data consisting of pair-wise comparisons, where each response is a comparison between two items $j$ and $j'$. Following the same idea as above, one may model the probability that item $j$ is preferred over $j'$ as $g(\|\theta_i - a_j\|^2 - \|\theta_i - a_{j'}\|^2)$. That is, the probability decreases with the difference of the items' squared distances to person $i$. Our theoretical results and computational algorithm given below can be adapted to these situations (see the sketch below for the rating-data case).
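A minimal sketch of the rating-data model (2), under the assumptions that $g(x) = 1/(1 + \exp(x))$ and that the hypothetical thresholds satisfy $d_1 < \cdots < d_T$ so that the cumulative probabilities decrease in $t$ (both are our illustrative choices, not prescribed by the paper):

```python
import numpy as np

def category_probs(sq_dist, d):
    """Category probabilities P(Y = t), t = 0,...,T, under model (2):
    P(Y >= t) = g(d_t + ||theta - a||^2) with g(x) = 1/(1 + exp(x)).
    `d` holds increasing thresholds d_1 < ... < d_T."""
    g = lambda x: 1.0 / (1.0 + np.exp(x))
    # Cumulative probabilities: P(Y >= 0) = 1 and P(Y >= T+1) = 0 by convention
    cum = np.concatenate(([1.0], g(np.asarray(d) + sq_dist), [0.0]))
    return cum[:-1] - cum[1:]             # P(Y = t) = P(Y >= t) - P(Y >= t+1)

# Example: T = 3 ordered categories above 0, at squared distance 0.8
print(category_probs(0.8, d=[-1.0, 0.0, 1.0]))   # nonnegative, sums to 1
```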
2.2 Recovery of Configuration

Our main goal is the simultaneous recovery of the ideal points $\theta_i$ and $a_j$, based on the observed binary responses $y_{ij}$, $i = 1, ..., N$, $j = 1, ..., J$. Since the model only relies on the Euclidean distances between the ideal points, two sets of points lead to the same model if they have the same configuration, i.e., if one set of points can be obtained by applying an isometry mapping to the other. This is because the distance between points is invariant under an isometry mapping. An isometry mapping $F$ in $\mathbb{R}^K$ takes the form $F(x) = Ox + b$, $x \in \mathbb{R}^K$, where $O$ is a $K \times K$ orthogonal matrix and $b$ is a vector in $\mathbb{R}^K$ (see, e.g., Olver, 1999). We further denote by $\mathcal{A}_K$ the set of all isometry mappings on $\mathbb{R}^K$. Without additional information, the best possible result one can expect is to recover the ideal points up to an isometry mapping. We refer to this problem as the recovery of the ideal point configuration.

It is worth noting that regularity conditions are needed to ensure the recovery of the configuration. That is, it is possible that there exist multiple sets of ideal points with different configurations that lead to the same distribution of the $Y_{ij}$s. In other words, the configuration of $\{\theta_1, ..., \theta_N, a_1, ..., a_J\}$ may not be unique given only the partial distance matrix. This is known as the situation of degeneration, in which case the visualization conveys no information or can even be misleading. A simple example is given in Figure 2, where the two different configurations in the two panels have the same partial distance matrix.

Figure 2: An example of a degenerate situation: The triangles represent item points and the circles represent person points. The two configurations in $\mathbb{R}^2$ share the same partial distance matrix, where $d_{11} = 1^2$, $d_{12} = 3^2$, $d_{21} = 2^2$, $d_{22} = 2^2$, $d_{31} = 3^2$, and $d_{32} = 1^2$.

Following the above discussion, the validity of unfolding-model-based visualization relies on the accuracy of configuration recovery, a problem to be discussed below. Specifically, we consider the following loss function for configuration recovery,

\[
\min_{F \in \mathcal{A}_K} \frac{\sum_{i=1}^N \|\theta_i^* - F(\hat\theta_i)\|^2 + \sum_{j=1}^J \|a_j^* - F(\hat a_j)\|^2}{N + J}, \tag{3}
\]

where $\theta_i^*$ and $a_j^*$ denote the true ideal points and $\hat\theta_i$ and $\hat a_j$ denote the estimates from data $(y_{ij})_{N \times J}$. Note that (3) quantifies the accuracy of configuration recovery in an average sense, where the isometry indeterminacy is bypassed by the minimization in (3) with respect to all isometry mappings in $\mathcal{A}_K$. We call (3) the average loss for the recovery of the ideal point configuration. Error bounds will be established for (3) under reasonable conditions, which ensures the accurate recovery of the configuration when both $N$ and $J$ are large.
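The minimization over isometries in (3) is an orthogonal Procrustes problem with a closed-form solution via the singular value decomposition. A sketch of evaluating the loss (our own illustration, assuming for simplicity that the true and estimated point sets have the same dimension; the stacked rows contain both the person and the item points):

```python
import numpy as np

def average_config_loss(true_pts, est_pts):
    """Average loss (3): align the estimates to the truth with the best
    isometry F(x) = Ox + b (O orthogonal, b a translation), found by SVD."""
    X, Y = np.asarray(est_pts), np.asarray(true_pts)   # both (N+J) x K
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)              # optimal b uses centroids
    U, _, Vt = np.linalg.svd(Xc.T @ Yc)                # K x K cross-covariance
    aligned = Xc @ U @ Vt + Y.mean(0)                  # best F applied to X
    return np.sum((Y - aligned) ** 2) / len(Y)
```

Because $O$ is only required to be orthogonal, reflections are allowed and no determinant correction on the SVD factors is needed.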
2.3 Connection with Other Scaling Methods

MDU is closely related to MDS, a class of methods for visualizing the similarity pattern between data points (Borg and Groenen, 2005). More precisely, MDS maps a set of variables onto a low-dimensional space, based on data measuring the similarity between the variables. As pointed out in Chapter 14 of Borg and Groenen (2005), MDU can be viewed as a special case of MDS, where the set of variables in MDS is composed of both the respondents and the items, and the item response data $(y_{ij})_{N \times J}$ are regarded as measures of similarity between the respondents and the items, while the similarities within the two sets (i.e., respondents and items) are structurally missing; see Figure 3 for an illustration, which is a reproduction of Figure 14.1 of Borg and Groenen (2005).

Figure 3: In MDU, the diagonal blocks are missing. All we observe are the off-diagonal blocks.

Little statistical theory has been developed for the recovery of configuration based on MDS models. The most relevant work is Zhang et al. (2016), in which an error bound is developed for the recovery of the complete distance matrix, under a linear MDS model without structurally missing data. However, little discussion is provided there on the recovery of the ideal point configuration under an MDU setting. The recovery of configuration is relatively easier under the setting of MDS with no structurally missing data. This is because the complete data matrix of similarities provides sufficient information on the complete distance matrix. The accurate recovery of the complete distance matrix further implies the accurate recovery of the configuration under weak conditions, due to the one-to-one relationship between the complete distance matrix and the ideal point configuration, as described in Proposition 3. Under the MDU setting, the recovery of configuration requires additional regularity conditions, due to the lack of direct measurements of the within-set distances.

Proposition 3 For $\{x_1, ..., x_n\} \subset \mathbb{R}^K$ and $\{y_1, ..., y_n\} \subset \mathbb{R}^K$, if $\|x_i - x_j\| = \|y_i - y_j\|$ for all $i$ and $j$, then there exists an isometry mapping $F \in \mathcal{A}_K$ such that $F(x_i) = y_i$ for all $i = 1, ..., n$.

MDU is also related to other scaling methods for binary data, such as item response theory (IRT; Embretson and Reise, 2000; Reckase, 2009) and multiple correspondence analysis (Gifi, 1990; Le Roux and Rouanet, 2010). Specifically, probabilistic models are available from IRT for multivariate binary data. An IRT model also represents respondents and items by low-dimensional parameter vectors, say $\theta_i$ and $a_j$, and assumes that the probability of $Y_{ij} = 1$ is a function of $\theta_i$ and $a_j$. In this sense, the model introduced above can be viewed as a special IRT model, in which the probability of $Y_{ij} = 1$ is assumed to be a monotone decreasing function of $\|\theta_i - a_j\|$. However, the classical IRT models (see e.g., Embretson and Reise, 2000; Reckase, 2009) are not specified in this way. Consequently, it does not make sense to visualize their person and item parameter vectors jointly. Multiple correspondence analysis is an algorithm-based approach that can be applied to binary data and produces low-dimensional scores for both respondents and items. These score vectors can be plotted jointly in the same space. However, as a common issue with algorithm-based approaches, the meaning of the distance between the score vectors is not clear, and the uncertainty associated with the visualization is hard to quantify.
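The one-to-one relationship behind Proposition 3 is constructive: given a complete matrix of squared distances, classical MDS recovers a configuration (up to isometry) in closed form by double centering (e.g., Borg and Groenen, 2005). A minimal sketch of this standard computation, as a contrast with the MDU setting where the diagonal blocks of the distance matrix are missing:

```python
import numpy as np

def config_from_squared_dists(D2, K):
    """Classical MDS: recover a K-dimensional configuration from a complete
    matrix of squared distances via double centering."""
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n     # centering matrix
    B = -0.5 * J @ D2 @ J                   # Gram matrix of the centered points
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:K]        # top-K eigenpairs
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))
```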
3. Theoretical Results

3.1 Configuration Recovery based on Perturbed Partial Distances

We first study the recovery of configuration from a perturbed partial distance matrix, when both $N$ and $J$ grow to infinity. Let $\theta_i^*$, $i = 1, ..., N$, and $a_j^*$, $j = 1, ..., J$, be the true person and item ideal points in $\mathbb{R}^K$, respectively, and let $D^*_{N,J}$ be the corresponding partial distance matrix. In addition, let $\tilde\theta_i \in \mathbb{R}^{K^+}$ and $\tilde a_j \in \mathbb{R}^{K^+}$ correspond to a perturbed version of the true configuration, satisfying

\[
\|\tilde D_{N,J} - D^*_{N,J}\|_F^2 = o(NJ) \tag{4}
\]

and $K^+ \geq K$, where $\tilde D_{N,J}$ denotes the partial distance matrix given by the perturbed configuration and $\|\cdot\|_F$ denotes the matrix Frobenius norm. One can think of $K^+$ as the latent dimension of the MDU model being applied to the data, and $\tilde\theta_i$ and $\tilde a_j$ as some estimates of the person and item ideal points. For the time being, we treat $K^+$, $\tilde\theta_i$, and $\tilde a_j$ as given. Based on the definition of the matrix Frobenius norm, the left side of (4) has $NJ$ terms, each of which is a squared difference between a true person-item squared distance and its perturbed value. Equation (4) implies that the perturbed partial distance matrix converges to the true one in an average sense, when both $N$ and $J$ grow to infinity.

We denote $\theta_i^+ = ((\theta_i^*)^\top, 0^\top)^\top$ and $a_j^+ = ((a_j^*)^\top, 0^\top)^\top$ in $\mathbb{R}^{K^+}$ as the embeddings of the true ideal points in $\mathbb{R}^{K^+}$, where $0$ denotes a zero vector. In what follows, we show that

\[
\min_{F \in \mathcal{A}_{K^+}} \frac{\sum_{i=1}^N \|\theta_i^+ - F(\tilde\theta_i)\|^2 + \sum_{j=1}^J \|a_j^+ - F(\tilde a_j)\|^2}{N + J}
\]

converges to zero as $N$ and $J$ grow to infinity, under reasonable conditions on the true ideal points. Throughout this paper, we assume that the ideal points are constrained to a compact set in $\mathbb{R}^K$.

A0. There exists a constant $M$ such that $\|\theta_i^*\| \leq M$ and $\|a_j^*\| \leq M$ for all $i$ and $j$.

To impose regularity conditions on the true configuration of the $N + J$ ideal points, which can vary with $N$ and $J$, we introduce the notion of anchor points: two finite sets of points in $\mathbb{R}^K$ satisfying certain regularities that are independent of $N$ and $J$.

Definition 4 Two sets of points, $\{b_1^*, ..., b_{k_1}^*\}$ and $\{c_1^*, ..., c_{k_2}^*\} \subset B_0^K(M)$, are called a collection of anchor points of $\mathbb{R}^K$ if they satisfy conditions A1 and A2 below, where $B_0^K(M)$ denotes the closed ball in $\mathbb{R}^K$ centered at $0$ with radius $M$. Let $D^* = (\|b_i^* - c_j^*\|^2)_{k_1 \times k_2}$ be the partial distance matrix based on the anchor points, whose entries are assumed to be all positive (i.e., there are no identical points).

A1. There exists $\eta > 0$ such that any partial distance matrix $\tilde D \in \mathbb{R}^{k_1 \times k_2}$ satisfying $\|\tilde D - D^*\|_F < \eta$ has a unique configuration.

A2. Both $\{b_1^*, ..., b_{k_1}^*\}$ and $\{c_1^*, ..., c_{k_2}^*\}$ affinely span $\mathbb{R}^K$.

Remark 5 According to Definition 4, we still obtain a collection of anchor points when slightly perturbing the points in a given anchor point collection in $\mathbb{R}^K$. According to condition A1, the anchor points are well-behaved points whose configuration can be uniquely determined by the partial distance matrix, even after a small perturbation. In addition, thanks to A2, the anchor points help to anchor the rest of the points in $\mathbb{R}^K$, i.e., to determine the configuration of a larger set of respondent and item ideal points.

Following the above concept of anchor points, it is intuitive that if there exists a collection of anchor points $\{b_1^*, ..., b_{k_1}^*\}$ and $\{c_1^*, ..., c_{k_2}^*\}$ such that each $b_i^*$ is surrounded by sufficiently many respondent ideal points and each $c_j^*$ is surrounded by sufficiently many item ideal points, then it is relatively easy to recover the configuration of the ideal points from a perturbed partial distance matrix. This intuition is formalized by condition A3 below.

A3. There exist a collection of anchor points $\{b_1^*, ..., b_{k_1}^*\}$ and $\{c_1^*, ..., c_{k_2}^*\} \subset B_0^K(M) \subset \mathbb{R}^K$ and $0 < \epsilon < M/10$ such that the closed balls with $b_1^*, ..., b_{k_1}^*$ and $c_1^*, ..., c_{k_2}^*$ as centers and radius $\epsilon$, denoted by $B_{b_1^*}(\epsilon), ..., B_{b_{k_1}^*}(\epsilon)$ and $B_{c_1^*}(\epsilon), ..., B_{c_{k_2}^*}(\epsilon)$, do not overlap. The following two conditions are required to hold.

(1) For any $b_1 \in B_{b_1^*}(\epsilon), ..., b_{k_1} \in B_{b_{k_1}^*}(\epsilon)$ and $c_1 \in B_{c_1^*}(\epsilon), ..., c_{k_2} \in B_{c_{k_2}^*}(\epsilon)$, the sets $\{b_1, ..., b_{k_1}\}$ and $\{c_1, ..., c_{k_2}\}$ are also a collection of anchor points.

(2) As $N$ and $J$ grow to infinity,

\[
p_i = \liminf_{N \to \infty} \frac{\sum_{l=1}^N 1_{\{\|\theta_l^* - b_i^*\| < \epsilon\}}}{N} > 0, \quad i = 1, ..., k_1,
\]
\[
q_j = \liminf_{J \to \infty} \frac{\sum_{l=1}^J 1_{\{\|a_l^* - c_j^*\| < \epsilon\}}}{J} > 0, \quad j = 1, ..., k_2.
\]
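For a given candidate anchor set, the proportions in A3(2) are simple to compute empirically. A small sketch (our own; `points`, `anchors`, and `eps` are hypothetical inputs):

```python
import numpy as np

def anchor_coverage(points, anchors, eps):
    """For each anchor, the fraction of ideal points inside the eps-ball
    around it -- the quantity whose liminf must be positive in A3(2)."""
    points, anchors = np.asarray(points), np.asarray(anchors)
    dists = np.linalg.norm(points[:, None, :] - anchors[None, :, :], axis=2)
    return (dists < eps).mean(axis=0)     # one proportion per anchor
```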
Theorem 6 Suppose that A0 and A3 are satisfied for the true ideal points $\theta_i^*$ and $a_j^*$, $i = 1, ..., N$, $j = 1, ..., J$. Let $\tilde\theta_i, \tilde a_j \in B_0^{K^+}(M)$ correspond to a perturbed version of the true configuration, for some $K^+ \geq K$, and let $\tilde D_{N,J}$ be the corresponding partial distance matrix. Suppose that $\|\tilde D_{N,J} - D^*_{N,J}\|_F^2 = o(NJ)$ as $N$ and $J$ grow to infinity. Then

\[
\limsup_{N,J \to \infty} \; \min_{F \in \mathcal{A}_{K^+}} \frac{\sum_{i=1}^N \|\theta_i^+ - F(\tilde\theta_i)\|^2 + \sum_{j=1}^J \|a_j^+ - F(\tilde a_j)\|^2}{N + J} \leq C\epsilon,
\]

where $C$ is a constant that does not depend on $N$ and $J$. If there exists a fixed collection of anchor points for which A3 is satisfied for any sufficiently small $\epsilon > 0$, then we have

\[
\limsup_{N,J \to \infty} \; \min_{F \in \mathcal{A}_{K^+}} \frac{\sum_{i=1}^N \|\theta_i^+ - F(\tilde\theta_i)\|^2 + \sum_{j=1}^J \|a_j^+ - F(\tilde a_j)\|^2}{N + J} = 0.
\]

Remark 7 Theorem 6 shows that the configuration can be recovered asymptotically when both $N$ and $J$ grow to infinity and suitable conditions hold. The conditions required by Theorem 6 are quite mild. The theorem first requires all the true and perturbed ideal points to be located in a compact set. Second, as will be shown in Proposition 8 below, condition A3 is satisfied with high probability when the true person and item points are i.i.d. samples from two distributions satisfying mild conditions. Finally, it requires that the perturbation of the partial distance matrix is not too large, i.e., $\|\tilde D_{N,J} - D^*_{N,J}\|_F^2 = o(NJ)$. As will be shown in Proposition 11, this condition holds with high probability when $\tilde D_{N,J}$ is given by a likelihood-based estimator.

Proposition 8 Suppose that $\theta_1^*, ..., \theta_N^*$ and $a_1^*, ..., a_J^*$ are independent and identically distributed samples from distributions $P_1$ and $P_2$, respectively, where $P_1$ and $P_2$ have positive and continuous density functions within a ball $G \subset B_0^K(M)$. Then A3 holds almost surely for any sufficiently small $\epsilon > 0$.

Remark 9 We remark that the constant $C$ is determined, and only determined, by the configuration of the anchor points in A3, according to our proof in the supplementary material. Roughly, the more regular the set of anchor points is (in terms of affinely spanning $\mathbb{R}^K$), the smaller the value of $C$.

Remark 10 As discussed in Section 2.2, we can only recover the ideal points up to an isometry mapping. This isometry mapping may be fixed if one is willing to make further assumptions such as non-negativity (Donoho and Stodden, 2004; Hoyer, 2004) or sparsity (Chen et al., 2020). In that case, one may further interpret each coordinate of the latent space. We leave this problem for future investigation.

3.2 Likelihood-based Estimation

In what follows, we propose a constrained maximum likelihood estimator and show its properties. Given the assumptions of the MDU model, the likelihood function takes the form

\[
L(\theta_1, ..., \theta_N, a_1, ..., a_J) = \prod_{i=1}^N \prod_{j=1}^J f(\|\theta_i - a_j\|^2)^{y_{ij}} \bigl(1 - f(\|\theta_i - a_j\|^2)\bigr)^{1 - y_{ij}}.
\]

Based on this likelihood function, we consider the following estimator:

\[
(\hat\theta_1, ..., \hat\theta_N, \hat a_1, ..., \hat a_J) = \mathop{\arg\min}_{\theta_1, ..., \theta_N, a_1, ..., a_J \in \mathbb{R}^{K^+}} \; -\log L(\theta_1, ..., \theta_N, a_1, ..., a_J)
\]
\[
\text{s.t. } \|\theta_i\| \leq M, \; \|a_j\| \leq M, \; i = 1, ..., N, \; j = 1, ..., J, \tag{7}
\]

where $K^+$ and $M$ are pre-specified. We denote by $\hat D_{N,J}$ the partial distance matrix based on the $\hat\theta_i$s and $\hat a_j$s from (7). We impose the following regularity condition on the link function $f$, which requires $f$ to be neither too steep nor too flat in the feasible domain. Similar conditions are assumed in Davenport et al. (2014) for solving a 1-bit matrix completion problem.

A4. The link function $f : \mathbb{R} \to (0, 1)$ is a smooth and monotone decreasing function satisfying $L_{4M^2} < \infty$ and $\beta_{4M^2} < \infty$, where

\[
L_\alpha = \sup_{|x| \leq \alpha} \frac{|f'(x)|}{f(x)(1 - f(x))}, \quad \text{and} \quad \beta_\alpha = \sup_{|x| \leq \alpha} \frac{f(x)(1 - f(x))}{f'(x)^2}.
\]

Proposition 11 Suppose that A0 and A4 are satisfied and $K^+ \geq K$. Then there exist constants $C_1$ and $C_2$ independent of $N$ and $J$ such that

\[
\frac{1}{NJ} \|\hat D_{N,J} - D^*_{N,J}\|_F^2 \leq C_1 M^2 L_{4M^2} \beta_{4M^2} \sqrt{\frac{(N + J)\bigl(1 + \log(NJ)\bigr)}{NJ}}
\]

with probability at least $1 - C_2/(N + J)$.
Proposition 11 implies that $\|\hat D_{N,J} - D^*_{N,J}\|_F^2 = o_p(NJ)$, which, combined with Theorem 6, leads to Theorem 12 below.

Theorem 12 Suppose that A0, A3 and A4 are satisfied and $K^+ \geq K$. Then

\[
\lim_{N,J \to \infty} P\left( \min_{F \in \mathcal{A}_{K^+}} \frac{\sum_{i=1}^N \|\theta_i^+ - F(\hat\theta_i)\|^2 + \sum_{j=1}^J \|a_j^+ - F(\hat a_j)\|^2}{N + J} \leq C\epsilon \right) = 1, \tag{8}
\]

where $\hat\theta_i$ and $\hat a_j$, $i = 1, ..., N$, $j = 1, ..., J$, are given by (7), $\epsilon$ is from condition A3, and $C$ is a constant independent of $\epsilon$, $N$, and $J$.

Remark 13 We remark that if A3 holds for any sufficiently small $\epsilon$, then (8) implies that the loss

\[
\min_{F \in \mathcal{A}_{K^+}} \frac{\sum_{i=1}^N \|\theta_i^+ - F(\hat\theta_i)\|^2 + \sum_{j=1}^J \|a_j^+ - F(\hat a_j)\|^2}{N + J}
\]

converges to zero in probability. Further note that, according to Proposition 8, A3 holds with high probability for any sufficiently small $\epsilon$ under a random design for the true ideal points. Therefore, the loss can be shown to converge to zero in probability under this random design. This result is summarized in Theorem 14 below.

Theorem 14 Suppose that A0 and A4 are satisfied and $K^+ \geq K$. Further suppose that $\theta_1^*, ..., \theta_N^*$ and $a_1^*, ..., a_J^*$ are independent and identically distributed samples from distributions $P_1$ and $P_2$, respectively, where $P_1$ and $P_2$ have positive and continuous density functions within a ball $G \subset B_0^K(M)$. Then for $\hat\theta_i$ and $\hat a_j$, $i = 1, ..., N$, $j = 1, ..., J$, given by (7), the loss function

\[
\min_{F \in \mathcal{A}_{K^+}} \frac{\sum_{i=1}^N \|\theta_i^+ - F(\hat\theta_i)\|^2 + \sum_{j=1}^J \|a_j^+ - F(\hat a_j)\|^2}{N + J}
\]

goes to 0 in probability as $N$ and $J$ grow to infinity.

Remark 15 We remark that the probability measures in Theorems 12 and 14 are slightly different. The probability in Theorem 12 is based on the conditional distribution of the $Y_{ij}$s given the $\theta_i^*$s and $a_j^*$s, while that in Theorem 14 is based on the joint distribution of $Y_{ij}$, $\theta_i^*$ and $a_j^*$, $i = 1, ..., N$, $j = 1, ..., J$.

Remark 16 A stress function is a squared error loss function that plays an important role in the classical MDS/MDU algorithms. It serves not only as the objective function in the search for the MDS/MDU solution, but also as the basis for assessing the goodness of fit of the solution (Mair et al., 2016). In the proposed framework, the negative joint log-likelihood function plays a similar role as the stress function. It replaces the squared loss in the stress function by a loss function based on the Kullback-Leibler divergence. Goodness-of-fit measures similar to those in classical MDU can be developed under the proposed framework, based on the negative joint log-likelihood.

Remark 17 We remark on the choice of the latent dimension. Theorems 12 and 14 suggest that as long as we choose $K^+$ to be no less than the true dimension $K$, the unfolding result is asymptotically valid. When there is no such prior knowledge about an upper bound of $K$, one can estimate the latent dimension $K$ using data. Several methods from factor analysis and network data analysis may be adapted to the current problem, such as trace-norm regularization (Bach, 2008), cross-validation (Chen and Lei, 2018; Li et al., 2020), and information criteria (Bai and Ng, 2002). We believe that consistency results on the selection of $K$ can be established.

Remark 18 We point out that the result of Proposition 11 can be easily extended to other MDU models, such as models with additional parameters in the link function and models for rating and ranking data. Then, by making use of Theorem 6, the results of Theorems 12 and 14 can also be extended to these models.

We propose an alternating minimization algorithm for solving (7). To handle the constraints in (7), a projected gradient descent update is used in each iteration. For $x \in \mathbb{R}^{K^+}$, we define the following projection operator:

\[
\mathrm{Proj}_M(x) = \mathop{\arg\min}_{\|y\| \leq M} \|y - x\| = \begin{cases} x & \text{if } \|x\| \leq M, \\ M x / \|x\| & \text{if } \|x\| > M. \end{cases}
\]

Algorithm 1 (Alternating minimization algorithm)

Input: Data $(y_{ij})_{N \times J}$, pre-specified dimension $K^+$, constraint constant $M$, iteration number $m = 1$, and initial values $\theta_1^{(0)}, ..., \theta_N^{(0)}$ and $a_1^{(0)}, ..., a_J^{(0)}$ in $\mathbb{R}^{K^+}$.

Alternating minimization: at the $m$th iteration, perform

(a) For each respondent $i$, update

\[
\theta_i^{(m)} = \mathrm{Proj}_M\Bigl( \theta_i^{(m-1)} + \varrho \, \nabla s_i^{(m-1)}\bigl(\theta_i^{(m-1)}\bigr) \Bigr), \quad \text{where}
\]
\[
s_i^{(m-1)}(\theta) = \sum_{j=1}^J \Bigl[ y_{ij} \log f(\|\theta - a_j^{(m-1)}\|^2) + (1 - y_{ij}) \log\bigl(1 - f(\|\theta - a_j^{(m-1)}\|^2)\bigr) \Bigr].
\]

The step size $\varrho > 0$ is chosen by line search.

(b) For each item $j$, update

\[
a_j^{(m)} = \mathrm{Proj}_M\Bigl( a_j^{(m-1)} + \varrho \, \nabla s_j^{(m-1)}\bigl(a_j^{(m-1)}\bigr) \Bigr), \quad \text{where}
\]
\[
s_j^{(m-1)}(a) = \sum_{i=1}^N \Bigl[ y_{ij} \log f(\|\theta_i^{(m)} - a\|^2) + (1 - y_{ij}) \log\bigl(1 - f(\|\theta_i^{(m)} - a\|^2)\bigr) \Bigr].
\]

The step size $\varrho > 0$ is chosen by line search.

Iteratively perform steps (a) and (b) until convergence. Let $m^*$ be the last iteration number upon convergence.

Output: $\hat\theta_1 = \theta_1^{(m^*)}, ..., \hat\theta_N = \theta_N^{(m^*)}$ and $\hat a_1 = a_1^{(m^*)}, ..., \hat a_J = a_J^{(m^*)}$.

Remark 19 Since (7) is not a convex optimization problem, there is no guarantee that Algorithm 1 finds the global optimal solution. However, we point out that the previous theoretical results hold even when $\{\hat\theta_1, ..., \hat\theta_N, \hat a_1, ..., \hat a_J\}$ is not a global optimum. Specifically, Proposition 11 and Theorems 12 and 14 hold for any $\{\hat\theta_1, ..., \hat\theta_N, \hat a_1, ..., \hat a_J\}$ satisfying the constraints in (7) and

\[
L(\hat\theta_1, ..., \hat\theta_N, \hat a_1, ..., \hat a_J) \geq L(\theta_1^*, ..., \theta_N^*, a_1^*, ..., a_J^*). \tag{9}
\]

According to our simulation study, estimates given by Algorithm 1 are likely to satisfy (9).
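For concreteness, here is a compact NumPy sketch of Algorithm 1, written by us under the link $f(x) = 2/(1 + \exp(x + \delta))$ used later in Sections 4 and 5; a fixed step size replaces the paper's line search for brevity, and the optional mask anticipates the missing-data likelihood of Section 3.3. It is an illustration, not the authors' R package.

```python
import numpy as np

def fit_mdu(Y, K, M, lr=0.01, n_iter=500, mask=None, seed=0):
    """Alternating projected gradient ascent on the joint log-likelihood,
    with link f(x) = 2/(1 + exp(x + delta)). Unobserved entries of Y should
    be filled with an arbitrary value (e.g., 0) and zeroed out in `mask`."""
    delta = 0.1
    rng = np.random.default_rng(seed)
    N, J = Y.shape
    if mask is None:
        mask = np.ones_like(Y, dtype=float)    # all entries observed

    theta = rng.normal(scale=0.1, size=(N, K))
    a = rng.normal(scale=0.1, size=(J, K))

    def project(X):                            # Proj_M applied row-wise
        norms = np.linalg.norm(X, axis=1, keepdims=True)
        return X * np.minimum(1.0, M / np.maximum(norms, 1e-12))

    def grad_coef(D):                          # d(log-lik)/d(d_ij), shared term
        f = 2.0 / (1.0 + np.exp(D + delta))
        fprime = -f * (1.0 - f / 2.0)          # derivative of the link
        return mask * fprime * (Y / f - (1 - Y) / (1 - f))

    for _ in range(n_iter):
        D = ((theta[:, None, :] - a[None, :, :]) ** 2).sum(2)
        C = grad_coef(D)                       # N x J
        # d d_ij / d theta_i = 2 (theta_i - a_j); chain rule, summed over j
        theta = project(theta + lr * 2 * (C.sum(1, keepdims=True) * theta - C @ a))
        D = ((theta[:, None, :] - a[None, :, :]) ** 2).sum(2)
        C = grad_coef(D)
        a = project(a + lr * 2 * (C.sum(0)[:, None] * a - C.T @ theta))
    return theta, a
```

Since $d_{ij} \ge 0$ and $\delta > 0$, the link values stay strictly inside $(0, 1)$, so the logarithms in the gradient are always well defined.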
3.3 Analyzing Missing Data

We further discuss the configuration recovery problem when data have many missing values, which is commonly encountered in practice. Denote the matrix $\Omega = (\omega_{ij})_{N \times J}$, where $\omega_{ij} = 1$ indicates that response $y_{ij}$ is observed and $\omega_{ij} = 0$ indicates that $y_{ij}$ is missing. We consider the simple case of uniform missingness, as described in condition A5. We point out that this assumption can be relaxed to handle data with non-uniformly missing entries, following the developments in Cai and Zhou (2013) for solving a 1-bit matrix completion problem.

A5. The entries $\omega_{ij}$ of $\Omega$ are independent and identically distributed Bernoulli random variables with $P(\omega_{ij} = 1) = n/(NJ)$.

Under this condition, on average $n$ entries of the data matrix $(y_{ij})_{N \times J}$ are observable. Thanks to the ignorable missingness, given $\Omega$ and the observed data, the likelihood becomes

\[
L_\Omega(\theta_1, ..., \theta_N, a_1, ..., a_J) = \prod_{\{(i,j) \,:\, \omega_{ij} = 1\}} f(\|\theta_i - a_j\|^2)^{y_{ij}} \bigl(1 - f(\|\theta_i - a_j\|^2)\bigr)^{1 - y_{ij}}.
\]

We still consider a constrained maximum likelihood estimator,

\[
(\hat\theta_1^\Omega, ..., \hat\theta_N^\Omega, \hat a_1^\Omega, ..., \hat a_J^\Omega) = \mathop{\arg\min}_{\theta_1, ..., \theta_N, a_1, ..., a_J \in \mathbb{R}^{K^+}} \; -\log L_\Omega(\theta_1, ..., \theta_N, a_1, ..., a_J)
\]
\[
\text{s.t. } \|\theta_i\| \leq M, \; \|a_j\| \leq M, \; i = 1, ..., N, \; j = 1, ..., J. \tag{10}
\]

Let $\hat D_{N,J}^\Omega$ denote the partial distance matrix for $\hat\theta_1^\Omega, ..., \hat\theta_N^\Omega, \hat a_1^\Omega, ..., \hat a_J^\Omega$. Proposition 20 presents a missing-data version of Proposition 11. It implies that we can still recover the partial distance matrix if $n$ is large enough.

Proposition 20 Suppose that A0, A4 and A5 are satisfied and $K^+ \geq K$. Then there exist constants $C_1$ and $C_2$ independent of $N$ and $J$ such that

\[
\frac{1}{NJ} \|\hat D_{N,J}^\Omega - D^*_{N,J}\|_F^2 \leq C_1 M^2 L_{4M^2} \beta_{4M^2} \sqrt{\frac{N + J}{n}\left(1 + \frac{NJ \log(NJ)}{n(N + J)}\right)} \tag{11}
\]

with probability at least $1 - C_2/(N + J)$.

Remark 21 If $n > (N + J)\log(NJ)$, then the right side of (11) goes to 0 as $N$ and $J$ grow to infinity, which means $\|\hat D_{N,J}^\Omega - D^*_{N,J}\|_F^2 = o_p(NJ)$.
Following the discussion in Section 3.1, $\{\hat\theta_1^\Omega, ..., \hat\theta_N^\Omega, \hat a_1^\Omega, ..., \hat a_J^\Omega\}$ then provides a consistent estimate of the ideal point configuration. This consistency result is summarized in Proposition 22, which is a missing-data version of Theorem 14 under the random design.

Proposition 22 Suppose that A0, A4 and A5 are satisfied, $K^+ \geq K$, and $n > (N + J)\log(NJ)$. Further suppose that $\theta_1^*, ..., \theta_N^*$ and $a_1^*, ..., a_J^*$ are independent and identically distributed samples from distributions $P_1$ and $P_2$, respectively, where $P_1$ and $P_2$ have positive and continuous density functions within a ball $G \subset B_0^K(M)$. Then the loss function

\[
\min_{F \in \mathcal{A}_{K^+}} \frac{\sum_{i=1}^N \|\theta_i^+ - F(\hat\theta_i^\Omega)\|^2 + \sum_{j=1}^J \|a_j^+ - F(\hat a_j^\Omega)\|^2}{N + J}
\]

goes to 0 in probability as $N$ and $J$ grow to infinity.

4. Simulation Studies

In what follows, simulation studies are conducted to verify our theoretical results. Specifically, we consider a random design where the true ideal points are generated from distributions. All the analyses in this section, as well as those in Section 5, are based on our implementation of Algorithm 1 in the statistical software R.¹

1. An R package has been developed and can be downloaded from https://github.com/hrzhang16/mmdu.

4.1 Study I

Setting. We first consider a setting where $K^+$ is chosen to be exactly $K$. We consider MDU in a two-dimensional latent space, i.e., $K = K^+ = 2$. Diverging sequences of $J$ and $N$ are considered, by letting $J = 200, 400, ..., 1000$ and $N = 20J$. For given $N$ and $J$, 100 independent datasets are generated. For each dataset, we first sample the $\theta_i^*$s and $a_j^*$s uniformly from $B_0^2(1)$, the ball in $\mathbb{R}^2$ with center $0$ and radius 1. Then, given the ideal points, response data $Y_{ij}$ are generated under the link function $f(x) = 2/(1 + \exp(x + 0.1))$. It can be easily verified that condition A4 is satisfied for this link function. For each dataset, we obtain an estimate of the ideal points by applying Algorithm 1 ten times with random starting points and then choosing the result that gives the largest likelihood function value. The use of multiple starting points substantially reduces the risk of the algorithm converging to bad local minima. In the application of Algorithm 1, the constraint constant $M$ is set to 1.5.

|        | J = 200 | J = 400 | J = 600 | J = 800 | J = 1000 |
|--------|---------|---------|---------|---------|----------|
| 25%    | 0.0630  | 0.0323  | 0.0218  | 0.0164  | 0.0131   |
| median | 0.0647  | 0.0328  | 0.0222  | 0.0167  | 0.0134   |
| 75%    | 0.0666  | 0.0336  | 0.0227  | 0.0170  | 0.0135   |

Table 1: Simulation Study I: The average squared Frobenius loss of partial distance when $J$ increases from 200 to 1000. For each $J$, the table shows the 25%, 50% and 75% quantiles of the loss based on 100 independent experiments.

|        | J = 200 | J = 400 | J = 600 | J = 800 | J = 1000 |
|--------|---------|---------|---------|---------|----------|
| 25%    | 0.0158  | 0.0079  | 0.0053  | 0.0040  | 0.0032   |
| median | 0.0160  | 0.0080  | 0.0053  | 0.0040  | 0.0032   |
| 75%    | 0.0162  | 0.0080  | 0.0054  | 0.0040  | 0.0032   |

Table 2: Simulation Study I: The average loss for configuration recovery when $J$ increases from 200 to 1000. For each $J$, the table shows the 25%, 50% and 75% quantiles of the loss based on 100 independent experiments.

Results. We first check the obtained likelihood function values for the 100 datasets. As pointed out in Remark 19, Proposition 11 and Theorems 12 and 14 still hold as long as the estimate satisfies (9), even if the global solution to the optimization (7) is not obtained. It is found that, using ten random starting points, the likelihood function at the estimated parameters is always larger than that at the true parameters for all 100 datasets.
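This check against (9) is straightforward to script. A sketch that reuses the hypothetical `fit_mdu` from Section 3 and the simulated objects `Y`, `theta`, and `a` from the sketch in Section 2 (all our own illustrations):

```python
import numpy as np

def loglik(theta, a, Y, delta=0.1):
    """Joint log-likelihood under the link f(x) = 2/(1 + exp(x + delta))."""
    D = ((theta[:, None, :] - a[None, :, :]) ** 2).sum(2)
    f = 2.0 / (1.0 + np.exp(D + delta))
    return np.sum(Y * np.log(f) + (1 - Y) * np.log(1 - f))

# Ten random starts; keep the fit with the largest likelihood
fits = [fit_mdu(Y, K=2, M=1.5, seed=s) for s in range(10)]
theta_hat, a_hat = max(fits, key=lambda p: loglik(*p, Y))

# Criterion (9): the theory applies whenever this comparison holds
assert loglik(theta_hat, a_hat, Y) >= loglik(theta, a, Y)
```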
We then present the average squared Frobenius loss for the recovery of the partial distance matrix, $\|\hat D_{N,J} - D^*_{N,J}\|_F^2/(NJ)$. These results are given in Table 1, which presents the 25%, 50%, and 75% quantiles of the loss based on the 100 datasets. From this table, we see that the loss tends to decrease as the sample size increases, supporting the result of Proposition 11.

Table 2 presents the results on loss (3) for configuration recovery, where the best isometry mapping $F$ in (3) is obtained by solving an optimization problem given the true and estimated ideal points. Similar to the results on partial distance matrix recovery, the loss (3) also decreases towards 0 as $J$ grows large, which is consistent with the result of Theorem 14.

Finally, the computation time on a standard desktop machine² for solving (7) is shown in Table 3. It is worth pointing out that, since the updates of the person and item parameters in each iteration of Algorithm 1 can be run in parallel, the computation can be sped up substantially by parallel computing.

2. All the computation is conducted on a single Intel Gold 6130 core.

|        | J = 200 | J = 400 | J = 600 | J = 800 | J = 1000 |
|--------|---------|---------|---------|---------|----------|
| 25%    | 98.2    | 109.6   | 144.1   | 191.5   | 254.6    |
| median | 113.8   | 120.1   | 156.0   | 201.5   | 272.6    |
| 75%    | 128.9   | 138.0   | 176.9   | 213.9   | 286.7    |

Table 3: Simulation Study I: The computation time of optimization (7) when $J$ increases from 200 to 1000. For each $J$, the 25%, 50% and 75% quantiles of the computation time from 100 independent experiments are shown.

4.2 Study II

Setting. We now consider a setting where $K^+ > K$. We take the same setting as in Study I, except that we set $K^+ = 3$ when fitting the MDU model. As in Study I, for each pair of $N$ and $J$, 100 independent datasets are generated. For each dataset, Algorithm 1 is applied similarly, using 10 random starting points and constraint constant $M = 1.5$.

Results. The results are given in Tables 4 through 6. Similar to Tables 1-3, these three tables show the results on partial distance matrix recovery, configuration recovery, and computation time, respectively. Compared with the results of Study I, the losses for the recovery of both the partial distance matrix and the configuration tend to be larger. This is due to the overfitting brought by adding unnecessary parameters to the model. The computation time also increases compared with that of Study I.

|        | J = 200 | J = 400 | J = 600 | J = 800 | J = 1000 |
|--------|---------|---------|---------|---------|----------|
| 25%    | 0.0734  | 0.0384  | 0.0261  | 0.0198  | 0.0159   |
| median | 0.0758  | 0.0390  | 0.0265  | 0.0200  | 0.0161   |
| 75%    | 0.0780  | 0.0398  | 0.0269  | 0.0204  | 0.0163   |

Table 4: Simulation Study II: The average squared Frobenius loss of partial distance when $J$ increases from 200 to 1000. For each $J$, the table shows the 25%, 50% and 75% quantiles of the loss based on 100 independent experiments.

|        | J = 200 | J = 400 | J = 600 | J = 800 | J = 1000 |
|--------|---------|---------|---------|---------|----------|
| 25%    | 0.0853  | 0.0568  | 0.0452  | 0.0386  | 0.0343   |
| median | 0.0862  | 0.0573  | 0.0455  | 0.0390  | 0.0345   |
| 75%    | 0.0877  | 0.0580  | 0.0459  | 0.0392  | 0.0346   |

Table 5: Simulation Study II: The average loss for configuration recovery when $J$ increases from 200 to 1000. For each $J$, the table shows the 25%, 50% and 75% quantiles of the loss based on 100 independent experiments.

|        | J = 200 | J = 400 | J = 600 | J = 800 | J = 1000 |
|--------|---------|---------|---------|---------|----------|
| 25%    | 106.3   | 264.0   | 639.3   | 1294.0  | 2302.5   |
| median | 110.0   | 286.5   | 698.8   | 1407.7  | 2480.8   |
| 75%    | 112.9   | 308.9   | 793.8   | 1551.0  | 2841.1   |

Table 6: Simulation Study II: The computation time of optimization (7) when $J$ increases from 200 to 1000. For each $J$, the 25%, 50% and 75% quantiles of the computation time from 100 independent experiments are shown.
5. Real Examples

5.1 Example I: Movie Data

Background. We apply MDU to a movie rating dataset from the well-known MovieLens project (see e.g., Harper and Konstan, 2016). The dataset analyzed in this paper is a subset of a benchmark MovieLens dataset collected during a seven-month period from September 1997 through April 1998.³ This subset contains 943 users and 338 movies, obtained by selecting movies that have been rated by at least 100 users. Unlike many analyses of MovieLens data that focus on the rating scores, we consider unfolding the rating behavior itself (i.e., rated/not rated), which may also reveal the users' preference patterns. More precisely, we let $Y_{ij} = 1$ if movie $j$ has been rated by user $i$ and $Y_{ij} = 0$ otherwise.

3. The dataset can be downloaded from https://grouplens.org/datasets/movielens/100k/

Analysis. For visualization purposes, we unfold the data onto a two-dimensional space. To apply the MDU model introduced in this paper, we need to specify the link function $f$. We assume $f$ takes the logistic form $f(x) = 2/(1 + \exp(x + \delta))$, where $\delta$ is a pre-specified small positive constant. For any $\delta > 0$, it is easy to check that the regularity condition A4 is satisfied. The results presented below are based on the choice $\delta = 0.1$, but we point out that other choices of $\delta$ ($\delta = 0.05, 0.15, 0.2$) have also been tried, all leading to very similar results. The constraint constant $M$ is set to 3.5 when applying Algorithm 1. After obtaining the estimate, we transform the estimated ideal points by an isometry mapping, so that the x-axis corresponds to the dimension along which the estimated movie ideal points have the highest variance. As will be described in the sequel, under this isometry mapping of the estimated ideal points, both the x- and y-axes receive good interpretations.

Results. The results from the MDU analysis are presented in Figures 4 through 6. Figure 4 jointly visualizes the estimated movie and user points. As we can see, the movies and the users tend to form two giant clusters that only slightly overlap.

Figure 4: Analysis of movie rating data: Simultaneous visualization of the estimated movie and user points.

We first investigate the movie points. First, the y-axis of the space largely indicates, if not perfectly, the popularity of the movies. Movies with a smaller $\hat a_{j2}$ value tend to be rated more frequently. Roughly speaking, the shorter the average distance from a movie to the user points, the more often the movie is rated. In fact, the Kendall's tau rank correlation between the $\hat a_{j2}$s and the numbers of ratings received by the movies is 0.66. This phenomenon is further reflected by panel (a) of Figure 5, where movies are stratified by the numbers of ratings they received into four categories. These four categories tend to be ordered along the y-axis. We list four movies as examples, as indicated in panel (a) of Figure 5. From the top to the bottom, they are Batman Forever (1995), GoldenEye (1995), Get Shorty (1995) and The Godfather (1972), respectively. Based on our interpretation of the y-axis, these four movies are ordered from the least popular to the most popular.

Second, the x-axis of the space seems to indicate the release time of the movies. The Kendall's tau correlation between the $\hat a_{j1}$s and the release dates of the movies is -0.70.
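Rank correlations like these are easy to reproduce with SciPy; a sketch with hypothetical stand-in data (in an actual analysis, `a_hat` would hold the estimated movie points and `n_ratings` the per-movie rating counts):

```python
import numpy as np
from scipy.stats import kendalltau

# Hypothetical stand-ins for the estimated movie points and rating counts
rng = np.random.default_rng(0)
a_hat = rng.normal(size=(338, 2))
n_ratings = rng.integers(100, 600, size=338)

tau, _ = kendalltau(a_hat[:, 1], n_ratings)   # y-coordinate vs. popularity
print(f"Kendall's tau = {tau:.2f}")
```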
This is further shown in panel (b) of Figure 5, where the movies are stratified into three categories, namely "before 1995", "1995-1996", and "1997-1998". According to this figure, the clustering pattern of the movies can be largely explained by the three categories based on the movie release dates. From the right to the left of the space, the points correspond to movies from the relatively older ones to the relatively more recent ones. For example, the three movies indicated in panel (b) of Figure 5 are, from right to left, Citizen Kane (1941), Twelve Monkeys (1995) and The Devil's Own (1997), respectively.

Figure 5: Analysis of movie rating data. Panel (a): Visualization of movie points, with movies stratified into four equal-size categories based on the numbers of ratings. Movies with at most 127, 128-169, 170-229, and at least 230 ratings are indicated by black, red, green and blue points, respectively. Yellow points represent example movies. Panel (b): Visualization of movie points, with movies stratified into three categories based on their release time. Movies released in 1997-1998, 1995-1996, and before 1995 are indicated by green, red and black points, respectively. Purple points represent example movies.

The interpretation of the latent space based on the movies facilitates the interpretation of the user points. First, the y-axis corresponds to the users' activeness. Roughly speaking, the shorter the average distance from a user point to the movie points, the more active the user is. The Kendall's tau rank correlation between the $\hat\theta_{i2}$s and the numbers of ratings given by the users is 0.73. This is further shown via Figure 6, where users are classified into four equal-size groups depending on the number of movies they rated. These groups of users, from the most active one to the least active one, lie from the top to the bottom. Second, based on the alignment of movies along the x-axis, the user points from right to left may be interpreted as ranging from the ones who tend to more frequently rate relatively older movies to the ones who tend to more frequently rate relatively more recent ones.

Figure 6: Analysis of movie rating data: Visualization of user points, with users classified into four equal-size categories based on the numbers of ratings. Users who rated at most 24, 25-47, 48-103, and at least 104 movies are indicated by black, red, green and blue points, respectively.

5.2 Example II: Senate Roll Call Voting Data

Background. We now analyze a senate roll call voting dataset from the 108th congress. This dataset contains the voting records of 100 senators on 675 roll calls in the years 2003 and 2004.⁴ Among the 100 senators, there are 48 from the Democratic party, 51 from the Republican party, and one independent politician. For each roll call $j$, the vote of senator $i$ is recorded in three ways, "Yea", "Nay" and "Not Voting", treated as $Y_{ij} = 1$, $Y_{ij} = 0$, and missing, respectively.

4. The dataset can be downloaded from https://legacy.voteview.com/dwnl.htm.

Analysis. A similar analysis to the previous one is conducted. Specifically, we unfold the data into a two-dimensional space. The same link function $f$ and constraint constant $M$ are adopted as in the analysis of the movie data. After obtaining the estimate, we transform the estimated ideal points by an isometry mapping, so that the x-axis corresponds to the dimension along which the estimated senator ideal points have the highest variance.
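The post-hoc isometry used in both analyses (rotating so that the x-axis is the highest-variance direction of one set of points) can be computed from an SVD. A sketch (our own illustration):

```python
import numpy as np

def rotate_to_principal_axes(all_pts, ref_pts):
    """Center at the mean of the reference set and rotate so that the x-axis
    is the highest-variance direction of ref_pts. Centering followed by an
    orthogonal map is an isometry, so all distances are unchanged."""
    mu = ref_pts.mean(0)
    _, _, Vt = np.linalg.svd(ref_pts - mu, full_matrices=False)
    return (all_pts - mu) @ Vt.T

# e.g., rotate senators and roll calls using the senator points as reference:
# pts = rotate_to_principal_axes(np.vstack([theta_hat, a_hat]), theta_hat)
```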
Results. The results are presented in Figures 7 through 9. In Figure 7, the ideal points of both roll calls and senators are visualized simultaneously. As we can see, most of the roll calls and all the senators tend to lie around a one-dimensional line. This visualization is still valid, in the sense that even when the true latent dimension is one, unfolding the data in a two-dimensional space is still consistent according to Theorem 14. This phenomenon of degeneration is quite consistent with the overall unidimensional pattern in congressional voting data throughout history. It has been well recognized in the political science literature (Poole et al., 1991; Poole and Rosenthal, 1991) that senate voting behavior is essentially unidimensional, though slightly different latent space models are used in that literature. For example, Poole et al. (1991) concluded that to the extent that congressional voting can be described by a spatial model, a unidimensional model is largely (albeit not entirely) sufficient.

Figure 7: Analysis of senator roll call data: Simultaneous visualization of the estimated senator and roll call ideal points.

We first interpret the senators. In Figure 8, all the senator points are visualized with their party membership indicated by different point types. In Table 7, we rank the senators based on their values of $\hat\theta_{i1}$, i.e., their positions along the x-axis. According to this table, the Democrats tend to lie on the left and the Republicans tend to lie on the right. In fact, this ranking is largely consistent with National Journal's liberalness ranking of the senators in 2003. National Journal's ranking result, which is replicated in Clinton et al. (2004b), is obtained by unfolding the senators' votes on 62 key roll calls using a model given in Clinton et al. (2004a). The Kendall's tau rank correlation between the result in Table 7 and that given by National Journal is 0.79. In fact, Senator John Kerry is ranked the most liberal by both our model and National Journal, and Senator Craig L. Thomas, who is the most conservative senator according to the ranking of National Journal, is the third most conservative senator according to our model.

From Figure 8 and Table 7, it is also worth noting that there is a Democrat whose estimated ideal point is mixed together with those of the Republicans. This senator is Zell Miller from the state of Georgia. He is a conservative Democrat and, in fact, he supported Republican President George W. Bush against the Democratic nominee John Kerry in the 2004 presidential election. In this congress, there is an independent senator, Jim Jeffords from the state of Vermont, who does not belong to either of the two major parties. As we can see from both Figure 8 and Table 7, his ideal point lies on the left, mixed with many ideal points of the Democrats. This is also consistent with Senator Jim Jeffords' political standing. In fact, he left the Republican party to become an independent in 2001 and began caucusing with the Democrats.
| Rank | Name | State | Rank | Name | State | Rank | Name | State |
|---|---|---|---|---|---|---|---|---|
| 1 | Kerry | D-MA | 35 | Johnson | D-SD | 69 | Grassley | R-IO |
| 2 | Sarbanes | D-MD | 36 | Lieberman | D-CT | 70 | Bond | R-MO |
| 3 | Reed | D-RH | 37 | Bingaman | D-NM | 71 | Roberts | R-KA |
| 4 | Harkin | D-IO | 38 | Nelson | D-FL | 72 | Gregg | R-NH |
| 5 | Graham | D-FL | 39 | Dorgan | D-ND | 73 | Allen | R-VI |
| 6 | Lautenberg | D-NJ | 40 | Conrad | D-ND | 74 | Domenici | R-NM |
| 7 | Edwards | D-NC | 41 | Carper | D-DE | 75 | Bennett | R-UT |
| 8 | Kennedy | D-MA | 42 | Pryor | D-AR | 76 | Dole | R-NC |
| 9 | Durbin | D-IL | 43 | Bayh | D-IN | 77 | Frist | R-TN |
| 10 | Levin | D-MI | 44 | Lincoln | D-AR | 78 | Brownback | R-KA |
| 11 | Akaka | D-HA | 45 | Landrieu | D-LO | 79 | Hatch | R-UT |
| 12 | Byrd | D-WE | 46 | Baucus | D-MT | 80 | Cochran | R-MS |
| 13 | Boxer | D-CA | 47 | Breaux | D-LO | 81 | Graham | R-SC |
| 14 | Corzine | D-NJ | 48 | Nelson | D-NE | 82 | Alexander | R-TN |
| 15 | Clinton | D-NY | 49 | Chafee | R-RH | 83 | Lott | R-MS |
| 16 | Leahy | D-VE | 50 | Snowe | R-ME | 84 | Chambliss | R-GE |
| 17 | Dodd | D-CT | 51 | Collins | R-ME | 85 | Burns | R-MT |
| 18 | Stabenow | D-MI | 52 | Specter | R-PE | 86 | Bunning | R-KE |
| 19 | Mikulski | D-MD | 53 | Mccain | R-AZ | 87 | Crapo | R-ID |
| 20 | Feingold | D-WI | 54 | Dewine | R-OH | 88 | Mcconnell | R-KE |
| 21 | Rockefeller | D-WE | 55 | Campbell | R-CO | 89 | Ensign | R-NV |
| 22 | Hollings | D-SC | 56 | Smith | R-OR | 90 | Cornyn | R-TX |
| 23 | Kohl | D-WI | 57 | Coleman | R-MN | 91 | Sununu | R-NH |
| 24 | Inouye | D-HA | 58 | Warner | R-VI | 92 | Santorum | R-PE |
| 25 | Schumer | D-NY | 59 | Murkowski | R-AK | 93 | Craig | R-ID |
| 26 | Cantwell | D-WA | 60 | Voinovich | R-OH | 94 | Inhofe | R-OK |
| 27 | Dayton | D-MN | 61 | Hutchison | R-TX | 95 | Allard | R-CO |
| 28 | Murray | D-WA | 62 | Lugar | R-IN | 96 | Enzi | R-WY |
| 29 | Wyden | D-OR | 63 | Miller | D-GE | 97 | Sessions | R-AL |
| 30 | Daschle | D-SD | 64 | Fitzgerald | R-IL | 98 | Thomas | R-WY |
| 31 | Biden | D-DE | 65 | Talent | R-MO | 99 | Kyl | R-AZ |
| 32 | Feinstein | D-CA | 66 | Hagel | R-NE | 100 | Nickles | R-OK |
| 33 | Jeffords | I-VE | 67 | Stevens | R-AK | | | |
| 34 | Reid | D-NV | 68 | Shelby | R-AL | | | |

Table 7: Analysis of senator roll call data: Ranking of senators based on $\hat\theta_{i1}$.

Figure 8: Analysis of senator roll call data: Visualization of senator points, where senators are classified by their party membership. Specifically, the Democrats, the Republicans, and an independent politician are indicated by blue, red, and green, respectively.

We now investigate the roll calls. The value of $\hat a_{j1}$, i.e., a roll call's coordinate on the x-axis, seems to represent the roll call's liberalness-conservativeness. The more liberal roll calls lie on the left and the more conservative ones lie on the right. This interpretation is further confirmed by the voting records for the roll calls. In particular, for each roll call, we calculate the proportion of Republicans among the senators who voted "Yea". A larger value of this proportion indicates that the roll call is more conservative. As we can see from panel (a) of Figure 9, for roll calls from the left to the right, this proportion increases. In fact, the Kendall's tau rank correlation between the $\hat a_{j1}$s and the proportions of "Yea" votes from Republicans is as high as 0.88.

We present the contents of three roll calls as representative examples. As indicated in panel (a) of Figure 9, these roll calls have substantially different coordinates along the x-axis. From left to right, they are (1) "To improve the availability of contraceptives for women", (2) "Confirmation Thomas J. Ridge, of Pennsylvania, to be Secretary of Homeland Security", and (3) "To provide financial security to family farm and small business owners by ending the unfair practice of taxing someone at death".

Although most of the roll calls lie near the x-axis (i.e., $\hat a_{j2} \approx 0$), there are still quite a few roll calls that spread out along the y-axis.
It seems that the voting on such roll calls is heterogeneous within both parties. Specifically, we measure the heterogeneity of voting within each party by a cross entropy measure, defined as

\[
\mathrm{CE}_j^{(i)} = -p_{jy}^{(i)} \log p_{jy}^{(i)} - p_{jn}^{(i)} \log p_{jn}^{(i)} - p_{jm}^{(i)} \log p_{jm}^{(i)},
\]

where $i = 1, 2$ indicates Democrat and Republican, respectively, and $p_{jy}^{(i)}$, $p_{jn}^{(i)}$, and $p_{jm}^{(i)}$ denote the proportions of "Yea", "Nay", and "Not Voting" within the party for the $j$th roll call. Cross entropy is a commonly used measure of heterogeneity (Chapter 9, Friedman et al., 2001). The larger the cross entropy, the more heterogeneous the voting behavior within a party. In panel (b) of Figure 9, we present the box plots of $\min\{\mathrm{CE}_j^{(1)}, \mathrm{CE}_j^{(2)}\}$, for roll calls lying near the x-axis ($|\hat a_{j2}| \leq 0.05$) and for those spreading out along the y-axis ($|\hat a_{j2}| > 0.05$). According to panel (b) of Figure 9, the roll calls in the latter group ($|\hat a_{j2}| > 0.05$) tend to have a larger value of $\min\{\mathrm{CE}_j^{(1)}, \mathrm{CE}_j^{(2)}\}$, implying that the voting tends to be more heterogeneous within both parties for these roll calls. The latter group contains roll calls such as "To provide for the distribution of funds under the infrastructure performance and maintenance program", "To enhance the role of Congress in the oversight of the intelligence and intelligence-related activities of the United States Government", and "To strike provisions relating to energy tax incentives". Many of such roll calls may be explained by constituency-specific factors.

Figure 9: Analysis of senator roll call data. Panel (a): Visualization of roll call points, where roll calls are classified by the proportion of "Yea" votes from Republicans. Specifically, roll calls with proportions less than 0.068, 0.068-0.52, 0.52-0.73, and larger than 0.73 are indicated by black, red, green and blue points, respectively. The yellow solid points are example roll calls discussed in the text. Panel (b): Box plots of $\min\{\mathrm{CE}_j^{(1)}, \mathrm{CE}_j^{(2)}\}$, for roll calls lying near the x-axis ($|\hat a_{j2}| \leq 0.05$) on the left and for those spreading out along the y-axis ($|\hat a_{j2}| > 0.05$) on the right.
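The within-party heterogeneity measure takes only a few lines of code. A sketch (our own, with a hypothetical vote encoding):

```python
import numpy as np

def cross_entropy(votes):
    """Within-party heterogeneity CE for one roll call. `votes` holds one
    entry per party member: 'y' (Yea), 'n' (Nay), or 'm' (Not Voting)."""
    votes = np.asarray(votes)
    props = np.array([(votes == c).mean() for c in ("y", "n", "m")])
    props = props[props > 0]              # treat 0 * log(0) as 0
    return -np.sum(props * np.log(props))

# The quantity plotted in panel (b) of Figure 9 for roll call j:
# min(cross_entropy(dem_votes_j), cross_entropy(rep_votes_j))
```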
6. Concluding Remarks

In this paper, we provide a statistical framework for studying unfolding-model-based visualization. An estimator, together with an algorithm for its computation, is proposed, whose performance is examined by simulation studies. Under reasonable conditions, we provide asymptotic results for the recovery of the ideal-point configuration. The proposed method is applied to two datasets, one on movie rating and the other on senator voting, for which interpretable results are obtained.

The ideal points obtained from the proposed method can be used in further analysis. For example, one can use the estimated person points as covariates in regression analysis. For another example, one may further conduct cluster analysis on the respondents and items, for example, by applying the K-means algorithm (MacQueen, 1967). In fact, as discussed in the supplementary material, there is a connection between our unfolding model and the stochastic co-blockmodel (Choi and Wolfe, 2014; Rohe et al., 2016) for bi-cluster analysis. When data follow a stochastic co-blockmodel, our consistency result for the unfolding model further guarantees the consistency of bi-cluster analysis.

The current analysis may be extended along multiple directions. First, the current analysis keeps the latent dimension K fixed. In fact, the theoretical results established in this paper can be generalized to a setting where K also diverges, a more appropriate setting for data of a very large scale. Second, it is possible to make statistical inference about the person and item ideal points, such as testing whether a person point is closer to one item point than to another. Making statistical inference under our model is closely related to statistical inference for low-rank matrix completion (see e.g., Chen et al., 2019; Xia and Yuan, 2019), but the non-linear link function in our model brings more challenges, and thus methods and theory remain to be developed. Third, although we focus on binary data in this paper, the proposed modeling framework, theory, and computational algorithm can be extended to other types of data, such as ratings and rankings. Finally, it may also be of interest to extend the current framework to the modeling and analysis of large-scale preferential choice data with informatively missing data entries.

Acknowledgement

The authors thank the editor and reviewers for their supportive and insightful comments. The authors also thank Prof. Ming Yuan for a fruitful discussion and Prof. Abdo Alfakih for suggesting references on Euclidean distance matrix completion. We would like to acknowledge support for this project from the National Academy of Education/Spencer Postdoctoral Fellowship and NSF grants DMS-2015417, SES-1826540, and IIS-1633360.

Appendix A. Bi-Cluster Analysis

The applications of multidimensional scaling, including multidimensional unfolding as a special case, are often followed by cluster analysis (e.g., Kruskal and Wish, 1978; Borg and Groenen, 2005) for better understanding and interpretation of the data visualization. In our context, it is often of interest to cluster the respondents and the items, respectively. This task is known as bi-clustering or co-clustering (Hartigan, 1972; Dhillon, 2001), which is often studied statistically under the stochastic co-blockmodel (Choi and Wolfe, 2014; Rohe et al., 2016), an extension of the widely used stochastic blockmodel (Holland et al., 1983).

Following multidimensional unfolding, it is natural to bi-cluster the respondents and the items based on the estimated ideal points, using the Euclidean distance as a natural measure of dissimilarity. In particular, we use the K-means algorithm (MacQueen, 1967) to cluster the respondents and the items into $k_1$ and $k_2$ clusters, respectively, for some pre-specified numbers of clusters $k_1$ and $k_2$. This two-step procedure for bi-cluster analysis is described in Algorithm 2, with a small computational sketch given after it.

Algorithm 2 (Two-step procedure for bi-cluster analysis)
Step 1: Apply Algorithm 1 and obtain estimates $\{\hat\theta_1, ..., \hat\theta_N, \hat a_1, ..., \hat a_J\}$.
Step 2: Apply the K-means algorithm to $\{\hat\theta_1, ..., \hat\theta_N\}$ and $\{\hat a_1, ..., \hat a_J\}$ with $k_1$ and $k_2$ clusters, respectively.
Output: The cluster memberships of respondents, $\hat\vartheta_i \in \{1, ..., k_1\}$, and of items, $\hat\upsilon_j \in \{1, ..., k_2\}$ ($i = 1, ..., N$; $j = 1, ..., J$).
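A minimal sketch of Step 2 of Algorithm 2, assuming the estimated ideal points from Algorithm 1 are already available as arrays `theta_hat` ($N \times K$) and `a_hat` ($J \times K$); the variable names and the use of scikit-learn are our choices, not the paper's.

```python
import numpy as np
from sklearn.cluster import KMeans

def bicluster(theta_hat, a_hat, k1, k2, seed=0):
    # K-means on respondents and on items separately, using the Euclidean
    # distance between estimated ideal points as the dissimilarity.
    resp = KMeans(n_clusters=k1, n_init=10, random_state=seed).fit(theta_hat)
    item = KMeans(n_clusters=k2, n_init=10, random_state=seed).fit(a_hat)
    return resp.labels_, item.labels_   # memberships in {0,...,k1-1} / {0,...,k2-1}

rng = np.random.default_rng(0)
theta_hat = rng.normal(size=(200, 2))   # placeholder for Step 1 output
a_hat = rng.normal(size=(80, 2))
resp_labels, item_labels = bicluster(theta_hat, a_hat, k1=3, k2=4)
```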
We provide a connection between the multidimensional unfolding model studied in this paper and the stochastic co-blockmodel. Consider a special case under the multidimensional unfolding model, where there are finitely many possible locations for the respondent ideal points and also for the item ideal points, independent of N and J. We denote the possible locations for the respondent ideal points as $\{b_1^*, ..., b_{k_1}^*\}$ and those for the item ideal points as $\{c_1^*, ..., c_{k_2}^*\}$. Under this setting, there exist $k_1$ respondent latent classes and $k_2$ item latent classes, regarding two respondents/items as from the same latent class when they have the same location. We denote by $\vartheta_i^* \in \{1, ..., k_1\}$ and $\upsilon_j^* \in \{1, ..., k_2\}$ the true latent class memberships of respondent $i$ and item $j$, respectively. In this sense, the model becomes a stochastic co-blockmodel, for which the distribution of $Y_{ij}$ is determined only by the latent class memberships of respondent $i$ and item $j$, and the $Y_{ij}$'s are conditionally independent given all the latent memberships of the respondents and items. In what follows, we show that the proportions of misclassified respondents and items converge to 0 in probability, when both N and J grow to infinity, if the K-means algorithms in Algorithm 2 have converged to the global optima.

Theorem 23 Suppose A0, A3, and A4 are satisfied, and $K^+ \ge K$. Further suppose the multidimensional unfolding model degenerates to a stochastic co-blockmodel, satisfying $\theta_i^* \in \{b_1^*, ..., b_{k_1}^*\}$ and $a_j^* \in \{c_1^*, ..., c_{k_2}^*\}$. If both K-means algorithms in Algorithm 2 converge to the global optima, then the clustering result satisfies

$$\min\left\{\max_{\zeta \in B_{k_1}} \frac{\sum_{i=1}^N 1_{\{\hat\vartheta_i = \zeta(\vartheta_i^*)\}}}{N},\ \max_{\zeta \in B_{k_2}} \frac{\sum_{j=1}^J 1_{\{\hat\upsilon_j = \zeta(\upsilon_j^*)\}}}{J}\right\} \to 1$$

in probability as both N and J grow to infinity, where $B_k$ denotes the set of all permutations on $\{1, ..., k\}$, for $k = k_1, k_2$.

Remark 24 To handle the label-switching indeterminacy in clustering, in the loss function (12) we find permutations that best match the true latent class memberships and their estimates, for both the respondents and the items.

Appendix B. Proof of Theoretical Results

B.1 Definitions and Notations

In this appendix, we use $c, C, C_1, C_2$ to represent constants that do not depend on $N, J$ and whose values may vary with the context. With a slight abuse of notation, we use $A_{N,J}$ to denote the specified events, which may differ across proofs. For $x \in \mathbb{R}^K$, we use $B_x^K(C)$ to denote the closed ball in $\mathbb{R}^K$ centered at $x$ with radius $C$. Unless otherwise specified, all balls in the appendix are assumed to be closed. For a set $G \subset \mathbb{R}^K$, let $\mathrm{int}(G)$ denote the set of all its interior points. For a positive integer $n$, we denote $[n] := \{1, ..., n\}$. We start with some notions that will be used in the proofs of the theorems, propositions, and lemmas.

Definition 25 For points $x_i, x_i' \in \mathbb{R}^K$, $i = 1, ..., n$, we write $(x_1, ..., x_n) \sim (x_1', ..., x_n')$ if there exists an isometry $F \in A_K$ such that $x_i' = F(x_i)$, $i = 1, ..., n$.

Remark 26 It is easy to show that $\sim$ is an equivalence relation.

Definition 27 (Configuration) We define an $n$-point configuration as an equivalence class. That is, we define a configuration $[x_1, ..., x_n] := \{(x_1', ..., x_n') : (x_1', ..., x_n') \sim (x_1, ..., x_n)\}$ as the equivalence class of $(x_1, ..., x_n)$.

Remark 28 By the property of isometry mappings, it is easy to see that all the elements in the same configuration have the same distance matrix.

We now consider the space of all $n$-point configurations in $\mathbb{R}^K$, denoted by $H_{n,K} := \{[x_1, ..., x_n] : x_i \in \mathbb{R}^K, i = 1, ..., n\}$. For two configurations $\tau_1 = [x_1, ..., x_n], \tau_2 = [y_1, ..., y_n] \in H_{n,K}$, we define

$$d(\tau_1, \tau_2) := \inf_{F \in A_K} \Big(\sum_{1 \le i \le n} \|F(x_i) - y_i\|^2\Big)^{1/2}.$$

First, we note that $d(\cdot, \cdot)$ is a well-defined mapping from $H_{n,K} \times H_{n,K}$ to $\mathbb{R}$. That is, for any $(x_1', ..., x_n') \in [x_1, ..., x_n]$ and $(y_1', ..., y_n') \in [y_1, ..., y_n]$,

$$\inf_{F \in A_K} \Big(\sum_{1 \le i \le n} \|F(x_i) - y_i\|^2\Big)^{1/2} = \inf_{F \in A_K} \Big(\sum_{1 \le i \le n} \|F(x_i') - y_i'\|^2\Big)^{1/2}.$$
Second, we notice that $d(\cdot, \cdot)$ is a metric on $H_{n,K}$, as summarized in Lemma 29 below.

Lemma 29 $d(\cdot, \cdot)$ is a metric on $H_{n,K}$.

Remark 30 For $[x_1, ..., x_n] \in H_{n,K}$, we have $[(x_1^\top, 0)^\top, ..., (x_n^\top, 0)^\top] \in H_{n,K+1}$, in which sense we can say $H_{n,K} \subset H_{n,K+1}$. Thus $H_{n,K_1} \subset H_{n,K_2}$ if $K_1 \le K_2$. For $\tau_1 = [x_1, ..., x_n] \in H_{n,K_1}$ and $\tau_2 = [y_1, ..., y_n] \in H_{n,K_2}$, the distance $d(\tau_1, \tau_2)$ is defined in the same way by viewing both $\tau_1$ and $\tau_2$ as elements of $H_{n,\max\{K_1,K_2\}}$.

We further denote by $P_{a,b,K}$ the set of $a \times b$ partial distance matrices of configurations in $\mathbb{R}^K$:

$$P_{a,b,K} := \big\{(\|x_i - y_j\|^2)_{a \times b} : [x_1, ..., x_a, y_1, ..., y_b] \in H_{a+b,K}\big\}.$$

It is easy to check that $P_{a,b,K} \subset P_{a,b,K+1}$. For $A_1, ..., A_n \subset \mathbb{R}^K$, denote by $[A_1, ..., A_n]$ the subset of $H_{n,K}$:

$$[A_1, ..., A_n] := \{[x_1, ..., x_n] : x_i \in A_i, i = 1, ..., n\}.$$

For $\mathcal{A}, \mathcal{B} \subset H_{n,K}$, the distance between $\mathcal{A}$ and $\mathcal{B}$ is defined as

$$d(\mathcal{A}, \mathcal{B}) := \inf_{\tau_1 \in \mathcal{A}, \tau_2 \in \mathcal{B}} d(\tau_1, \tau_2). \quad (13)$$

We further denote $H_{n,K,C} := \{[x_1, ..., x_n] \in H_{n,K} : \|x_i\| \le C\}$, a compact subset of $H_{n,K}$, and

$$P_{a,b,K,C} := \big\{(\|x_i - y_j\|^2)_{a \times b} : [x_1, ..., x_a, y_1, ..., y_b] \in H_{a+b,K,C}\big\}, \quad (14)$$

a compact subset of $P_{a,b,K}$. We consider a mapping defined as follows:

$$\Phi_{a,b,K} : \mathbb{R}^{(a+b) \times K} \to P_{a,b,K}, \quad (x_1, ..., x_{a+b}) \mapsto D,$$

where $D$ is the $a \times b$ partial distance matrix of $\{(x_1, ..., x_a), (x_{a+1}, ..., x_{a+b})\}$. It is not difficult to check that $\Phi_{a,b,K}$ is invariant with respect to isometries. Then, for $\tau = [x_1, ..., x_{a+b}]$, we denote $\Phi_{a,b,K}(\tau) := \Phi_{a,b,K}(X)$, where $X = (x_1, ..., x_{a+b})$.

Having introduced the notions above, we give the following lemma, which is crucial to the proof of Theorem 6. It essentially shows that if a partial distance matrix $D' \in P_{k_1,k_2,K^+,M}$ approximates another partial distance matrix $D \in P_{k_1,k_2,K,M}$ whose configuration $\tau$ contains a collection of anchor points, then any configuration $\tau'$ of $D'$ also approximates $\tau$.

Lemma 31 For compact subsets $B_1, ..., B_{k_1}, C_1, ..., C_{k_2} \subset B_0^K(M)$, let $\mathcal{B} = [B_1, ..., B_{k_1}, C_1, ..., C_{k_2}]$. Suppose that for any $(x_1, ..., x_{k_1+k_2}) \in B_1 \times \cdots \times B_{k_1} \times C_1 \times \cdots \times C_{k_2}$, the sets $\{x_1, ..., x_{k_1}\}$ and $\{x_{k_1+1}, ..., x_{k_1+k_2}\}$ are a collection of anchor points in $\mathbb{R}^K$. Then, for any $\epsilon_c > 0$, there exists $\epsilon_d > 0$ such that for any $\tau' \in H_{k_1+k_2,K^+,M}$ and $\tau \in \mathcal{B}$ satisfying $\|\Phi_{k_1,k_2,K^+}(\tau') - \Phi_{k_1,k_2,K^+}(\tau)\|_F < \epsilon_d$, we have $d(\tau', \tau) < \epsilon_c$.

We end this section with the following lemma, which will also be used in the proof of Theorem 6.

Lemma 32 Suppose $\{b_1^*, ..., b_{k_1}^*\}, \{c_1^*, ..., c_{k_2}^*\} \subset B_0^K(C)$ are a collection of anchor points in $\mathbb{R}^K$. Then, for any $x \in B_0^K(C)$, the sets $\{x, b_1^*, ..., b_{k_1}^*\}, \{c_1^*, ..., c_{k_2}^*\}$ are also a collection of anchor points in $\mathbb{R}^K$.

B.2 Proof of Theorems

Proof [Proof of Theorem 6] We first show the proof of (5). For the $\epsilon$ given in condition A3, there exist a constant $p_\epsilon \in (0, 1)$ and balls of radius $\epsilon$ in $\mathbb{R}^{K^+}$, denoted by $B_1(\epsilon), ..., B_{k_1}(\epsilon), G_1(\epsilon), ..., G_{k_2}(\epsilon)$, such that for $N, J$ large enough,

$$\frac{\sum_{l=1}^N 1_{\{\theta_l^* \in B_{b_i^*}(\epsilon),\ \tilde\theta_l \in B_i(\epsilon)\}}}{N} > p_\epsilon,\ i = 1, ..., k_1, \quad \frac{\sum_{l=1}^J 1_{\{a_l^* \in B_{c_i^*}(\epsilon),\ \tilde a_l \in G_i(\epsilon)\}}}{J} > p_\epsilon,\ i = 1, ..., k_2.$$

This comes straightforwardly from condition A0 and requirement (2) of anchor points in condition A3. Note that the centers of $B_k(\epsilon)$ and $G_l(\epsilon)$ may vary with $N, J$. We also use $B_k^*(\epsilon)$ and $G_l^*(\epsilon)$ to denote $B_{b_k^*}(\epsilon)$ and $B_{c_l^*}(\epsilon)$, respectively. We first focus on the set of person points

$$I_1(\epsilon) := \bigcup_{k=1}^{k_1} \{i \in [N] : \theta_i^* \in B_k^*(\epsilon),\ \tilde\theta_i \in B_k(\epsilon)\}$$

and the set of item points

$$I_2(\epsilon) := \bigcup_{l=1}^{k_2} \{j \in [J] : a_j^* \in G_l^*(\epsilon),\ \tilde a_j \in G_l(\epsilon)\}.$$

Let $\theta_i^+ = ((\theta_i^*)^\top, 0^\top)^\top$ and $a_j^+ = ((a_j^*)^\top, 0^\top)^\top \in \mathbb{R}^{K^+}$. We will show that there exists an isometry mapping $F_{N,J} \in A_{K^+}$ under which $F_{N,J}(\tilde\theta_i) \approx \theta_i^+$ and $F_{N,J}(\tilde a_j) \approx a_j^+$, for all $i \in I_1(\epsilon)$ and $j \in I_2(\epsilon)$. This is formalized in the following lemma.
Lemma 33 For $N, J$ large enough, there exists an isometry $F_{N,J} \in A_{K^+}$ such that $\|F_{N,J}(x)\| \le 4M$ for all $x \in B_0^{K^+}(M)$, and, for all $i \in I_1(\epsilon)$ and all $j \in I_2(\epsilon)$,

$$\|F_{N,J}(\tilde\theta_i) - \theta_i^+\| \le 5\epsilon \quad \text{and} \quad \|F_{N,J}(\tilde a_j) - a_j^+\| \le 5\epsilon.$$

We then show that for most of the person points $i \notin I_1(\epsilon)$ and most of the item points $j \notin I_2(\epsilon)$, we still have $F_{N,J}(\tilde\theta_i) \approx \theta_i^+$ and $F_{N,J}(\tilde a_j) \approx a_j^+$, under the same isometry mapping $F_{N,J}$ as in Lemma 33. This is formalized in Lemma 34 below.

Lemma 34 For $N, J$ large enough, there exists a constant $\kappa > 0$ such that, for the isometry mapping $F_{N,J}$ defined in Lemma 33, the proportions

$$\lambda_{1,N,J} := \frac{\sum_{i=1}^N 1_{\{\|F_{N,J}(\tilde\theta_i) - \theta_i^+\| > \kappa\epsilon\}}}{N} \quad \text{and} \quad \lambda_{2,N,J} := \frac{\sum_{j=1}^J 1_{\{\|F_{N,J}(\tilde a_j) - a_j^+\| > \kappa\epsilon\}}}{J}$$

satisfy

$$\lambda_{k,N,J} \to 0, \quad (15)$$

for $k = 1, 2$, as $N, J$ grow to infinity.

Since, by Lemma 33, $F_{N,J}$ maps $B_0^{K^+}(M)$ into $B_0^{K^+}(4M)$, we have, for all $\tilde\theta_i$ and all $\tilde a_j$, $\|F_{N,J}(\tilde\theta_i) - \theta_i^+\| \le 5M$ and $\|F_{N,J}(\tilde a_j) - a_j^+\| \le 5M$. Combining this with Lemma 34, we have

$$\inf_{F \in A_{K^+}} \left(\frac{\sum_{i=1}^N \|F(\tilde\theta_i) - \theta_i^+\|^2}{N} + \frac{\sum_{j=1}^J \|F(\tilde a_j) - a_j^+\|^2}{J}\right) \le \frac{\sum_{i=1}^N \|F_{N,J}(\tilde\theta_i) - \theta_i^+\|^2}{N} + \frac{\sum_{j=1}^J \|F_{N,J}(\tilde a_j) - a_j^+\|^2}{J} \le 25M^2\lambda_{1,N,J} + \kappa^2\epsilon^2 + 25M^2\lambda_{2,N,J} + \kappa^2\epsilon^2 = 25M^2(\lambda_{1,N,J} + \lambda_{2,N,J}) + 2\kappa^2\epsilon^2.$$

By (15), (5) holds. Then (6) holds because $\epsilon$ can be made arbitrarily small. We complete the proof.

Proof [Proof of Theorem 12] Combining Theorem 6 and Proposition 11, we have the result.

Proof [Proof of Theorem 14] Theorem 14 is a special case of Proposition 22. See the proof of Proposition 22.

Proof [Proof of Theorem 23] For simplicity of writing, we suppose $K^+ = K$ in this proof. We only prove the result for the respondents; the proof for the items is the same. Under the conditions of Theorem 23, the result of Theorem 12 is satisfied, and, with a slight change in the proof, we can get

$$\inf_{F \in A_K} \frac{\sum_{i=1}^N \|\hat\theta_i - F(b_{\vartheta_i^*}^*)\|^2}{N} = o_p(1).$$

Consequently, there exists an isometry $F_{N,J}^*$ such that

$$\frac{\sum_{i=1}^N \|\hat\theta_i - F_{N,J}^*(b_{\vartheta_i^*}^*)\|^2}{N} = o_p(1), \quad (17)$$

noting that $b_{\vartheta_i^*}^* = \theta_i^*$.

Lemma 35 Under the same conditions as Theorem 23, suppose that

$$\frac{\sum_{i=1}^N \|\hat\theta_i - F_{N,J}^*(b_{\vartheta_i^*}^*)\|^2}{N} = o_p(1).$$

Then we have

$$1 - \max_{\zeta \in B_{k_1}} \frac{\sum_{i=1}^N 1_{\{\vartheta_i^* = \zeta(\hat\vartheta_i)\}}}{N} = o_p(1).$$

With Lemma 35, we complete the proof for the respondents.

B.3 Proof of Propositions

Proof [Proof of Proposition 3] It suffices to prove the case where $\sum_{i=1}^n x_i = 0$ and $\sum_{i=1}^n y_i = 0$. Denote $D = (d_{ij})_{n \times n}$, where $d_{ij} = \|x_i - x_j\|^2 = \|y_i - y_j\|^2$, and let $B = (b_{ij})_{n \times n} = -\frac{1}{2} J D J$, where $J = I_n - 1_n 1_n^\top / n$. Then $B$ is the inner product matrix of both $\{x_1, ..., x_n\}$ and $\{y_1, ..., y_n\}$. That is, $b_{ij} = x_i^\top x_j = y_i^\top y_j$ for $1 \le i, j \le n$. We refer readers to Critchley (1988) for the relation between inner product matrices and distance matrices. So if we denote $P_1 = (x_1, ..., x_n)^\top$ and $P_2 = (y_1, ..., y_n)^\top$, then we have $P_1 P_1^\top = P_2 P_2^\top = B$. Let $P_1^\top = Q_1 R_1$ and $P_2^\top = Q_2 R_2$ be the QR decompositions (see Cheney and Kincaid, 2009) of $P_1^\top$ and $P_2^\top$, where $Q_1, Q_2$ are $k \times k$ orthogonal matrices and $R_1, R_2$ are $k \times n$ upper-triangular matrices with non-negative diagonal entries. Since $x_i^\top x_j = y_i^\top y_j$ for $1 \le i, j \le n$, it is not difficult to check that $R_1 = R_2$. If we define $O = Q_2 Q_1^\top$, then

$$O P_1^\top = O Q_1 R_1 = Q_2 Q_1^\top Q_1 R_1 = Q_2 R_1 = Q_2 R_2 = P_2^\top,$$

which means $O x_i = y_i$ for $1 \le i \le n$. We complete the proof.
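A small numerical illustration of the double-centering step used in the proof of Proposition 3: for centered points, the inner product matrix $B = -\frac{1}{2} J D J$ recovers the configuration from the squared distance matrix $D$ up to an orthogonal transformation. This is a sketch under our own naming; the Procrustes step at the end is one convenient way to exhibit the orthogonal map $O$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 2
X = rng.normal(size=(n, k))
X -= X.mean(axis=0)                                  # center, as assumed in the proof
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distance matrix

J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ D @ J                                 # inner product matrix, B = X X^T
w, V = np.linalg.eigh(B)                             # eigenvalues in ascending order
Y = V[:, -k:] * np.sqrt(np.clip(w[-k:], 0, None))    # one configuration with Y Y^T = B

# Y matches X up to an orthogonal map O, found here by orthogonal Procrustes
U, _, Vt = np.linalg.svd(Y.T @ X)
O = U @ Vt
print(np.allclose(Y @ O, X, atol=1e-6))              # True: O maps Y onto X
```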
Proof [Proof of Proposition 8] We first introduce the following lemma.

Lemma 36 There exists a collection of anchor points $\{b_1^*, ..., b_{k_1}^*\}, \{c_1^*, ..., c_{k_2}^*\} \subset \mathrm{int}(G)$, where $G$ is the ball defined in Proposition 8.

We fix such a collection of anchor points. For any $\epsilon > 0$, we denote by $B_k^*(\epsilon)$ and $G_l^*(\epsilon)$, for $1 \le k \le k_1$ and $1 \le l \le k_2$, balls of radius $\epsilon$ centered at $b_k^*$ and $c_l^*$, respectively. For sufficiently small $\epsilon > 0$, it is easy to see that for any $b_1 \in B_1^*(\epsilon), ..., b_{k_1} \in B_{k_1}^*(\epsilon)$ and $c_1 \in G_1^*(\epsilon), ..., c_{k_2} \in G_{k_2}^*(\epsilon)$, the sets $\{b_1, ..., b_{k_1}\}, \{c_1, ..., c_{k_2}\}$ are a collection of anchor points in $\mathbb{R}^K$. Therefore, requirement (1) of A3 holds. We define

$$\beta_\epsilon := \frac{1}{2} \min_{1 \le k \le k_1,\ 1 \le l \le k_2} \{P_1(B_k^*(\epsilon)), P_2(G_l^*(\epsilon))\}$$

and use $A_{N,J}$ to denote the event

$$\frac{1}{N}\sum_{i=1}^N 1_{\{\theta_i^* \in B_k^*(\epsilon)\}} \ge P_1(B_k^*(\epsilon)) - \beta_\epsilon,\ k = 1, ..., k_1, \quad \frac{1}{J}\sum_{j=1}^J 1_{\{a_j^* \in G_l^*(\epsilon)\}} \ge P_2(G_l^*(\epsilon)) - \beta_\epsilon,\ l = 1, ..., k_2, \quad (18)$$

where $P_1(B_k^*(\epsilon))$ and $P_2(G_l^*(\epsilon))$ denote the probability measures of $B_k^*(\epsilon)$ and $G_l^*(\epsilon)$ with respect to $P_1$ and $P_2$, respectively. By Hoeffding's inequality, we have

$$\Pr((18)\ \text{holds}) \ge 1 - 2k_1 \exp(-\tfrac{1}{2}N\beta_\epsilon^2) - 2k_2 \exp(-\tfrac{1}{2}J\beta_\epsilon^2). \quad (19)$$

So $\Pr(A_{N,J}) \to 1$ as $N, J$ grow. On $A_{N,J}$, we have

$$\frac{1}{N}\sum_{i=1}^N 1_{\{\theta_i^* \in B_k^*(\epsilon)\}} \ge \beta_\epsilon,\ 1 \le k \le k_1, \quad \frac{1}{J}\sum_{j=1}^J 1_{\{a_j^* \in G_l^*(\epsilon)\}} \ge \beta_\epsilon,\ 1 \le l \le k_2; \quad (20)$$

that is, (20) holds on $A_{N,J}$. Then requirement (2) of A3 holds almost surely.

Proof [Proof of Proposition 11] Proposition 11 is a special case of Proposition 20. See the proof of Proposition 20.

Proof [Proof of Proposition 20] The proof of Proposition 20 is similar to that of Theorem 1 of Davenport et al. (2014). We only state the main steps. We denote by $D$ the partial distance matrix of $(\theta_1, ..., \theta_N)$ and $(a_1, ..., a_J)$ (to simplify the notation, we omit the subscripts $N$ and $J$ of $D$). Since the likelihood function depends on $(\theta_1, ..., \theta_N)$ and $(a_1, ..., a_J)$ only through their partial distance matrix, we re-parameterize the likelihood function by $D$. We denote $l_{\Omega,Y}(D) = \log L_\Omega(\theta_1, ..., \theta_N, a_1, ..., a_J)$, where the subscripts $\Omega = (\omega_{ij})_{N \times J}$ and $Y = (Y_{ij})_{N \times J}$ indicate the random variables in the likelihood function and $D$ contains the parameters. Let

$$\bar l_{\Omega,Y}(D) = l_{\Omega,Y}(D) - l_{\Omega,Y}(0), \quad (21)$$

where $0$ represents the $N \times J$ matrix whose elements are all 0, and let

$$\mathcal{G} = \big\{D \in \mathbb{R}^{N \times J} : \|D\|_* \le 4M^2\sqrt{(K^+ + 2)NJ}\big\}. \quad (22)$$

Lemma 37 Under the same conditions as Proposition 20, there exist constants $C_1$ and $C_2$ such that

$$\Pr\left(\sup_{D \in \mathcal{G}} |\bar l_{\Omega,Y}(D) - \mathbb{E}\,\bar l_{\Omega,Y}(D)| \ge 4M^2 C_1 L_{4M^2}\sqrt{(K^+ + 2)\big(n(N + J) + NJ\log(NJ)\big)}\right) \le \frac{C_2}{N + J}.$$

Let $\mathcal{H} = \{D : d_{ij} = \|\theta_i - a_j\|^2, \text{ where } \|\theta_i\|, \|a_j\| \le M,\ i = 1, ..., N,\ j = 1, ..., J\}$. It is easy to check that $\mathcal{H} \subset \mathcal{G}$. Consequently,

$$\Pr\left(\sup_{D \in \mathcal{H}} |\bar l_{\Omega,Y}(D) - \mathbb{E}\,\bar l_{\Omega,Y}(D)| \ge 4M^2 C_1 L_{4M^2}\sqrt{(K^+ + 2)\big(n(N + J) + NJ\log(NJ)\big)}\right) \le \Pr\left(\sup_{D \in \mathcal{G}} |\bar l_{\Omega,Y}(D) - \mathbb{E}\,\bar l_{\Omega,Y}(D)| \ge 4M^2 C_1 L_{4M^2}\sqrt{(K^+ + 2)\big(n(N + J) + NJ\log(NJ)\big)}\right).$$

Given the above development, Proposition 20 is implied by the following lemma.

Lemma 38 Under the same conditions as Proposition 20,

$$\frac{1}{NJ}\|D_{N,J}^* - \hat D_{N,J}\|_F^2 \le \frac{16\beta_{4M^2}}{n}\sup_{D \in \mathcal{H}} |\bar l_{\Omega,Y}(D) - \mathbb{E}\,\bar l_{\Omega,Y}(D)|.$$

Therefore, with probability at least $1 - C_2/(N + J)$,

$$\frac{1}{NJ}\|D_{N,J}^* - \hat D_{N,J}\|_F^2 \le \frac{64 C_1 M^2 L_{4M^2}\beta_{4M^2}\sqrt{(K^+ + 2)\big(n(N + J) + NJ\log(NJ)\big)}}{n}.$$

We complete the proof by absorbing $64\sqrt{K^+ + 2}$ into $C_1$.

Proof [Proof of Proposition 22] We use $A_{N,J}$ to denote the event that the result in Proposition 20 holds. By Theorem 6 and Proposition 8, on $A_{N,J}$,

$$\inf_{F \in A_{K^+}} \left(\frac{\sum_{i=1}^N \|\theta_i^+ - F(\hat\theta_i^\Omega)\|^2}{N} + \frac{\sum_{j=1}^J \|a_j^+ - F(\hat a_j^\Omega)\|^2}{J}\right)$$

goes to 0 as $N, J$ grow to infinity. Since $\Pr(A_{N,J}) \to 1$, we complete the proof.

B.4 Proof of Lemmas

Proof [Proof of Lemma 29] Let $\tau_1 = [x_1, ..., x_n]$, $\tau_2 = [y_1, ..., y_n]$, and $\tau_3 = [z_1, ..., z_n]$. Define

$$\tilde d(\tau_1, \tau_2) := \min_{F \in A_K} \max_i \|F(y_i) - x_i\|,$$

and it is easy to check that $\tilde d(\tau_1, \tau_2) \le d(\tau_1, \tau_2) \le \sqrt{n}\,\tilde d(\tau_1, \tau_2)$. So we just need to verify that the function $\tilde d(\cdot, \cdot)$ satisfies the triangle inequality. Let isometries $F_{21}, F_{31}$ satisfy

$$\tilde d(\tau_1, \tau_2) = \max_i \|F_{21}(y_i) - x_i\| = \|F_{21}(y_l) - x_l\|, \quad \tilde d(\tau_1, \tau_3) = \max_i \|F_{31}(z_i) - x_i\| = \|F_{31}(z_m) - x_m\|.$$

Then

$$\tilde d(\tau_2, \tau_3) \le \max_i \|F_{31}(z_i) - F_{21}(y_i)\| \le \max_i \{\|F_{31}(z_i) - x_i\| + \|F_{21}(y_i) - x_i\|\} \le \|F_{21}(y_l) - x_l\| + \|F_{31}(z_m) - x_m\| = \tilde d(\tau_1, \tau_2) + \tilde d(\tau_1, \tau_3).$$

We complete the proof.
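In the spirit of the configuration distance used in Lemma 29, the following sketch aligns one point configuration to another over isometries (orthogonal map plus translation) and evaluates the resulting distance; the orthogonal Procrustes solution via the SVD is our choice of minimizer for the sum-of-squares objective, and the helper name `align` is ours.

```python
import numpy as np

def align(X, Y):
    """Return F(X), where F is the Procrustes isometry mapping X toward Y."""
    xc, yc = X.mean(axis=0), Y.mean(axis=0)
    U, _, Vt = np.linalg.svd((X - xc).T @ (Y - yc))
    O = U @ Vt                                # orthogonal (may include a reflection)
    return (X - xc) @ O + yc

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
angle = 0.7                                   # build Y as an isometric copy of X
R = np.array([[np.cos(angle), -np.sin(angle)], [np.sin(angle), np.cos(angle)]])
Y = X @ R.T + np.array([3.0, -1.0])

FX = align(X, Y)
dist = np.sqrt(((FX - Y) ** 2).sum())         # ~0: the two configurations coincide
print(f"aligned distance = {dist:.2e}")
```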
Proof [Proof of Lemma 31] Suppose otherwise. Then there exist $\epsilon_0 > 0$ and sequences $\{\tau_1^{(n)}\}_{n=1}^\infty \subset H_{k_1+k_2,K^+,M}$ and $\{\tau_2^{(n)}\}_{n=1}^\infty \subset \mathcal{B}$ such that

$$\|\Phi_{k_1,k_2,K^+}(\tau_1^{(n)}) - \Phi_{k_1,k_2,K^+}(\tau_2^{(n)})\|_F < \frac{1}{n} \quad \text{and} \quad d(\tau_1^{(n)}, \tau_2^{(n)}) > \epsilon_0.$$

Since both $H_{k_1+k_2,K^+,M}$ and $\mathcal{B}$ are compact, there exists a subsequence $\{n_k\}_{k=1}^\infty \subset \mathbb{N}_+$ such that $\lim_{k\to\infty} \tau_1^{(n_k)} = \bar\tau \in H_{k_1+k_2,K^+,M}$ and $\lim_{k\to\infty} \tau_2^{(n_k)} = \tau_0 \in \mathcal{B}$. The two configurations $\bar\tau$ and $\tau_0$ have the same partial distance matrix, but $d(\bar\tau, \tau_0) \ge \epsilon_0$. This is a contradiction, because $\tau_0 \in \mathcal{B}$ is the only configuration of its partial distance matrix, by the requirement on $\mathcal{B}$.

Proof [Proof of Lemma 32] For a collection of points $\{x, b_1^*, ..., b_{k_1}^*\}, \{c_1^*, ..., c_{k_2}^*\}$, it is not difficult to verify that condition A2 holds, so we only need to verify A1. To verify A1, it suffices to show that if $\{b_1, ..., b_{k_1}\}, \{c_1, ..., c_{k_2}\}$ is a collection of anchor points, then for any $x \in B_0^K(C)$, $[x, b_1, ..., b_{k_1}, c_1, ..., c_{k_2}]$ is the unique configuration corresponding to its $(k_1 + 1) \times k_2$ partial distance matrix. Suppose that $\tau = [x, b_1, ..., b_{k_1}, c_1, ..., c_{k_2}]$ and $\tau' = [x', b_1', ..., b_{k_1}', c_1', ..., c_{k_2}']$ satisfy $\Phi_{k_1+1,k_2,K}(\tau) = \Phi_{k_1+1,k_2,K}(\tau')$. Then $\Phi_{k_1,k_2,K}([b_1, ..., b_{k_1}, c_1, ..., c_{k_2}]) = \Phi_{k_1,k_2,K}([b_1', ..., b_{k_1}', c_1', ..., c_{k_2}'])$. Since $\{b_1, ..., b_{k_1}\}, \{c_1, ..., c_{k_2}\}$ are a collection of anchor points, it follows that $[b_1, ..., b_{k_1}, c_1, ..., c_{k_2}] = [b_1', ..., b_{k_1}', c_1', ..., c_{k_2}']$. Without loss of generality, we suppose $b_l = b_l'$ and $c_m = c_m'$. Then the two configurations $[x, c_1, ..., c_{k_2}]$ and $[x', c_1, ..., c_{k_2}]$ have the same complete distance matrix, which further implies that $[x, c_1, ..., c_{k_2}] = [x', c_1, ..., c_{k_2}]$. Since $c_1, ..., c_{k_2}$ affinely span $\mathbb{R}^K$, it is not difficult to see that $x = x'$. Then we get $\tau = \tau'$, and A1 is verified.

Proof [Proof of Lemma 33] We define

$$S_{N,J}^*(\epsilon) = [B_1^*(\epsilon), ..., B_{k_1}^*(\epsilon), G_1^*(\epsilon), ..., G_{k_2}^*(\epsilon)] \subset H_{k_1+k_2,K,M}, \quad \tilde S_{N,J}(\epsilon) = [B_1(\epsilon), ..., B_{k_1}(\epsilon), G_1(\epsilon), ..., G_{k_2}(\epsilon)] \subset H_{k_1+k_2,K^+,M},$$

where $B_k^*(\epsilon), B_k(\epsilon), G_l^*(\epsilon), G_l(\epsilon)$ are defined in the proof of Theorem 6. Let

$$\sigma_{N,J} := d(\tilde S_{N,J}(\epsilon), S_{N,J}^*(\epsilon)). \quad (23)$$

By (13) and the triangle inequality, there exists an isometry $F_{N,J} \in A_{K^+}$ such that for all $x_k^* \in B_k^*(\epsilon)$, $y_l^* \in G_l^*(\epsilon)$, $\tilde x_k \in B_k(\epsilon)$, $\tilde y_l \in G_l(\epsilon)$,

$$\|F_{N,J}(\tilde x_k) - x_k^{*+}\| \le 4\epsilon + \sigma_{N,J},\ 1 \le k \le k_1, \quad \|F_{N,J}(\tilde y_l) - y_l^{*+}\| \le 4\epsilon + \sigma_{N,J},\ 1 \le l \le k_2. \quad (24)$$

In what follows, we show that $\sigma_{N,J} \le \epsilon$ for $N, J$ large enough. We first define

$$\gamma_{N,J} := \inf\{\|\Phi_{k_1,k_2,K^+}(\tilde\tau) - \Phi_{k_1,k_2,K^+}(\tau^*)\|_F : \tilde\tau \in \tilde S_{N,J}(\epsilon),\ \tau^* \in S_{N,J}^*(\epsilon)\}, \quad (25)$$

and we have

$$\gamma_{N,J}^2 (p_\epsilon N)(p_\epsilon J) \le \|\tilde D_{N,J} - D_{N,J}^*\|_F^2 = o(NJ),$$

which leads to

$$\gamma_{N,J} = o(1). \quad (26)$$

By (25), there exist $\tilde\tau \in \tilde S_{N,J}(\epsilon)$ and $\tau^* \in S_{N,J}^*(\epsilon)$ such that $\|\Phi_{k_1,k_2,K^+}(\tilde\tau) - \Phi_{k_1,k_2,K^+}(\tau^*)\|_F \le 2\gamma_{N,J}$. Then by (23), we have

$$\sigma_{N,J} = d(\tilde S_{N,J}(\epsilon), S_{N,J}^*(\epsilon)) \le d(\tau^*, \tilde\tau). \quad (27)$$

As shown at the beginning of the proof of Theorem 6, and according to Definition 4, $\tau^*$ is the unique configuration corresponding to its $k_1 \times k_2$ partial distance matrix. Since $\tilde\tau \in \tilde S_{N,J}(\epsilon) \subset H_{k_1+k_2,K^+,M}$, we know by Lemma 31 that $d(\tau^*, \tilde\tau) \to 0$ as $N, J$ grow to infinity, and thus

$$d(\tau^*, \tilde\tau) < \epsilon \quad (28)$$

for $N, J$ large enough. Finally, since $B_k^*(\epsilon), G_l^*(\epsilon) \subset B_0^K(M)$ and $B_k(\epsilon), G_l(\epsilon) \subset B_0^{K^+}(M)$, we have, for $N, J$ large enough, $\|F_{N,J}(x)\| \le 4M$ for $x \in B_0^{K^+}(M)$. To see this, suppose there exists $x \in B_0^{K^+}(M)$ such that $\|F_{N,J}(x)\| > 4M$; then, by simple geometry, $\min_{x \in B_0^{K^+}(M)} \|F_{N,J}(x) - x\| > M$. According to (24) and (28), we would then get $M < \|F_{N,J}(\tilde x_k) - x_k^{*+}\| \le 4\epsilon + \sigma_{N,J} \le 5\epsilon$, which contradicts the fact that $\epsilon < M/5$.

Proof [Proof of Lemma 34] Let $\bar c_1, ..., \bar c_{k_2}$ denote the centers of $G_1^*(\epsilon), ..., G_{k_2}^*(\epsilon)$, and let $\bar c_l^+ = (\bar c_l^\top, 0^\top)^\top \in \mathbb{R}^{K^+}$. We first give the following lemma.
Lemma 39 For any $\tau_1 = [x, x_1, ..., x_{k_2}] \in [B_0^K(M), G_1^*(\epsilon), ..., G_{k_2}^*(\epsilon)]$ and $\tau_2 = [y, y_1, ..., y_{k_2}] \in [B_0^{K^+}(M), B_{\bar c_1^+}(\epsilon), ..., B_{\bar c_{k_2}^+}(\epsilon)]$, we have

$$\|x^+ - y\| \le c\left(d(\tau_1, \tau_2) + \sqrt{\sum_{l=1}^{k_2}\|x_l^+ - y_l\|^2}\right)$$

for a constant $c$ that depends only on the set $\{\bar c_1, ..., \bar c_{k_2}\}$ and $M$.

Define

$$H_1(\epsilon) := \{i \in [N] : \|F_{N,J}(\tilde\theta_i) - \theta_i^+\| > 5\max(c, 1)\sqrt{k_1 + k_2}\,\epsilon\} \quad (29)$$

and

$$H_2(\epsilon) := \{j \in [J] : \|F_{N,J}(\tilde a_j) - a_j^+\| > 5\max(c, 1)\sqrt{k_1 + k_2}\,\epsilon\}, \quad (30)$$

where $c$ is the constant in Lemma 39. We set the constant $\kappa$ in Lemma 34 to be $5\max(c, 1)\sqrt{k_1 + k_2}$, and then we have $|H_1(\epsilon)| = N\lambda_{1,N,J}$ and $|H_2(\epsilon)| = J\lambda_{2,N,J}$. Note that $I_1(\epsilon) \cap H_1(\epsilon) = \emptyset$ and $I_2(\epsilon) \cap H_2(\epsilon) = \emptyset$ for $N, J$ large. We choose $i_1, ..., i_{k_1} \in I_1(\epsilon)$ and $j_1, ..., j_{k_2} \in I_2(\epsilon)$ such that $\theta_{i_k}^* \in B_k^*(\epsilon)$, $\tilde\theta_{i_k} \in B_k(\epsilon)$, $a_{j_l}^* \in G_l^*(\epsilon)$, and $\tilde a_{j_l} \in G_l(\epsilon)$ for $1 \le k \le k_1$ and $1 \le l \le k_2$. For any $i \in H_1(\epsilon)$, we consider the following configurations:

$$\tau^* = [\theta_i^*, \theta_{i_1}^*, ..., \theta_{i_{k_1}}^*, a_{j_1}^*, ..., a_{j_{k_2}}^*] \in H_{k_1+k_2+1,K,M}, \quad \tilde\tau = [\tilde\theta_i, \tilde\theta_{i_1}, ..., \tilde\theta_{i_{k_1}}, \tilde a_{j_1}, ..., \tilde a_{j_{k_2}}] \in H_{k_1+k_2+1,K^+,M},$$

and

$$\tau_1^* = [\theta_i^*, a_{j_1}^*, ..., a_{j_{k_2}}^*] \in [B_0^K(M), G_1^*(\epsilon), ..., G_{k_2}^*(\epsilon)], \quad \tilde\tau_1 = [\tilde\theta_i, \tilde a_{j_1}, ..., \tilde a_{j_{k_2}}] \in [B_0^{K^+}(M), G_1(\epsilon), ..., G_{k_2}(\epsilon)].$$

It is obvious that $d(\tilde\tau, \tau^*) \ge d(\tilde\tau_1, \tau_1^*)$. By Lemma 33, we have

$$\sqrt{\sum_{l=1}^{k_2}\|F_{N,J}(\tilde a_{j_l}) - a_{j_l}^{*+}\|^2} \le 5\sqrt{k_2}\,\epsilon.$$

Combining this with (29) and Lemma 39, we have $d(\tilde\tau_1, \tau_1^*) > 5\sqrt{k_1 + k_2}\,\epsilon$, which leads to

$$d(\tilde\tau, \tau^*) > 5\sqrt{k_1 + k_2}\,\epsilon. \quad (31)$$

According to Lemma 32, $\{\theta_i^*, \theta_{i_1}^*, ..., \theta_{i_{k_1}}^*\}, \{a_{j_1}^*, ..., a_{j_{k_2}}^*\}$ are a collection of anchor points. Let $\tilde D, D^* \in P_{k_1+1,k_2,K^+,M}$ be the partial distance matrices of $\tilde\tau$ and $\tau^*$, respectively. Combining (31) and Lemma 31, there exists a constant $\delta_\epsilon > 0$ such that

$$\|\tilde D - D^*\|_F \ge \delta_\epsilon. \quad (32)$$

For each $i \in H_1(\epsilon)$, we choose $i_1, ..., i_{k_1} \in I_1(\epsilon)$ to form a group $\{i, i_1, ..., i_{k_1}\} \subset [N]$ such that $(\theta_i^*, \theta_{i_1}^*, ..., \theta_{i_{k_1}}^*) \in B_0^K(M) \times B_1^*(\epsilon) \times \cdots \times B_{k_1}^*(\epsilon)$ and $(\tilde\theta_i, \tilde\theta_{i_1}, ..., \tilde\theta_{i_{k_1}}) \in B_0^{K^+}(M) \times B_1(\epsilon) \times \cdots \times B_{k_1}(\epsilon)$. We can find at least $\min\{\lambda_{1,N,J}, p_\epsilon\} N$ such groups that are mutually exclusive. We can also find at least $p_\epsilon J$ mutually exclusive groups $\{j_1, ..., j_{k_2}\} \subset [J]$ such that $(a_{j_1}^*, ..., a_{j_{k_2}}^*) \in G_1^*(\epsilon) \times \cdots \times G_{k_2}^*(\epsilon)$ and $(\tilde a_{j_1}, ..., \tilde a_{j_{k_2}}) \in G_1(\epsilon) \times \cdots \times G_{k_2}(\epsilon)$. By (4) and (32), we have

$$\min\{\lambda_{1,N,J}, p_\epsilon\}\,N\,p_\epsilon J\,\delta_\epsilon^2 \le o(NJ).$$

So $\min\{\lambda_{1,N,J}, p_\epsilon\} = o(1)$, which means $\lambda_{1,N,J} \to 0$ as $N, J$ grow to infinity. A similar result holds for $\lambda_{2,N,J}$, and we do not repeat the argument.

Proof [Proof of Lemma 35] Consider the K-means clustering of the person points in Algorithm 2. We define the loss function for K-means clustering as

$$L(\vartheta_1, ..., \vartheta_N) = \frac{1}{N}\sum_{i=1}^N \|\hat\theta_i - \mu_{\vartheta_i}\|^2,$$

where $\vartheta_i \in \{1, ..., k_1\}$ represents the cluster membership of person $i$ and

$$\mu_k = \frac{\sum_{i=1}^N \hat\theta_i 1_{\{\vartheta_i = k\}}}{\sum_{i=1}^N 1_{\{\vartheta_i = k\}}}$$

denotes the centroid of the $k$th cluster. Under the conditions of Theorem 23, the K-means clustering converges to the global optimum, which implies that

$$L(\hat\vartheta_1, ..., \hat\vartheta_N) = \min_{\vartheta_i \in \{1, ..., k_1\},\ i = 1, ..., N} L(\vartheta_1, ..., \vartheta_N). \quad (33)$$

So for any isometry $F \in A_K$,

$$\frac{1}{N}\sum_{i=1}^N \|\hat\theta_i - \mu_{\hat\vartheta_i}\|^2 \le \frac{1}{N}\sum_{i=1}^N \|\hat\theta_i - F(b_{\vartheta_i^*}^*)\|^2.$$

By the triangle inequality,

$$\left(\frac{1}{N}\sum_{i=1}^N \|\mu_{\hat\vartheta_i} - F(b_{\vartheta_i^*}^*)\|^2\right)^{1/2} \le \left(\frac{1}{N}\sum_{i=1}^N \|\mu_{\hat\vartheta_i} - \hat\theta_i\|^2\right)^{1/2} + \left(\frac{1}{N}\sum_{i=1}^N \|\hat\theta_i - F(b_{\vartheta_i^*}^*)\|^2\right)^{1/2} \le 2\left(\frac{1}{N}\sum_{i=1}^N \|\hat\theta_i - F(b_{\vartheta_i^*}^*)\|^2\right)^{1/2}.$$

Define $d^* = \min_{i \ne j} \|b_i^* - b_j^*\|$ and, for $F \in A_K$, define $A_F := \{1 \le i \le N : \|\mu_{\hat\vartheta_i} - F(b_{\vartheta_i^*}^*)\| < d^*/2\}$, and denote $A_F^c := \{1, ..., N\} \setminus A_F$. Then

$$1 - \frac{1}{N}\sum_{i \in A_{F_{N,J}^*}} 1 = \frac{1}{N}\sum_{i \in A_{F_{N,J}^*}^c} 1 \le \frac{4}{d^{*2}}\cdot\frac{1}{N}\sum_{i \in A_{F_{N,J}^*}^c}\|\mu_{\hat\vartheta_i} - F_{N,J}^*(b_{\vartheta_i^*}^*)\|^2 \le \frac{4}{d^{*2}}\cdot\frac{1}{N}\sum_{i=1}^N\|\mu_{\hat\vartheta_i} - F_{N,J}^*(b_{\vartheta_i^*}^*)\|^2 \le \frac{16}{d^{*2}}\cdot\frac{1}{N}\sum_{i=1}^N\|\hat\theta_i - F_{N,J}^*(b_{\vartheta_i^*}^*)\|^2 = o_p(1). \quad (34)$$

Lemma 40 Under the same conditions as Lemma 35, if there exists $\zeta_1 \in B_{k_1}$ satisfying $\|\mu_{\zeta_1(l)} - F_{N,J}^*(b_l^*)\| < d^*/2$, $l = 1, ..., k_1$, where $\mu_l$ is the centroid of the $l$th cluster, $F_{N,J}^*$ is defined in (17), and $d^*$ is defined above, then there exists $\zeta_2 \in B_{k_1}$ such that $\hat\vartheta_i = \zeta_2(\vartheta_i^*)$ for all $i \in A_{F_{N,J}^*}$.
Let $\Omega_{N,J} := \{\omega : \exists \zeta \in B_{k_1} \text{ s.t. } \|\mu_{\zeta(l)}(\omega) - F_{N,J}^*(b_l^*)\| < d^*/2,\ l = 1, ..., k_1\}$. Notice that $\Omega_{N,J}$ is a subset of the whole probability space. By Lemma 40, for any $\omega \in \Omega_{N,J}$, there exists $\zeta_{N,J} \in B_{k_1}$, corresponding to $\zeta_2$ in Lemma 40, such that

$$\max_{\zeta \in B_{k_1}} \frac{\sum_{i=1}^N 1_{\{\vartheta_i^* = \zeta(\hat\vartheta_i(\omega))\}}}{N} \ge \frac{\sum_{i=1}^N 1_{\{\hat\vartheta_i = \zeta_{N,J}(\vartheta_i^*)\}}}{N} \ge \frac{1}{N}\sum_{i \in A_{F_{N,J}^*}} 1.$$

Lemma 41 Under the same conditions as Lemma 35, we have $\lim_{N,J\to\infty} \Pr(\Omega_{N,J}) = 1$, where $\Omega_{N,J}$ is defined above.

By Lemma 41 and (34), we complete the proof.

Proof [Proof of Lemma 36] Without loss of generality, we suppose that the ball $G \subset \mathbb{R}^K$ is centered at the origin. By Theorem 2.1 of Alfakih (2005), we know there exist $k_1, k_2 \ge K + 1$ and two sets of points $\{b_1^*, ..., b_{k_1}^*\}, \{c_1^*, ..., c_{k_2}^*\} \subset \mathrm{int}(G)$ satisfying condition A2 whose partial distance matrix $D^*$ has a unique configuration. Furthermore, points near the $b_i^*, c_j^*$ also have this property. Specifically, there exists $\epsilon > 0$ such that for $b_i \in B_{b_i^*}^K(\epsilon) \subset G$ and $c_j \in B_{c_j^*}^K(\epsilon) \subset G$, the sets $\{b_1, ..., b_{k_1}\}, \{c_1, ..., c_{k_2}\}$ satisfy condition A2 and their partial distance matrix $D$ has a unique configuration. Then, by Lemma 31, condition A1 holds, and $\{b_1^*, ..., b_{k_1}^*\}, \{c_1^*, ..., c_{k_2}^*\}$ are anchor points in $\mathbb{R}^K$.

Proof [Proof of Lemma 37] The proof of Lemma 37 is similar to that of Lemma A.1 of Davenport et al. (2014).

Proof [Proof of Lemma 38] We have

$$0 \le \bar l_{\Omega,Y}(\hat D_{N,J}) - \bar l_{\Omega,Y}(D_{N,J}^*) = \bar l_{\Omega,Y}(\hat D_{N,J}) - \mathbb{E}\,\bar l_{\Omega,Y}(\hat D_{N,J}) + \mathbb{E}\,\bar l_{\Omega,Y}(\hat D_{N,J}) - \mathbb{E}\,\bar l_{\Omega,Y}(D_{N,J}^*) + \mathbb{E}\,\bar l_{\Omega,Y}(D_{N,J}^*) - \bar l_{\Omega,Y}(D_{N,J}^*) \le \mathbb{E}\,\bar l_{\Omega,Y}(\hat D_{N,J}) - \mathbb{E}\,\bar l_{\Omega,Y}(D_{N,J}^*) + 2\sup_{D \in \mathcal{H}} |\bar l_{\Omega,Y}(D) - \mathbb{E}\,\bar l_{\Omega,Y}(D)|.$$

So

$$\mathbb{E}\,\bar l_{\Omega,Y}(D_{N,J}^*) - \mathbb{E}\,\bar l_{\Omega,Y}(\hat D_{N,J}) \le 2\sup_{D \in \mathcal{H}} |\bar l_{\Omega,Y}(D) - \mathbb{E}\,\bar l_{\Omega,Y}(D)|.$$

Notice that

$$\mathbb{E}\,\bar l_{\Omega,Y}(D_{N,J}^*) - \mathbb{E}\,\bar l_{\Omega,Y}(\hat D_{N,J}) = \frac{n}{NJ}\sum_{i,j}\left[f(d_{ij}^*)\log\frac{f(d_{ij}^*)}{f(\hat d_{ij})} + (1 - f(d_{ij}^*))\log\frac{1 - f(d_{ij}^*)}{1 - f(\hat d_{ij})}\right].$$

For two distributions $P$ and $Q$, let $D_{KL}(P\|Q)$ denote the Kullback-Leibler divergence

$$D_{KL}(P\|Q) := \int p(x)\log\frac{p(x)}{q(x)}\,dx,$$

where $p(x)$ and $q(x)$ are the density functions of $P$ and $Q$, respectively. For $0 < p, q < 1$, we use

$$D_{KL}(p\|q) := p\log\Big(\frac{p}{q}\Big) + (1 - p)\log\Big(\frac{1 - p}{1 - q}\Big)$$

to denote the Kullback-Leibler divergence between two Bernoulli distributions with parameters $p$ and $q$, respectively. For $P, Q \in (0, 1)^{N \times J}$, we define

$$D_{KL}(P\|Q) := \frac{1}{NJ}\sum_{i,j} D_{KL}(P_{ij}\|Q_{ij}).$$

For a partial distance matrix $D_{N,J}$, denote by $f(D_{N,J})$ the matrix $(f(d_{ij}))_{N \times J}$. So, from the above, we know that

$$n\,D_{KL}(f(D_{N,J}^*)\|f(\hat D_{N,J})) \le 2\sup_{D \in \mathcal{H}} |\bar l_{\Omega,Y}(D) - \mathbb{E}\,\bar l_{\Omega,Y}(D)|.$$

Still for $0 < p, q < 1$, let

$$d_H^2(p, q) := (\sqrt{p} - \sqrt{q})^2 + (\sqrt{1 - p} - \sqrt{1 - q})^2$$

denote the squared Hellinger distance between two Bernoulli distributions with parameters $p$ and $q$, respectively. For $P, Q \in (0, 1)^{N \times J}$, we define

$$d_H^2(P, Q) := \frac{1}{NJ}\sum_{i,j} d_H^2(P_{ij}, Q_{ij}).$$

It is easy to check that $d_H^2(p, q) \le D_{KL}(p\|q)$. So

$$d_H^2(f(D_{N,J}^*), f(\hat D_{N,J})) \le \frac{2}{n}\sup_{D \in \mathcal{H}} |\bar l_{\Omega,Y}(D) - \mathbb{E}\,\bar l_{\Omega,Y}(D)|.$$

By Lemma A.2 of Davenport et al. (2014), we have

$$\frac{1}{NJ}\|\hat D_{N,J} - D_{N,J}^*\|_F^2 \le 8\beta_{4M^2}\,d_H^2(f(D_{N,J}^*), f(\hat D_{N,J})) \le \frac{16\beta_{4M^2}}{n}\sup_{D \in \mathcal{H}} |\bar l_{\Omega,Y}(D) - \mathbb{E}\,\bar l_{\Omega,Y}(D)|.$$
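A quick numerical check, under our own naming, of the inequality $d_H^2(p, q) \le D_{KL}(p\|q)$ for Bernoulli distributions, which is the step used above to pass from the Kullback-Leibler bound to the Hellinger bound.

```python
import numpy as np

def kl_bernoulli(p, q):
    # D_KL(p || q) for Bernoulli parameters p, q in (0, 1)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def hellinger_sq_bernoulli(p, q):
    # d_H^2(p, q) = (sqrt(p) - sqrt(q))^2 + (sqrt(1-p) - sqrt(1-q))^2
    return (np.sqrt(p) - np.sqrt(q)) ** 2 + (np.sqrt(1 - p) - np.sqrt(1 - q)) ** 2

rng = np.random.default_rng(0)
p = rng.uniform(1e-3, 1 - 1e-3, size=100_000)
q = rng.uniform(1e-3, 1 - 1e-3, size=100_000)
print(bool(np.all(hellinger_sq_bernoulli(p, q) <= kl_bernoulli(p, q) + 1e-12)))  # True
```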
Proof [Proof of Lemma 39] Denote

$$\eta := d(\tau_1, \tau_2) + \sqrt{\sum_{l=1}^{k_2}\|x_l^+ - y_l\|^2},$$

and then

$$d(\tau_1, \tau_2) \le \eta \quad \text{and} \quad \|x_l^+ - y_l\| \le \eta,\ l = 1, ..., k_2. \quad (35)$$

Therefore, there exist $A \in O_{K^+}$ and $b \in \mathbb{R}^{K^+}$ such that

$$\sqrt{\|Ax^+ + b - y\|^2 + \sum_{l=1}^{k_2}\|Ax_l^+ + b - y_l\|^2} \le \eta,$$

which leads to

$$\|Ax^+ + b - y\| \le \eta \quad (36)$$

and

$$\|Ax_l^+ + b - y_l\| \le \eta,\ l = 1, ..., k_2. \quad (37)$$

Combining (35) and (37), we get $\|Ax_l^+ + b - x_l^+\| \le 2\eta$, $l = 1, ..., k_2$. According to condition A2, $x_1, ..., x_{k_2}$ affinely span $\mathbb{R}^K$. Then there exist $\alpha_1, ..., \alpha_{k_2}$ satisfying $\sum_{l=1}^{k_2}\alpha_l = 1$ such that $x = \sum_{l=1}^{k_2}\alpha_l x_l$. So we have

$$\|Ax^+ + b - x^+\| = \Big\|\sum_{l=1}^{k_2}\alpha_l(Ax_l^+ + b - x_l^+)\Big\| \le \Big(2\sum_{j=1}^{k_2}|\alpha_j|\Big)\eta.$$

Combining this with (36), we have

$$\|x^+ - y\| \le \Big(2\sum_{j=1}^{k_2}|\alpha_j| + 1\Big)\eta.$$

We complete the proof by setting the constant $c$ in Lemma 39 to be

$$c := \max_{x \in B_0^K(M),\ x_l \in G_l^*(\epsilon)}\ \inf_{\sum_l \alpha_l = 1,\ x = \sum_l \alpha_l x_l}\Big(2\sum_l|\alpha_l| + 1\Big).$$

Proof [Proof of Lemma 40] For $i, j \in A_{F_{N,J}^*}$, if $\vartheta_i^* = \vartheta_j^*$, say both equal to $k$, then $\|\mu_{\hat\vartheta_i} - F_{N,J}^*(b_k^*)\| < d^*/2$ and $\|\mu_{\hat\vartheta_j} - F_{N,J}^*(b_k^*)\| < d^*/2$. Given the condition in Lemma 40, it is easy to check that there is only one $\mu_l$ among $\{\mu_1, ..., \mu_{k_1}\}$ satisfying $\|\mu_l - F_{N,J}^*(b_k^*)\| < d^*/2$, so $\hat\vartheta_i = \hat\vartheta_j$. If $\vartheta_i^* \ne \vartheta_j^*$, then $\|\mu_{\hat\vartheta_i} - F_{N,J}^*(b_{\vartheta_i^*}^*)\| < d^*/2$ and $\|\mu_{\hat\vartheta_j} - F_{N,J}^*(b_{\vartheta_j^*}^*)\| < d^*/2$. So

$$\|\mu_{\hat\vartheta_i} - \mu_{\hat\vartheta_j}\| \ge \|F_{N,J}^*(b_{\vartheta_i^*}^*) - F_{N,J}^*(b_{\vartheta_j^*}^*)\| - \|\mu_{\hat\vartheta_i} - F_{N,J}^*(b_{\vartheta_i^*}^*)\| - \|\mu_{\hat\vartheta_j} - F_{N,J}^*(b_{\vartheta_j^*}^*)\| > d^* - \frac{d^*}{2} - \frac{d^*}{2} = 0,$$

which means $\hat\vartheta_i \ne \hat\vartheta_j$. So there exists $\zeta_2$ such that $\hat\vartheta_i = \zeta_2(\vartheta_i^*)$ for $i \in A_{F_{N,J}^*}$.

Proof [Proof of Lemma 41] Let

$$\Gamma_{N,J}^{(\epsilon')} := \Big\{\omega : \frac{1}{N}\sum_{i \in A_{F_{N,J}^*}} 1 \ge 1 - \epsilon'\Big\},$$

which is a subset of the whole probability space. By (34), for any $\epsilon' > 0$, we have $\lim_{N,J\to\infty}\Pr(\Gamma_{N,J}^{(\epsilon')}) = 1$. For any $\omega \notin \Omega_{N,J}$, there exists $l$ such that $\|\mu_m(\omega) - F_{N,J}^*(b_l^*)\| \ge d^*/2$ for all $m = 1, ..., k_1$. So, for $i$ satisfying $\vartheta_i^* = l$, we have $\|\mu_{\hat\vartheta_i}(\omega) - F_{N,J}^*(b_{\vartheta_i^*}^*)\| \ge d^*/2$. According to requirement (2) of condition A3, for sufficiently small $\epsilon'$, if $N, J$ are sufficiently large, then $\omega \notin \Gamma_{N,J}^{(\epsilon')}$, which means $\Gamma_{N,J}^{(\epsilon')} \subset \Omega_{N,J}$. Hence, for sufficiently small $\epsilon'$,

$$\lim_{N,J\to\infty}\Pr(\Omega_{N,J}) \ge \lim_{N,J\to\infty}\Pr(\Gamma_{N,J}^{(\epsilon')}) = 1.$$

We complete the proof.

Figure 10: Analysis of movie rating data: Simultaneous visualization of the estimated movie and user points.

Appendix C. Algorithm-based MDU: Real Data Examples

To compare the proposed method with classical algorithm-based MDU methods, we apply ordinal MDU (Busing et al., 2005) to both real datasets analyzed in the paper. The application is based on the implementation in the R package smacof (de Leeuw and Mair, 2009). For both examples, the latent dimension is set to two, and all the tuning parameters are set to their default values. The results below show that the ordinal MDU approach provides visualization results similar to those of the proposed method, especially for the roll call voting data, due to its unidimensional nature. The results for the movie rating dataset are also similar for the two methods, but the interpretable patterns from the ordinal MDU approach are not as clear as those from the proposed method.

Figures 10 through 12 show the same plots as in Figures 4 through 6 in Section 5.1, respectively, for the movie rating dataset. Figure 10 provides the simultaneous visualization of the movie and user points. Similar to the plot in Figure 4 given by our method, the movies and the users tend to form two large clusters that only slightly overlap. Figure 11 is similar to Figure 5, where the two panels show the same scatter plot of the movie points. In the left panel, the movies are stratified by the numbers of ratings that they received, with the different strata marked by different colors. In the right panel, the movies are stratified by their release time. Recall that the patterns of popularity and release time are captured by the proposed method, as shown in Figure 5. Figure 11 also seems to capture these patterns, though not as clearly as Figure 5.

Figure 11: Analysis of movie rating data.
Panel (a): Visualization of movie points, with movies stratified into four equal-size categories based on the numbers of ratings. Movies with numbers of ratings less than 127, between 128 and 169, between 170 and 229, and more than 230 are indicated by black, red, green, and blue points, respectively. Panel (b): Visualization of movie points, with movies stratified into three categories based on their release time. Movies released in 1997-1998, in 1995-1996, and before 1995 are indicated by green, red, and black points, respectively.

According to panel (a) of Figure 11, the more popular movies tend to be located near the origin, while the less popular movies tend to be located away from the origin. According to panel (b) of Figure 11, the clustering pattern of the movies can be largely explained by the three categories of release dates. From the left to the right of the space, the points correspond to movies from the relatively older ones to the relatively more recent ones.

Figure 12 shows the same plots as in Figure 6, and a similar pattern appears: the shorter the average distance from a user point to the movie points, the more active the user is. In Figure 12, users are classified into four equal-size groups depending on the numbers of movies they rated. These groups of users, from the most active one to the least active one, lie from the top left to the bottom right.

Figure 12: Analysis of movie rating data: Visualization of user points, with users classified into four equal-size categories based on the numbers of ratings. Users who rated at most 24, 25-47, 48-103, and at least 104 movies are indicated by black, red, green, and blue points, respectively.

Figures 13 through 15 show the same plots as in Figures 7 through 9 in Section 5.2, respectively, for the roll call voting dataset. Figure 13 provides the simultaneous visualization of senators and roll calls. Similar to the plot in Figure 7, most of the points tend to lie on a straight line.

Figure 13: Analysis of senator roll call data: Simultaneous visualization of the estimated senator and roll call ideal points.

Figure 14 provides a scatter plot of the senator points. Similar to Figure 8, most of the senator points tend to lie around a straight line, with the Democrats on one side and the Republicans on the other side. Also similar to Figure 8, the independent senator, Jim Jeffords from the state of Vermont, is mixed together with the Democrats, while the Democratic senator, Zell Miller from the state of Georgia, is mixed together with the Republicans.

Figure 14: Analysis of senator roll call data: Visualization of senator points, where senators are classified by their party membership. Specifically, the Democrats, Republicans, and an independent politician are indicated by blue, red, and green points, respectively.

Finally, Figure 15 shows the unfolding results for the roll calls. The pattern in panel (a) of Figure 15 is similar to that of Figure 9, where, from the right to the left, the proportion of "Yea" votes from the Republicans increases. Also similar to Figure 9, although most of the roll calls lie near the x-axis, there are still quite a few of them spreading out along the y-axis. According to panel (b) of Figure 15, based on the cross-entropy measure, the voting behavior on these roll calls tends to be heterogeneous within both parties. This result is similar to that given in panel (b) of Figure 9.
Figure 15: Analysis of senator roll call data. Panel (a): Visualization of roll call points, where roll calls are classified by the proportion of "Yea" votes from Republicans. Specifically, roll calls with proportions less than 0.068, between 0.068 and 0.52, between 0.52 and 0.73, and larger than 0.73 are indicated by black, red, green, and blue points, respectively. Panel (b): Box plots of $\min\{CE_j^{(1)}, CE_j^{(2)}\}$, for roll calls lying near the x-axis ($|\hat a_{j2}| \le 0.05$) on the left and for those spreading out along the y-axis ($|\hat a_{j2}| > 0.05$) on the right.

References

Abdo Y. Alfakih. On the uniqueness of Euclidean distance matrix completions: the case of points in general position. Linear Algebra and its Applications, 397:265-277, 2005.
Francis R. Bach. Consistency of trace norm minimization. Journal of Machine Learning Research, 9:1019-1048, 2008.
Jushan Bai and Serena Ng. Determining the number of factors in approximate factor models. Econometrica, 70:191-221, 2002.
Ryan Bakker and Keith T. Poole. Bayesian metric multidimensional scaling. Political Analysis, 21:125-140, 2013.
David J. Bartholomew, Martin Knott, and Irini Moustaki. Latent variable models and factor analysis: A unified approach. John Wiley & Sons, West Sussex, UK, 2011.
Joseph F. Bennett. Determination of the number of independent parameters of a score matrix from the examination of rank orders. Psychometrika, 21:383-393, 1956.
Joseph F. Bennett and William L. Hays. Multidimensional unfolding: Determining the dimensionality of ranked preference data. Psychometrika, 25:27-43, 1960.
Ingwer Borg and Patrick J. F. Groenen. Modern multidimensional scaling: Theory and applications. Springer, New York, NY, 2005.
Frank M. T. A. Busing, Patrick J. K. Groenen, and Willem J. Heiser. Avoiding degeneracy in multidimensional unfolding by penalizing on the coefficient of variation. Psychometrika, 70:71-98, 2005.
Tony Cai and Wen-Xin Zhou. A max-norm constrained minimization approach to 1-bit matrix completion. Journal of Machine Learning Research, 14:3619-3647, 2013.
Kehui Chen and Jing Lei. Network cross-validation for determining the number of communities in network data. Journal of the American Statistical Association, 113:241-251, 2018.
Lisha Chen and Andreas Buja. Local multidimensional scaling for nonlinear dimension reduction, graph drawing, and proximity analysis. Journal of the American Statistical Association, 104:209-219, 2009.
Yunxiao Chen, Xiaoou Li, and Siliang Zhang. Structured latent factor analysis for large-scale data: Identifiability, estimability, and their implications. Journal of the American Statistical Association, 115:1756-1770, 2020.
Yuxin Chen, Jianqing Fan, Cong Ma, and Yuling Yan. Inference and uncertainty quantification for noisy matrix completion. Proceedings of the National Academy of Sciences, 116:22931-22937, 2019.
Ward Cheney and David Kincaid. Linear algebra: Theory and applications. The Australian Mathematical Society, 110:544-550, 2009.
David Choi and Patrick J. Wolfe. Co-clustering separately exchangeable network data. The Annals of Statistics, 42:29-63, 2014.
Joshua Clinton, Simon Jackman, and Douglas Rivers. The statistical analysis of roll call data. American Political Science Review, 98:355-370, 2004a.
Joshua D. Clinton, Simon Jackman, and Doug Rivers. "The most liberal senator"? Analyzing and interpreting congressional roll calls. PS: Political Science & Politics, 37:805-811, 2004b.
Clyde H. Coombs. A theory of data. Wiley, New York, NY, 1964.
Frank Critchley. On certain linear mappings between inner-product and squared-distance matrices. Linear Algebra and its Applications, 105:91-107, 1988.
Mark A. Davenport, Yaniv Plan, Ewout van den Berg, and Mary Wootters. 1-bit matrix completion. Information and Inference: A Journal of the IMA, 3:189-223, 2014.
Jan de Leeuw and Patrick Mair. Multidimensional scaling using majorization: SMACOF in R. Journal of Statistical Software, 31:1-30, 2009. URL http://www.jstatsoft.org/v31/i03/.
Wayne S. De Sarbo and Donna L. Hoffman. Constructing MDS joint spaces from binary choice data: A multidimensional unfolding threshold model for marketing research. Journal of Marketing Research, 24:40-54, 1987.
Wayne S. De Sarbo, Martin R. Young, and Arvind Rangaswamy. A parametric multidimensional unfolding procedure for incomplete nonmetric preference/choice set data in marketing research. Journal of Marketing Research, 34:499-516, 1997.
Inderjit S. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. In Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining, pages 269-274, 2001.
David Donoho and Victoria Stodden. When does non-negative matrix factorization give a correct decomposition into parts? In S. Thrun, L. K. Saul, and B. Schölkopf, editors, Advances in neural information processing systems, pages 1141-1148. MIT Press, Cambridge, MA, 2004.
Susan E. Embretson and Steven P. Reise. Item response theory. Psychology Press, Hove, UK, 2000.
Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning. Springer, New York, NY, 2001.
Albert Gifi. Nonlinear multivariate analysis. Wiley, New York, NY, 1990.
Michael J. Greenacre and Michael W. Browne. An efficient alternating least-squares algorithm to perform multidimensional unfolding. Psychometrika, 51:241-250, 1986.
F. Maxwell Harper and Joseph A. Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5:1-19, 2016.
John A. Hartigan. Direct clustering of a data matrix. Journal of the American Statistical Association, 67:123-129, 1972.
William L. Hays and Joseph F. Bennett. Multidimensional unfolding: Determining configuration from complete rank order preference data. Psychometrika, 26:221-238, 1961.
Melvin J. Hinich. A new method for statistical multidimensional unfolding. Communications in Statistics - Theory and Methods, 34:2299-2310, 2005.
Ying Ho, Yuho Chung, and Kin-nam Lau. Unfolding large-scale marketing data. International Journal of Research in Marketing, 27:119-132, 2010.
Paul W. Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5:109-137, 1983.
Patrik O. Hoyer. Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research, 5:1457-1469, 2004.
Joseph B. Kruskal. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29:1-27, 1964.
Joseph B. Kruskal and Myron Wish. Multidimensional scaling. Sage, Beverly Hills, CA, 1978.
Brigitte Le Roux and Henry Rouanet. Multiple correspondence analysis. Sage, Newbury Park, CA, 2010.
Tianxi Li, Elizaveta Levina, and Ji Zhu. Network cross-validation by edge sampling. Biometrika, 107:257-276, 2020.
Fan Lu, Sündüz Keleş, Stephen J. Wright, and Grace Wahba. Framework for kernel regularization with application to protein clustering. Proceedings of the National Academy of Sciences, 102:12332-12337, 2005.
James MacQueen. Some methods for classification and analysis of multivariate observations. In Lucien M. Le Cam and Jerzy Neyman, editors, Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, pages 281-297. University of California Press, Berkeley, CA, 1967.
Patrick Mair, Jan de Leeuw, and Marcus Wurzer. Multidimensional unfolding. In Wiley StatsRef: Statistics Reference Online. Wiley, New York, NY, 2015.
Patrick Mair, Ingwer Borg, and Thomas Rusch. Goodness-of-fit assessment in multidimensional scaling and unfolding. Multivariate Behavioral Research, 51:772-789, 2016.
Peter J. Olver. Classical invariant theory. Cambridge University Press, Cambridge, UK, 1999.
Megan H. Papesh and Stephen D. Goldinger. A multidimensional scaling analysis of own- and cross-race face spaces. Cognition, 116:283-288, 2010.
Keith T. Poole. Nonparametric unfolding of binary choice data. Political Analysis, 8:211-237, 2000.
Keith T. Poole. Spatial models of parliamentary voting. Cambridge University Press, Cambridge, UK, 2005.
Keith T. Poole and Howard Rosenthal. Patterns of congressional voting. American Journal of Political Science, 35:228-278, 1991.
Keith T. Poole, Howard Rosenthal, and Kenneth Koford. On dimensionalizing roll call votes in the US Congress. American Political Science Review, 85:955-976, 1991.
Sophia Rabe-Hesketh and Anders Skrondal. Generalized latent variable modeling: Multilevel, longitudinal, and structural equation models. Chapman and Hall/CRC, New York, NY, 2004.
Mark D. Reckase. Multidimensional item response theory. Springer, New York, NY, 2009.
Karl Rohe, Tai Qin, and Bin Yu. Co-clustering directed graphs to discover asymmetries and directional communities. Proceedings of the National Academy of Sciences, 113:12679-12684, 2016.
Fumiko Samejima. Graded response model. In Wim J. van der Linden and Ronald K. Hambleton, editors, Handbook of modern item response theory, pages 85-100. Springer, New York, NY, 1997.
Yoshio Takane, Forrest W. Young, and Jan de Leeuw. Nonmetric individual differences multidimensional scaling: An alternating least squares method with optimal scaling features. Psychometrika, 42:7-67, 1977.
Joshua B. Tenenbaum, Vin De Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319-2323, 2000.
Katrijn Van Deun, Willem J. Heiser, and Luc Delbeke. Multidimensional unfolding by nonmetric multidimensional scaling of Spearman distances in the extended permutation polytope. Multivariate Behavioral Research, 42:103-132, 2007.
Dong Xia and Ming Yuan. Statistical inferences of linear forms for noisy matrix completion. arXiv preprint arXiv:1909.00116, 2019.
Luwan Zhang, Grace Wahba, and Ming Yuan. Distance shrinkage and Euclidean embedding via regularized kernel estimation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78:849-867, 2016.