# Unfolding-Model-Based Visualization: Theory, Method and Applications

Journal of Machine Learning Research 22 (2021) 1-51. Submitted 12/18; Revised 10/20; Published 1/21.

Yunxiao Chen (y.chen186@lse.ac.uk), Department of Statistics, London School of Economics and Political Science, London, WC2A 2AE, UK

Zhiliang Ying (zying@stat.columbia.edu), Department of Statistics, Columbia University, New York, NY 10027, USA

Haoran Zhang (hrzhang16@fudan.edu.cn), Shanghai Center for Mathematical Sciences, Fudan University, Shanghai, 200433, China

Editor: Tina Eliassi-Rad

©2021 Yunxiao Chen, Zhiliang Ying and Haoran Zhang. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v22/18-846.html.

Abstract

Multidimensional unfolding methods are widely used for visualizing item response data. Such methods project respondents and items simultaneously onto a low-dimensional Euclidean space, in which respondents and items are represented by ideal points, with person-person, item-item, and person-item similarities captured by the Euclidean distances between the points. In this paper, we study the visualization of multidimensional unfolding from a statistical perspective. We cast multidimensional unfolding into an estimation problem, where the respondent and item ideal points are treated as parameters to be estimated. An estimator is then proposed for the simultaneous estimation of these parameters. Asymptotic theory is provided for the recovery of the ideal points, shedding light on the validity of model-based visualization. An alternating projected gradient descent algorithm is proposed for the parameter estimation. We provide two illustrative examples, one on users' movie ratings and the other on senate roll call voting.

Keywords: multidimensional unfolding, data visualization, distance matrix completion, item response data, embedding

1. Introduction

Multidimensional unfolding (MDU) methods are widely used as an important data visualization tool in social and behavioral sciences such as psychology (Van Deun et al., 2007; Papesh and Goldinger, 2010), political science (Poole, 2000, 2005; Clinton et al., 2004a; Bakker and Poole, 2013), and marketing (DeSarbo and Hoffman, 1987; DeSarbo et al., 1997; Ho et al., 2010). MDU is regarded as the dominant method for the scaling of both preferential choice and attitude (Mair et al., 2015). The basic idea of MDU is to place both respondents and items in a joint Euclidean space based on data, with the understanding that respondents tend to prefer items that are close to them in the space. This joint visualization may lead to better understanding and interpretation of both the respondents and the items, as compared with visualizing the respondents and the items separately.

MDU has its origin in psychology (Bennett, 1956; Bennett and Hays, 1960; Hays and Bennett, 1961; Coombs, 1964). It is closely related to multidimensional scaling (MDS) methods (Kruskal, 1964; Kruskal and Wish, 1978; Borg and Groenen, 2005) and several other recent approaches to nonlinear dimension reduction and manifold learning (Tenenbaum et al., 2000; Lu et al., 2005; Chen and Buja, 2009; Zhang et al., 2016). MDU methods can be categorized into two types, algorithm-based and model-based.
Algorithm-based methods (e.g., Takane et al., 1977; Greenacre and Browne, 1986; de Leeuw and Mair, 2009) estimate the ideal points by minimizing a certain objective function, known as the stress function in the MDU literature. The classical algorithm-based methods have been implemented in the R package smacof (de Leeuw and Mair, 2009), which is widely used for MDU and MDS analysis. Model-based methods (e.g., DeSarbo and Hoffman, 1987; Hinich, 2005; Bakker and Poole, 2013), in contrast, infer the locations of the ideal points by making use of a probabilistic model. Such a model typically assumes that, up to some measurement error, the similarity between a person and an item is a decreasing function of a suitably defined distance between the corresponding ideal points. The specification of MDU models is closely related to that of item response theory models in psychometrics (see e.g., Embretson and Reise, 2000; Rabe-Hesketh and Skrondal, 2004; Bartholomew et al., 2011).

The MDU problem is closely related to MDS. The key difference is that data for the former do not contain direct measurements of within-set (i.e., person-person and item-item) similarities, while data for MDS typically contain such information. Largely due to the missing information on the within-set similarities, the MDU problem tends to be more challenging. As a result, degenerate solutions are often encountered in applications of MDU methods, in which case the visualization and the corresponding interpretations convey no information (e.g., Busing et al., 2005; Borg and Groenen, 2005), while MDS results tend to be more stable. These empirical observations suggest that it is important to study the validity of MDU solutions, which motivates the research in this paper.

This paper studies the visualization of MDU from a statistical perspective. First, for binary choice data, we formulate the MDU problem as a parameter estimation problem under a general family of probabilistic MDU models, where the respondent and item ideal points are treated as parameters to be estimated. Second, an estimator is proposed for the ideal points, and an asymptotic theory is provided for this estimator, shedding light on the validity of model-based visualization. Finally, an efficient alternating projected gradient algorithm is proposed for the computation, which is scalable to large-scale problems.

We illustrate the proposed method through two applications, one on movie rating and the other on senate roll call voting. The movie dataset is a subset of the well-known MovieLens dataset (Harper and Konstan, 2016). We unfold the 943 users and 338 movies in the dataset. Specifically, we study the users' movie-watching decisions. Based on the ideal points of movies in a two-dimensional space, it is found that one dimension of the space corresponds to the popularity of the movies and the other dimension corresponds to their release dates. A good understanding of the user ideal points is further obtained based on their distances to the movie ideal points. The senate voting dataset is based on the senate roll call voting records from the 108th congress in 2003-2004. Based on the unfolding of the senators and roll calls, it is found that most of the ideal points lie around a one-dimensional line, with the two extremes of the line representing the most liberal and the most conservative political standings.

The rest of the paper is organized as follows.
In Section 2, we introduce a family of MDU models and formulate the problem of joint configuration recovery as an estimation problem. In Section 3, we propose an estimator, for which statistical theory is established that guarantees the consistency of configuration recovery under reasonable conditions. Simulation studies and real data examples are presented in Sections 4 and 5, respectively. We end with discussions on future directions in Section 6. An application to cluster analysis, proofs of the theoretical results, and a numerical comparison with classical MDU methods are provided as supplementary materials.

2. Distance-based MDU

2.1 Distance-based Unfolding Model for Binary Data

Consider $N$ respondents making choices on $J$ binary items (e.g., "agree"/"disagree"). Let $Y_{ij}$ be a random variable denoting the response from respondent $i$ to item $j$, taking value 0 or 1, and let $y_{ij}$ be its realization. For example, such data can come from senate roll call voting, where the respondents are senators and the items correspond to roll calls. Response $Y_{ij} = 1$ means that senator $i$ supports roll call $j$, and $Y_{ij} = 0$ otherwise.

We provide a simulated example in Figure 1 to illustrate MDU analysis. Panel (a) shows the heat map of an observed response matrix consisting of 20 respondents and 10 items, where 0 and 1 responses are represented by red and yellow colors, respectively. Given choice data as in panel (a), an MDU method aims at representing respondents and items by ideal points in the same low-dimensional Euclidean space $\mathbb{R}^K$, as in panel (b) of Figure 1, that can be easily visualized, where the respondent-respondent, respondent-item, and item-item relationships are captured by the distances between the points. The dimension $K$ of the Euclidean space is often set to 2 or 3 for the purpose of visualization.

One way to conduct MDU is via a statistical model. An MDU model typically assumes that each respondent/item is associated with a true ideal point in $\mathbb{R}^K$ that is represented by a $K$-dimensional parameter vector. Let $\theta_i = (\theta_{i1}, ..., \theta_{iK})$ and $a_j = (a_{j1}, ..., a_{jK})$ denote the parameter vectors of respondent $i$ and item $j$, respectively. It is assumed that response $Y_{ij}$ is determined by the Euclidean distance between $\theta_i$ and $a_j$ in $\mathbb{R}^K$. Finally, we use $\Theta_N = (\theta_{ik})_{N \times K}$ and $A_J = (a_{jk})_{J \times K}$ to denote the matrices containing all the person and item ideal points, respectively. Under such a statistical model, the goal of MDU becomes to estimate the person and item parameters based on data.

In this paper, we focus on MDU models taking the form

\[
P(Y_{ij} = 1 \mid \theta_i, a_j) = f(\|\theta_i - a_j\|^2), \tag{1}
\]

where $\|\cdot\|$ denotes the standard $L_2$ norm and $f : [0, \infty) \to [0, 1]$ is a pre-specified link function. It is assumed that the responses $Y_{ij}$ are conditionally independent, given the ideal points $\theta_i$ and $a_j$, $i = 1, ..., N$, $j = 1, ..., J$. This model falls under the general framework of the MDU threshold model for binary choice data (see DeSarbo and Hoffman, 1987). According to the form of (1), the distribution of the data only depends on the squared distance between every pair of person and item ideal points, $d_{ij} = \|\theta_i - a_j\|^2$, $i = 1, ..., N$, $j = 1, ..., J$.

Figure 1: An illustrative example. Panel (a): The heatmap of a response matrix, where 0 and 1 responses are represented by red and yellow colors, respectively. Panel (b): The respondent and item ideal points, where black circles represent respondents and red triangles represent items.
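To make the data-generating process in (1) concrete, here is a minimal simulation sketch (our own illustration in Python, not the paper's R implementation; all sizes and the sampling distribution of the ideal points are assumptions), using the logistic-type link $f(x) = 2/(1 + \exp(x))$ discussed below.

```python
import numpy as np

rng = np.random.default_rng(0)
N, J, K = 20, 10, 2                       # respondents, items, latent dimension

# Hypothetical true ideal points, drawn uniformly from [-1, 1]^K
theta = rng.uniform(-1, 1, size=(N, K))   # person ideal points
a = rng.uniform(-1, 1, size=(J, K))       # item ideal points

# Partial (squared) distance matrix d_ij = ||theta_i - a_j||^2
D = ((theta[:, None, :] - a[None, :, :]) ** 2).sum(axis=2)

# Link f(x) = 2 / (1 + exp(x)): monotone decreasing with f(0) = 1
P = 2.0 / (1.0 + np.exp(D))               # P(Y_ij = 1 | theta_i, a_j)
Y = rng.binomial(1, P)                    # conditionally independent responses
```

Since $D \ge 0$ entrywise, the probabilities stay in $(0, 1]$, matching the range of the link function.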
The matrix $D_{N,J} = (d_{ij})_{N \times J}$ is known as the corresponding partial distance matrix, where the subscripts of $D_{N,J}$ emphasize the dependence of this matrix on the numbers of respondents and items. In addition, the link function $f$ is often assumed to be monotone decreasing, so that a larger distance implies a lower probability of $Y_{ij} = 1$. An example of such a link function is $f(x) = 2/(1 + \exp(x))$. When $f$ takes this form, $P(Y_{ij} = 1 \mid \theta_i, a_j) = 1$ when the distance between $\theta_i$ and $a_j$ is 0, i.e., the two points are identical, and the probability $P(Y_{ij} = 1 \mid \theta_i, a_j)$ decays towards 0 as the distance increases. In what follows, we provide two remarks on this modeling framework.

Remark 1 We remark on the link function $f$, which plays a similar role as the dissimilarity transformation function in the classical MDS and MDU methods (e.g., Chapter 9, Borg and Groenen, 2005). Assuming a pre-specified $f$ is similar to assuming an identity transformation in classical MDU. In classical MDS and MDU, the dissimilarity transformation function can be unknown and estimated from data parametrically or non-parametrically. A similar treatment can be applied to the link function $f$. For example, one may assume $f(\|\theta_i - a_j\|^2) = g(\beta_0 + \beta_1 \|\theta_i - a_j\|^2)$, where $g : \mathbb{R} \to [0, 1]$ is a given monotone decreasing function and $\beta_0$ and $\beta_1$ are additional parameters to be estimated from data together with the person- and item-specific parameters. This form is similar in spirit to the interval transformation in classical MDU. When no constraint is imposed on the scales of the $\theta_i$s and $a_j$s, $\beta_1$ needs to be fixed at a constant (e.g., $\beta_1 = 1$) for model identifiability. One may also estimate $f$ non-parametrically, for example, by using monotone splines. Under suitable regularity conditions, our theoretical development in Section 3 can be extended to the case where $f$ also needs to be estimated from data.

Remark 2 Although we focus on binary data, the introduced modeling framework can be easily extended to other types of preference data, such as rating and ranking data. For example, consider rating data $Y_{ij} \in \{0, ..., T\}$, where $0, 1, ..., T$ are $T + 1$ ordered response categories. A higher category implies a higher level of agreement between the respondent and the item. Then one can assume the following unfolding model

\[
P(Y_{ij} \geq t \mid \theta_i, a_j) = g\bigl(d_t + \|\theta_i - a_j\|^2\bigr), \tag{2}
\]

for $t \in \{1, ..., T\}$, where $g : \mathbb{R} \to [0, 1]$ is a given monotone decreasing function and $d_1, ..., d_T$ are additional model parameters. It implies that the larger the distance, the smaller the probability of $Y_{ij}$ taking a large value. This model is closely related to the graded response model (Samejima, 1997) in item response theory. For another example, consider ranking data consisting of pair-wise comparisons, where each response is a comparison between two items $j$ and $j'$. Following the same idea as above, one may model the probability that item $j$ is preferred over $j'$ as $g(\|\theta_i - a_j\|^2 - \|\theta_i - a_{j'}\|^2)$. That is, the probability decreases with the difference of the items' squared distances to person $i$. Our theoretical results and computational algorithm given below can be adapted to these situations (see the sketch below for the rating-data case).
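A minimal sketch of the rating-data model (2), under the assumptions that $g(x) = 1/(1 + \exp(x))$ and that the hypothetical thresholds satisfy $d_1 < \cdots < d_T$ so that the cumulative probabilities decrease in $t$ (both are our illustrative choices, not prescribed by the paper):

```python
import numpy as np

def category_probs(sq_dist, d):
    """Category probabilities P(Y = t), t = 0,...,T, under model (2):
    P(Y >= t) = g(d_t + ||theta - a||^2) with g(x) = 1/(1 + exp(x)).
    `d` holds increasing thresholds d_1 < ... < d_T."""
    g = lambda x: 1.0 / (1.0 + np.exp(x))
    # Cumulative probabilities: P(Y >= 0) = 1 and P(Y >= T+1) = 0 by convention
    cum = np.concatenate(([1.0], g(np.asarray(d) + sq_dist), [0.0]))
    return cum[:-1] - cum[1:]             # P(Y = t) = P(Y >= t) - P(Y >= t+1)

# Example: T = 3 ordered categories above 0, at squared distance 0.8
print(category_probs(0.8, d=[-1.0, 0.0, 1.0]))   # nonnegative, sums to 1
```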
2.2 Recovery of Configuration

Our main goal is the simultaneous recovery of the ideal points $\theta_i$ and $a_j$, based on the observed binary responses $y_{ij}$, $i = 1, ..., N$, $j = 1, ..., J$. Since the model only relies on the Euclidean distances between the ideal points, two sets of points lead to the same model if they have the same configuration, i.e., if one set of points can be obtained by applying an isometry mapping to the other. This is because the distance between points is invariant under an isometry mapping. An isometry mapping $F$ in $\mathbb{R}^K$ takes the form $F(x) = Ox + b$, $x \in \mathbb{R}^K$, where $O$ is a $K \times K$ orthogonal matrix and $b$ is a vector in $\mathbb{R}^K$ (see, e.g., Olver, 1999). We further denote by $\mathcal{A}_K$ the set of all isometry mappings on $\mathbb{R}^K$. Without additional information, the best possible result one can expect is to recover the ideal points up to an isometry mapping. We refer to this problem as the recovery of the ideal point configuration.

It is worth noting that regularity conditions are needed to ensure the recovery of the configuration. That is, it is possible that there exist multiple sets of ideal points with different configurations that lead to the same distribution of the $Y_{ij}$s. In other words, the configuration of $\{\theta_1, ..., \theta_N, a_1, ..., a_J\}$ may not be unique given only the partial distance matrix. This is known as the situation of degeneration, in which case the visualization conveys no information or can even be misleading. A simple example is given in Figure 2, where the two different configurations in the two panels have the same partial distance matrix.

Figure 2: An example of a degenerate situation: The triangles represent item points and the circles represent person points. The two configurations in $\mathbb{R}^2$ share the same partial distance matrix, where $d_{11} = 1^2$, $d_{12} = 3^2$, $d_{21} = 2^2$, $d_{22} = 2^2$, $d_{31} = 3^2$, and $d_{32} = 1^2$.

Following the above discussion, the validity of unfolding-model-based visualization relies on the accuracy of configuration recovery, a problem to be discussed below. Specifically, we consider the following loss function for configuration recovery,

\[
\min_{F \in \mathcal{A}_K} \frac{\sum_{i=1}^N \|\theta_i^* - F(\hat\theta_i)\|^2 + \sum_{j=1}^J \|a_j^* - F(\hat a_j)\|^2}{N + J}, \tag{3}
\]

where $\theta_i^*$ and $a_j^*$ denote the true ideal points and $\hat\theta_i$ and $\hat a_j$ denote the estimates from data $(y_{ij})_{N \times J}$. Note that (3) quantifies the accuracy of configuration recovery in an average sense, where the isometry indeterminacy is bypassed by the minimization in (3) with respect to all isometry mappings in $\mathcal{A}_K$. We call (3) the average loss for the recovery of the ideal point configuration. Error bounds will be established for (3) under reasonable conditions, which ensures the accurate recovery of the configuration when both $N$ and $J$ are large.
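The minimization over isometries in (3) is an orthogonal Procrustes problem with a closed-form solution via the singular value decomposition. A sketch of evaluating the loss (our own illustration, assuming for simplicity that the true and estimated point sets have the same dimension; the stacked rows contain both the person and the item points):

```python
import numpy as np

def average_config_loss(true_pts, est_pts):
    """Average loss (3): align the estimates to the truth with the best
    isometry F(x) = Ox + b (O orthogonal, b a translation), found by SVD."""
    X, Y = np.asarray(est_pts), np.asarray(true_pts)   # both (N+J) x K
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)              # optimal b uses centroids
    U, _, Vt = np.linalg.svd(Xc.T @ Yc)                # K x K cross-covariance
    aligned = Xc @ U @ Vt + Y.mean(0)                  # best F applied to X
    return np.sum((Y - aligned) ** 2) / len(Y)
```

Because $O$ is only required to be orthogonal, reflections are allowed and no determinant correction on the SVD factors is needed.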
2.3 Connection with Other Scaling Methods

MDU is closely related to MDS, a class of methods for visualizing the similarity pattern between data points (Borg and Groenen, 2005). More precisely, MDS maps a set of variables onto a low-dimensional space, based on data measuring the similarity between the variables. As pointed out in Chapter 14 of Borg and Groenen (2005), MDU can be viewed as a special case of MDS, where the set of variables in MDS is composed of both the respondents and the items, and the item response data $(y_{ij})_{N \times J}$ are regarded as measures of similarity between the respondents and the items, while the similarities within the two sets (i.e., respondents and items) are structurally missing; see Figure 3 for an illustration, which is a reproduction of Figure 14.1 of Borg and Groenen (2005).

Figure 3: In MDU, the diagonal blocks are missing. All we observe are the off-diagonal blocks.

Little statistical theory has been developed for the recovery of configuration based on MDS models. The most relevant work is Zhang et al. (2016), in which an error bound is developed for the recovery of the complete distance matrix, under a linear MDS model without structurally missing data. However, little discussion is provided there on the recovery of the ideal point configuration under an MDU setting. The recovery of configuration is relatively easier under the setting of MDS with no structurally missing data. This is because the complete data matrix of similarities provides sufficient information on the complete distance matrix. The accurate recovery of the complete distance matrix further implies the accurate recovery of the configuration under weak conditions, due to the one-to-one relationship between the complete distance matrix and the ideal point configuration, as described in Proposition 3. Under the MDU setting, the recovery of configuration requires additional regularity conditions, due to the lack of direct measurements of the within-set distances.

Proposition 3 For $\{x_1, ..., x_n\} \subset \mathbb{R}^K$ and $\{y_1, ..., y_n\} \subset \mathbb{R}^K$, if $\|x_i - x_j\| = \|y_i - y_j\|$ for all $i$ and $j$, then there exists an isometry mapping $F \in \mathcal{A}_K$ such that $F(x_i) = y_i$ for all $i = 1, ..., n$.

MDU is also related to other scaling methods for binary data, such as item response theory (IRT; Embretson and Reise, 2000; Reckase, 2009) and multiple correspondence analysis (Gifi, 1990; Le Roux and Rouanet, 2010). Specifically, probabilistic models are available from IRT for multivariate binary data. An IRT model also represents respondents and items by low-dimensional parameter vectors, say $\theta_i$ and $a_j$, and assumes that the probability of $Y_{ij} = 1$ is a function of $\theta_i$ and $a_j$. In this sense, the model introduced above can be viewed as a special IRT model, in which the probability of $Y_{ij} = 1$ is assumed to be a monotone decreasing function of $\|\theta_i - a_j\|$. However, the classical IRT models (see e.g., Embretson and Reise, 2000; Reckase, 2009) are not specified in this way. Consequently, it does not make sense to visualize their person and item parameter vectors jointly. Multiple correspondence analysis is an algorithm-based approach that can be applied to binary data and produces low-dimensional scores for both respondents and items. These score vectors can be plotted jointly in the same space. However, as a common issue with algorithm-based approaches, the meaning of the distance between the score vectors is not clear, and the uncertainty associated with the visualization is hard to quantify.
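The one-to-one relationship behind Proposition 3 is constructive: given a complete matrix of squared distances, classical MDS recovers a configuration (up to isometry) in closed form by double centering (e.g., Borg and Groenen, 2005). A minimal sketch of this standard computation, as a contrast with the MDU setting where the diagonal blocks of the distance matrix are missing:

```python
import numpy as np

def config_from_squared_dists(D2, K):
    """Classical MDS: recover a K-dimensional configuration from a complete
    matrix of squared distances via double centering."""
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n     # centering matrix
    B = -0.5 * J @ D2 @ J                   # Gram matrix of the centered points
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:K]        # top-K eigenpairs
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))
```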
3. Theoretical Results

3.1 Configuration Recovery based on Perturbed Partial Distances

We first study the recovery of configuration from a perturbed partial distance matrix, when both $N$ and $J$ grow to infinity. Let $\theta_i^*$, $i = 1, ..., N$, and $a_j^*$, $j = 1, ..., J$, be the true person and item ideal points in $\mathbb{R}^K$, respectively, and let $D^*_{N,J}$ be the corresponding partial distance matrix. In addition, let $\tilde\theta_i \in \mathbb{R}^{K^+}$ and $\tilde a_j \in \mathbb{R}^{K^+}$ correspond to a perturbed version of the true configuration, satisfying

\[
\|\tilde D_{N,J} - D^*_{N,J}\|_F^2 = o(NJ) \tag{4}
\]

and $K^+ \geq K$, where $\tilde D_{N,J}$ denotes the partial distance matrix given by the perturbed configuration and $\|\cdot\|_F$ denotes the matrix Frobenius norm. One can think of $K^+$ as the latent dimension of the MDU model being applied to the data, and $\tilde\theta_i$ and $\tilde a_j$ as some estimates of the person and item ideal points. For the time being, we treat $K^+$, $\tilde\theta_i$, and $\tilde a_j$ as given. Based on the definition of the matrix Frobenius norm, the left side of (4) has $NJ$ terms, each of which is a squared difference between a true person-item squared distance and its perturbed value. Equation (4) implies that the perturbed partial distance matrix converges to the true one in an average sense, when both $N$ and $J$ grow to infinity.

We denote $\theta_i^+ = ((\theta_i^*)^\top, 0^\top)^\top$ and $a_j^+ = ((a_j^*)^\top, 0^\top)^\top$ in $\mathbb{R}^{K^+}$ as the embeddings of the true ideal points in $\mathbb{R}^{K^+}$, where $0$ denotes a zero vector. In what follows, we show that

\[
\min_{F \in \mathcal{A}_{K^+}} \frac{\sum_{i=1}^N \|\theta_i^+ - F(\tilde\theta_i)\|^2 + \sum_{j=1}^J \|a_j^+ - F(\tilde a_j)\|^2}{N + J}
\]

converges to zero as $N$ and $J$ grow to infinity, under reasonable conditions on the true ideal points. Throughout this paper, we assume that the ideal points are constrained to a compact set in $\mathbb{R}^K$.

A0. There exists a constant $M$ such that $\|\theta_i^*\| \leq M$ and $\|a_j^*\| \leq M$ for all $i$ and $j$.

To impose regularity conditions on the true configuration of the $N + J$ ideal points, which can vary with $N$ and $J$, we introduce the notion of anchor points: two finite sets of points in $\mathbb{R}^K$ satisfying certain regularities that are independent of $N$ and $J$.

Definition 4 Two sets of points, $\{b_1^*, ..., b_{k_1}^*\}$ and $\{c_1^*, ..., c_{k_2}^*\} \subset B_0^K(M)$, are called a collection of anchor points of $\mathbb{R}^K$ if they satisfy conditions A1 and A2 below, where $B_0^K(M)$ denotes the closed ball in $\mathbb{R}^K$ centered at $0$ with radius $M$. Let $D^* = (\|b_i^* - c_j^*\|^2)_{k_1 \times k_2}$ be the partial distance matrix based on the anchor points, whose entries are assumed to be all positive (i.e., there are no identical points).

A1. There exists $\eta > 0$ such that any partial distance matrix $\tilde D \in \mathbb{R}^{k_1 \times k_2}$ satisfying $\|\tilde D - D^*\|_F < \eta$ has a unique configuration.

A2. Both $\{b_1^*, ..., b_{k_1}^*\}$ and $\{c_1^*, ..., c_{k_2}^*\}$ affinely span $\mathbb{R}^K$.

Remark 5 According to Definition 4, we still obtain a collection of anchor points when slightly perturbing the points in a given anchor point collection in $\mathbb{R}^K$. According to condition A1, the anchor points are well-behaved points whose configuration can be uniquely determined by the partial distance matrix, even after a small perturbation. In addition, thanks to A2, the anchor points help to anchor the rest of the points in $\mathbb{R}^K$, i.e., to determine the configuration of a larger set of respondent and item ideal points.

Following the above concept of anchor points, it is intuitive that if there exists a collection of anchor points $\{b_1^*, ..., b_{k_1}^*\}$ and $\{c_1^*, ..., c_{k_2}^*\}$ such that each $b_i^*$ is surrounded by sufficiently many respondent ideal points and each $c_j^*$ is surrounded by sufficiently many item ideal points, then it is relatively easy to recover the configuration of the ideal points from a perturbed partial distance matrix. This intuition is formalized by condition A3 below.

A3. There exist a collection of anchor points $\{b_1^*, ..., b_{k_1}^*\}$ and $\{c_1^*, ..., c_{k_2}^*\} \subset B_0^K(M) \subset \mathbb{R}^K$ and $0 < \epsilon < M/10$ such that the closed balls with $b_1^*, ..., b_{k_1}^*$ and $c_1^*, ..., c_{k_2}^*$ as centers and radius $\epsilon$, denoted by $B_{b_1^*}(\epsilon), ..., B_{b_{k_1}^*}(\epsilon)$ and $B_{c_1^*}(\epsilon), ..., B_{c_{k_2}^*}(\epsilon)$, do not overlap. The following two conditions are required to hold.

(1) For any $b_1 \in B_{b_1^*}(\epsilon), ..., b_{k_1} \in B_{b_{k_1}^*}(\epsilon)$ and $c_1 \in B_{c_1^*}(\epsilon), ..., c_{k_2} \in B_{c_{k_2}^*}(\epsilon)$, the sets $\{b_1, ..., b_{k_1}\}$ and $\{c_1, ..., c_{k_2}\}$ are also a collection of anchor points.

(2) As $N$ and $J$ grow to infinity,

\[
p_i = \liminf_{N \to \infty} \frac{\sum_{l=1}^N 1_{\{\|\theta_l^* - b_i^*\| < \epsilon\}}}{N} > 0, \quad i = 1, ..., k_1,
\]
\[
q_j = \liminf_{J \to \infty} \frac{\sum_{l=1}^J 1_{\{\|a_l^* - c_j^*\| < \epsilon\}}}{J} > 0, \quad j = 1, ..., k_2.
\]
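For a given candidate anchor set, the proportions in A3(2) are simple to compute empirically. A small sketch (our own; `points`, `anchors`, and `eps` are hypothetical inputs):

```python
import numpy as np

def anchor_coverage(points, anchors, eps):
    """For each anchor, the fraction of ideal points inside the eps-ball
    around it -- the quantity whose liminf must be positive in A3(2)."""
    points, anchors = np.asarray(points), np.asarray(anchors)
    dists = np.linalg.norm(points[:, None, :] - anchors[None, :, :], axis=2)
    return (dists < eps).mean(axis=0)     # one proportion per anchor
```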
Theorem 6 Suppose that A0 and A3 are satisfied for the true ideal points $\theta_i^*$ and $a_j^*$, $i = 1, ..., N$, $j = 1, ..., J$. Let $\tilde\theta_i, \tilde a_j \in B_0^{K^+}(M)$ correspond to a perturbed version of the true configuration, for some $K^+ \geq K$, and let $\tilde D_{N,J}$ be the corresponding partial distance matrix. Suppose that $\|\tilde D_{N,J} - D^*_{N,J}\|_F^2 = o(NJ)$ as $N$ and $J$ grow to infinity. Then

\[
\limsup_{N,J \to \infty} \; \min_{F \in \mathcal{A}_{K^+}} \frac{\sum_{i=1}^N \|\theta_i^+ - F(\tilde\theta_i)\|^2 + \sum_{j=1}^J \|a_j^+ - F(\tilde a_j)\|^2}{N + J} \leq C\epsilon,
\]

where $C$ is a constant that does not depend on $N$ and $J$. If there exists a fixed collection of anchor points for which A3 is satisfied for any sufficiently small $\epsilon > 0$, then we have

\[
\limsup_{N,J \to \infty} \; \min_{F \in \mathcal{A}_{K^+}} \frac{\sum_{i=1}^N \|\theta_i^+ - F(\tilde\theta_i)\|^2 + \sum_{j=1}^J \|a_j^+ - F(\tilde a_j)\|^2}{N + J} = 0.
\]

Remark 7 Theorem 6 shows that the configuration can be recovered asymptotically when both $N$ and $J$ grow to infinity and suitable conditions hold. The conditions required by Theorem 6 are quite mild. The theorem first requires all the true and perturbed ideal points to be located in a compact set. Second, as will be shown in Proposition 8 below, condition A3 is satisfied with high probability when the true person and item points are i.i.d. samples from two distributions satisfying mild conditions. Finally, it requires that the perturbation of the partial distance matrix is not too large, i.e., $\|\tilde D_{N,J} - D^*_{N,J}\|_F^2 = o(NJ)$. As will be shown in Proposition 11, this condition holds with high probability when $\tilde D_{N,J}$ is given by a likelihood-based estimator.

Proposition 8 Suppose that $\theta_1^*, ..., \theta_N^*$ and $a_1^*, ..., a_J^*$ are independent and identically distributed samples from distributions $P_1$ and $P_2$, respectively, where $P_1$ and $P_2$ have positive and continuous density functions within a ball $G \subset B_0^K(M)$. Then A3 holds almost surely for any sufficiently small $\epsilon > 0$.

Remark 9 We remark that the constant $C$ is determined, and only determined, by the configuration of the anchor points in A3, according to our proof in the supplementary material. Roughly, the more regular the set of anchor points is (in terms of affinely spanning $\mathbb{R}^K$), the smaller the value of $C$.

Remark 10 As discussed in Section 2.2, we can only recover the ideal points up to an isometry mapping. This isometry mapping may be fixed if one is willing to make further assumptions such as non-negativity (Donoho and Stodden, 2004; Hoyer, 2004) or sparsity (Chen et al., 2020). In that case, one may further interpret each coordinate of the latent space. We leave this problem for future investigation.

3.2 Likelihood-based Estimation

In what follows, we propose a constrained maximum likelihood estimator and show its properties. Given the assumptions of the MDU model, the likelihood function takes the form

\[
L(\theta_1, ..., \theta_N, a_1, ..., a_J) = \prod_{i=1}^N \prod_{j=1}^J f(\|\theta_i - a_j\|^2)^{y_{ij}} \bigl(1 - f(\|\theta_i - a_j\|^2)\bigr)^{1 - y_{ij}}.
\]

Based on this likelihood function, we consider the following estimator:

\[
(\hat\theta_1, ..., \hat\theta_N, \hat a_1, ..., \hat a_J) = \mathop{\arg\min}_{\theta_1, ..., \theta_N, a_1, ..., a_J \in \mathbb{R}^{K^+}} \; -\log L(\theta_1, ..., \theta_N, a_1, ..., a_J)
\]
\[
\text{s.t. } \|\theta_i\| \leq M, \; \|a_j\| \leq M, \; i = 1, ..., N, \; j = 1, ..., J, \tag{7}
\]

where $K^+$ and $M$ are pre-specified. We denote by $\hat D_{N,J}$ the partial distance matrix based on the $\hat\theta_i$s and $\hat a_j$s from (7). We impose the following regularity condition on the link function $f$, which requires $f$ to be neither too steep nor too flat in the feasible domain. Similar conditions are assumed in Davenport et al. (2014) for solving a 1-bit matrix completion problem.

A4. The link function $f : \mathbb{R} \to (0, 1)$ is a smooth and monotone decreasing function satisfying $L_{4M^2} < \infty$ and $\beta_{4M^2} < \infty$, where

\[
L_\alpha = \sup_{|x| \leq \alpha} \frac{|f'(x)|}{f(x)(1 - f(x))}, \quad \text{and} \quad \beta_\alpha = \sup_{|x| \leq \alpha} \frac{f(x)(1 - f(x))}{f'(x)^2}.
\]

Proposition 11 Suppose that A0 and A4 are satisfied and $K^+ \geq K$. Then there exist constants $C_1$ and $C_2$ independent of $N$ and $J$ such that

\[
\frac{1}{NJ} \|\hat D_{N,J} - D^*_{N,J}\|_F^2 \leq C_1 M^2 L_{4M^2} \beta_{4M^2} \sqrt{\frac{(N + J)\bigl(1 + \log(NJ)\bigr)}{NJ}}
\]

with probability at least $1 - C_2/(N + J)$.
Proposition 11 implies that $\|\hat D_{N,J} - D^*_{N,J}\|_F^2 = o_p(NJ)$, which, combined with Theorem 6, leads to Theorem 12 below.

Theorem 12 Suppose that A0, A3 and A4 are satisfied and $K^+ \geq K$. Then

\[
\lim_{N,J \to \infty} P\left( \min_{F \in \mathcal{A}_{K^+}} \frac{\sum_{i=1}^N \|\theta_i^+ - F(\hat\theta_i)\|^2 + \sum_{j=1}^J \|a_j^+ - F(\hat a_j)\|^2}{N + J} \leq C\epsilon \right) = 1, \tag{8}
\]

where $\hat\theta_i$ and $\hat a_j$, $i = 1, ..., N$, $j = 1, ..., J$, are given by (7), $\epsilon$ is from condition A3, and $C$ is a constant independent of $\epsilon$, $N$, and $J$.

Remark 13 We remark that if A3 holds for any sufficiently small $\epsilon$, then (8) implies that the loss

\[
\min_{F \in \mathcal{A}_{K^+}} \frac{\sum_{i=1}^N \|\theta_i^+ - F(\hat\theta_i)\|^2 + \sum_{j=1}^J \|a_j^+ - F(\hat a_j)\|^2}{N + J}
\]

converges to zero in probability. Further note that, according to Proposition 8, A3 holds with high probability for any sufficiently small $\epsilon$ under a random design for the true ideal points. Therefore, the loss can be shown to converge to zero in probability under this random design. This result is summarized in Theorem 14 below.

Theorem 14 Suppose that A0 and A4 are satisfied and $K^+ \geq K$. Further suppose that $\theta_1^*, ..., \theta_N^*$ and $a_1^*, ..., a_J^*$ are independent and identically distributed samples from distributions $P_1$ and $P_2$, respectively, where $P_1$ and $P_2$ have positive and continuous density functions within a ball $G \subset B_0^K(M)$. Then for $\hat\theta_i$ and $\hat a_j$, $i = 1, ..., N$, $j = 1, ..., J$, given by (7), the loss function

\[
\min_{F \in \mathcal{A}_{K^+}} \frac{\sum_{i=1}^N \|\theta_i^+ - F(\hat\theta_i)\|^2 + \sum_{j=1}^J \|a_j^+ - F(\hat a_j)\|^2}{N + J}
\]

goes to 0 in probability as $N$ and $J$ grow to infinity.

Remark 15 We remark that the probability measures in Theorems 12 and 14 are slightly different. The probability in Theorem 12 is based on the conditional distribution of the $Y_{ij}$s given the $\theta_i^*$s and $a_j^*$s, while that in Theorem 14 is based on the joint distribution of $Y_{ij}$, $\theta_i^*$ and $a_j^*$, $i = 1, ..., N$, $j = 1, ..., J$.

Remark 16 A stress function is a squared error loss function that plays an important role in the classical MDS/MDU algorithms. It serves not only as the objective function in the search for the MDS/MDU solution, but also as the basis for assessing the goodness of fit of the solution (Mair et al., 2016). In the proposed framework, the negative joint log-likelihood function plays a similar role as the stress function. It replaces the squared loss in the stress function by a loss function based on the Kullback-Leibler divergence. Goodness-of-fit measures similar to those in classical MDU can be developed under the proposed framework, based on the negative joint log-likelihood.

Remark 17 We remark on the choice of the latent dimension. Theorems 12 and 14 suggest that as long as we choose $K^+$ to be no less than the true dimension $K$, the unfolding result is asymptotically valid. When there is no such prior knowledge about an upper bound of $K$, one can estimate the latent dimension $K$ using data. Several methods from factor analysis and network data analysis may be adapted to the current problem, such as trace-norm regularization (Bach, 2008), cross-validation (Chen and Lei, 2018; Li et al., 2020), and information criteria (Bai and Ng, 2002). We believe that consistency results on the selection of $K$ can be established.

Remark 18 We point out that the result of Proposition 11 can be easily extended to other MDU models, such as models with additional parameters in the link function and models for rating and ranking data. Then, by making use of Theorem 6, the results of Theorems 12 and 14 can also be extended to these models.

We propose an alternating minimization algorithm for solving (7). To handle the constraints in (7), a projected gradient descent update is used in each iteration. For $x \in \mathbb{R}^{K^+}$, we define the following projection operator:

\[
\mathrm{Proj}_M(x) = \mathop{\arg\min}_{\|y\| \leq M} \|y - x\| = \begin{cases} x & \text{if } \|x\| \leq M, \\ M x / \|x\| & \text{if } \|x\| > M. \end{cases}
\]

Algorithm 1 (Alternating minimization algorithm)

Input: Data $(y_{ij})_{N \times J}$, pre-specified dimension $K^+$, constraint constant $M$, iteration number $m = 1$, and initial values $\theta_1^{(0)}, ..., \theta_N^{(0)}$ and $a_1^{(0)}, ..., a_J^{(0)}$ in $\mathbb{R}^{K^+}$.

Alternating minimization: at the $m$th iteration, perform

(a) For each respondent $i$, update

\[
\theta_i^{(m)} = \mathrm{Proj}_M\Bigl( \theta_i^{(m-1)} + \varrho \, \nabla s_i^{(m-1)}\bigl(\theta_i^{(m-1)}\bigr) \Bigr), \quad \text{where}
\]
\[
s_i^{(m-1)}(\theta) = \sum_{j=1}^J \Bigl[ y_{ij} \log f(\|\theta - a_j^{(m-1)}\|^2) + (1 - y_{ij}) \log\bigl(1 - f(\|\theta - a_j^{(m-1)}\|^2)\bigr) \Bigr].
\]

The step size $\varrho > 0$ is chosen by line search.

(b) For each item $j$, update

\[
a_j^{(m)} = \mathrm{Proj}_M\Bigl( a_j^{(m-1)} + \varrho \, \nabla s_j^{(m-1)}\bigl(a_j^{(m-1)}\bigr) \Bigr), \quad \text{where}
\]
\[
s_j^{(m-1)}(a) = \sum_{i=1}^N \Bigl[ y_{ij} \log f(\|\theta_i^{(m)} - a\|^2) + (1 - y_{ij}) \log\bigl(1 - f(\|\theta_i^{(m)} - a\|^2)\bigr) \Bigr].
\]

The step size $\varrho > 0$ is chosen by line search.

Iteratively perform steps (a) and (b) until convergence. Let $m^*$ be the last iteration number upon convergence.

Output: $\hat\theta_1 = \theta_1^{(m^*)}, ..., \hat\theta_N = \theta_N^{(m^*)}$ and $\hat a_1 = a_1^{(m^*)}, ..., \hat a_J = a_J^{(m^*)}$.

Remark 19 Since (7) is not a convex optimization problem, there is no guarantee that Algorithm 1 finds the global optimal solution. However, we point out that the previous theoretical results hold even when $\{\hat\theta_1, ..., \hat\theta_N, \hat a_1, ..., \hat a_J\}$ is not a global optimum. Specifically, Proposition 11 and Theorems 12 and 14 hold for any $\{\hat\theta_1, ..., \hat\theta_N, \hat a_1, ..., \hat a_J\}$ satisfying the constraints in (7) and

\[
L(\hat\theta_1, ..., \hat\theta_N, \hat a_1, ..., \hat a_J) \geq L(\theta_1^*, ..., \theta_N^*, a_1^*, ..., a_J^*). \tag{9}
\]

According to our simulation study, estimates given by Algorithm 1 are likely to satisfy (9).
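For concreteness, here is a compact NumPy sketch of Algorithm 1, written by us under the link $f(x) = 2/(1 + \exp(x + \delta))$ used later in Sections 4 and 5; a fixed step size replaces the paper's line search for brevity, and the optional mask anticipates the missing-data likelihood of Section 3.3. It is an illustration, not the authors' R package.

```python
import numpy as np

def fit_mdu(Y, K, M, lr=0.01, n_iter=500, mask=None, seed=0):
    """Alternating projected gradient ascent on the joint log-likelihood,
    with link f(x) = 2/(1 + exp(x + delta)). Unobserved entries of Y should
    be filled with an arbitrary value (e.g., 0) and zeroed out in `mask`."""
    delta = 0.1
    rng = np.random.default_rng(seed)
    N, J = Y.shape
    if mask is None:
        mask = np.ones_like(Y, dtype=float)    # all entries observed

    theta = rng.normal(scale=0.1, size=(N, K))
    a = rng.normal(scale=0.1, size=(J, K))

    def project(X):                            # Proj_M applied row-wise
        norms = np.linalg.norm(X, axis=1, keepdims=True)
        return X * np.minimum(1.0, M / np.maximum(norms, 1e-12))

    def grad_coef(D):                          # d(log-lik)/d(d_ij), shared term
        f = 2.0 / (1.0 + np.exp(D + delta))
        fprime = -f * (1.0 - f / 2.0)          # derivative of the link
        return mask * fprime * (Y / f - (1 - Y) / (1 - f))

    for _ in range(n_iter):
        D = ((theta[:, None, :] - a[None, :, :]) ** 2).sum(2)
        C = grad_coef(D)                       # N x J
        # d d_ij / d theta_i = 2 (theta_i - a_j); chain rule, summed over j
        theta = project(theta + lr * 2 * (C.sum(1, keepdims=True) * theta - C @ a))
        D = ((theta[:, None, :] - a[None, :, :]) ** 2).sum(2)
        C = grad_coef(D)
        a = project(a + lr * 2 * (C.sum(0)[:, None] * a - C.T @ theta))
    return theta, a
```

Since $d_{ij} \ge 0$ and $\delta > 0$, the link values stay strictly inside $(0, 1)$, so the logarithms in the gradient are always well defined.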
3.3 Analyzing Missing Data

We further discuss the configuration recovery problem when data have many missing values, which is commonly encountered in practice. Denote the matrix $\Omega = (\omega_{ij})_{N \times J}$, where $\omega_{ij} = 1$ indicates that response $y_{ij}$ is observed and $\omega_{ij} = 0$ indicates that $y_{ij}$ is missing. We consider the simple case of uniform missingness, as described in condition A5. We point out that this assumption can be relaxed to handle data with non-uniformly missing entries, following the developments in Cai and Zhou (2013) for solving a 1-bit matrix completion problem.

A5. The entries $\omega_{ij}$ of $\Omega$ are independent and identically distributed Bernoulli random variables with $P(\omega_{ij} = 1) = n/(NJ)$.

Under this condition, on average $n$ entries of the data matrix $(y_{ij})_{N \times J}$ are observable. Thanks to the ignorable missingness, given $\Omega$ and the observed data, the likelihood becomes

\[
L_\Omega(\theta_1, ..., \theta_N, a_1, ..., a_J) = \prod_{\{(i,j) \,:\, \omega_{ij} = 1\}} f(\|\theta_i - a_j\|^2)^{y_{ij}} \bigl(1 - f(\|\theta_i - a_j\|^2)\bigr)^{1 - y_{ij}}.
\]

We still consider a constrained maximum likelihood estimator,

\[
(\hat\theta_1^\Omega, ..., \hat\theta_N^\Omega, \hat a_1^\Omega, ..., \hat a_J^\Omega) = \mathop{\arg\min}_{\theta_1, ..., \theta_N, a_1, ..., a_J \in \mathbb{R}^{K^+}} \; -\log L_\Omega(\theta_1, ..., \theta_N, a_1, ..., a_J)
\]
\[
\text{s.t. } \|\theta_i\| \leq M, \; \|a_j\| \leq M, \; i = 1, ..., N, \; j = 1, ..., J. \tag{10}
\]

Let $\hat D_{N,J}^\Omega$ denote the partial distance matrix for $\hat\theta_1^\Omega, ..., \hat\theta_N^\Omega, \hat a_1^\Omega, ..., \hat a_J^\Omega$. Proposition 20 presents a missing-data version of Proposition 11. It implies that we can still recover the partial distance matrix if $n$ is large enough.

Proposition 20 Suppose that A0, A4 and A5 are satisfied and $K^+ \geq K$. Then there exist constants $C_1$ and $C_2$ independent of $N$ and $J$ such that

\[
\frac{1}{NJ} \|\hat D_{N,J}^\Omega - D^*_{N,J}\|_F^2 \leq C_1 M^2 L_{4M^2} \beta_{4M^2} \sqrt{\frac{N + J}{n}\left(1 + \frac{NJ \log(NJ)}{n(N + J)}\right)} \tag{11}
\]

with probability at least $1 - C_2/(N + J)$.

Remark 21 If $n > (N + J)\log(NJ)$, then the right side of (11) goes to 0 as $N$ and $J$ grow to infinity, which means $\|\hat D_{N,J}^\Omega - D^*_{N,J}\|_F^2 = o_p(NJ)$.
Following the discussion in Section 3.1, $\{\hat\theta_1^\Omega, ..., \hat\theta_N^\Omega, \hat a_1^\Omega, ..., \hat a_J^\Omega\}$ then provides a consistent estimate of the ideal point configuration. This consistency result is summarized in Proposition 22, which is a missing-data version of Theorem 14 under the random design.

Proposition 22 Suppose that A0, A4 and A5 are satisfied, $K^+ \geq K$, and $n > (N + J)\log(NJ)$. Further suppose that $\theta_1^*, ..., \theta_N^*$ and $a_1^*, ..., a_J^*$ are independent and identically distributed samples from distributions $P_1$ and $P_2$, respectively, where $P_1$ and $P_2$ have positive and continuous density functions within a ball $G \subset B_0^K(M)$. Then the loss function

\[
\min_{F \in \mathcal{A}_{K^+}} \frac{\sum_{i=1}^N \|\theta_i^+ - F(\hat\theta_i^\Omega)\|^2 + \sum_{j=1}^J \|a_j^+ - F(\hat a_j^\Omega)\|^2}{N + J}
\]

goes to 0 in probability as $N$ and $J$ grow to infinity.

4. Simulation Studies

In what follows, simulation studies are conducted to verify our theoretical results. Specifically, we consider a random design where the true ideal points are generated from distributions. All the analyses in this section, as well as those in Section 5, are based on our implementation of Algorithm 1 in the statistical software R.¹

1. An R package has been developed and can be downloaded from https://github.com/hrzhang16/mmdu.

4.1 Study I

Setting. We first consider a setting where $K^+$ is chosen to be exactly $K$. We consider MDU in a two-dimensional latent space, i.e., $K = K^+ = 2$. Diverging sequences of $J$ and $N$ are considered, by letting $J = 200, 400, ..., 1000$ and $N = 20J$. For given $N$ and $J$, 100 independent datasets are generated. For each dataset, we first sample the $\theta_i^*$s and $a_j^*$s uniformly from $B_0^2(1)$, the ball in $\mathbb{R}^2$ with center $0$ and radius 1. Then, given the ideal points, response data $Y_{ij}$ are generated under the link function $f(x) = 2/(1 + \exp(x + 0.1))$. It can be easily verified that condition A4 is satisfied for this link function. For each dataset, we obtain an estimate of the ideal points by applying Algorithm 1 ten times with random starting points and then choosing the result that gives the largest likelihood function value. The use of multiple starting points substantially reduces the risk of the algorithm converging to bad local minima. In the application of Algorithm 1, the constraint constant $M$ is set to 1.5.

|        | J = 200 | J = 400 | J = 600 | J = 800 | J = 1000 |
|--------|---------|---------|---------|---------|----------|
| 25%    | 0.0630  | 0.0323  | 0.0218  | 0.0164  | 0.0131   |
| median | 0.0647  | 0.0328  | 0.0222  | 0.0167  | 0.0134   |
| 75%    | 0.0666  | 0.0336  | 0.0227  | 0.0170  | 0.0135   |

Table 1: Simulation Study I: The average squared Frobenius loss of partial distance when $J$ increases from 200 to 1000. For each $J$, the table shows the 25%, 50% and 75% quantiles of the loss based on 100 independent experiments.

|        | J = 200 | J = 400 | J = 600 | J = 800 | J = 1000 |
|--------|---------|---------|---------|---------|----------|
| 25%    | 0.0158  | 0.0079  | 0.0053  | 0.0040  | 0.0032   |
| median | 0.0160  | 0.0080  | 0.0053  | 0.0040  | 0.0032   |
| 75%    | 0.0162  | 0.0080  | 0.0054  | 0.0040  | 0.0032   |

Table 2: Simulation Study I: The average loss for configuration recovery when $J$ increases from 200 to 1000. For each $J$, the table shows the 25%, 50% and 75% quantiles of the loss based on 100 independent experiments.

Results. We first check the obtained likelihood function values for the 100 datasets. As pointed out in Remark 19, Proposition 11 and Theorems 12 and 14 still hold as long as the estimate satisfies (9), even if the global solution to the optimization (7) is not obtained. It is found that, using ten random starting points, the likelihood function at the estimated parameters is always larger than that at the true parameters for all 100 datasets.
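This check against (9) is straightforward to script. A sketch that reuses the hypothetical `fit_mdu` from Section 3 and the simulated objects `Y`, `theta`, and `a` from the sketch in Section 2 (all our own illustrations):

```python
import numpy as np

def loglik(theta, a, Y, delta=0.1):
    """Joint log-likelihood under the link f(x) = 2/(1 + exp(x + delta))."""
    D = ((theta[:, None, :] - a[None, :, :]) ** 2).sum(2)
    f = 2.0 / (1.0 + np.exp(D + delta))
    return np.sum(Y * np.log(f) + (1 - Y) * np.log(1 - f))

# Ten random starts; keep the fit with the largest likelihood
fits = [fit_mdu(Y, K=2, M=1.5, seed=s) for s in range(10)]
theta_hat, a_hat = max(fits, key=lambda p: loglik(*p, Y))

# Criterion (9): the theory applies whenever this comparison holds
assert loglik(theta_hat, a_hat, Y) >= loglik(theta, a, Y)
```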
We then present the average squared Frobenius loss for the recovery of the partial distance matrix, $\|\hat D_{N,J} - D^*_{N,J}\|_F^2/(NJ)$. These results are given in Table 1, which presents the 25%, 50%, and 75% quantiles of the loss based on the 100 datasets. From this table, we see that the loss tends to decrease as the sample size increases, supporting the result of Proposition 11.

Table 2 presents the results on loss (3) for configuration recovery, where the best isometry mapping $F$ in (3) is obtained by solving an optimization problem given the true and estimated ideal points. Similar to the results on partial distance matrix recovery, the loss (3) also decreases towards 0 as $J$ grows large, which is consistent with the result of Theorem 14.

Finally, the computation time on a standard desktop machine² for solving (7) is shown in Table 3. It is worth pointing out that, since the updates of the person and item parameters in each iteration of Algorithm 1 can be run in parallel, the computation can be sped up substantially by parallel computing.

2. All the computation is conducted on a single Intel Gold 6130 core.

|        | J = 200 | J = 400 | J = 600 | J = 800 | J = 1000 |
|--------|---------|---------|---------|---------|----------|
| 25%    | 98.2    | 109.6   | 144.1   | 191.5   | 254.6    |
| median | 113.8   | 120.1   | 156.0   | 201.5   | 272.6    |
| 75%    | 128.9   | 138.0   | 176.9   | 213.9   | 286.7    |

Table 3: Simulation Study I: The computation time of optimization (7) when $J$ increases from 200 to 1000. For each $J$, the 25%, 50% and 75% quantiles of the computation time from 100 independent experiments are shown.

4.2 Study II

Setting. We now consider a setting where $K^+ > K$. We take the same setting as in Study I, except that we set $K^+ = 3$ when fitting the MDU model. As in Study I, for each pair of $N$ and $J$, 100 independent datasets are generated. For each dataset, Algorithm 1 is applied similarly, using 10 random starting points and constraint constant $M = 1.5$.

Results. The results are given in Tables 4 through 6. Similar to Tables 1-3, these three tables show the results on partial distance matrix recovery, configuration recovery, and computation time, respectively. Compared with the results of Study I, the losses for the recovery of both the partial distance matrix and the configuration tend to be larger. This is due to the overfitting brought by adding unnecessary parameters to the model. The computation time also increases compared with that of Study I.

|        | J = 200 | J = 400 | J = 600 | J = 800 | J = 1000 |
|--------|---------|---------|---------|---------|----------|
| 25%    | 0.0734  | 0.0384  | 0.0261  | 0.0198  | 0.0159   |
| median | 0.0758  | 0.0390  | 0.0265  | 0.0200  | 0.0161   |
| 75%    | 0.0780  | 0.0398  | 0.0269  | 0.0204  | 0.0163   |

Table 4: Simulation Study II: The average squared Frobenius loss of partial distance when $J$ increases from 200 to 1000. For each $J$, the table shows the 25%, 50% and 75% quantiles of the loss based on 100 independent experiments.

|        | J = 200 | J = 400 | J = 600 | J = 800 | J = 1000 |
|--------|---------|---------|---------|---------|----------|
| 25%    | 0.0853  | 0.0568  | 0.0452  | 0.0386  | 0.0343   |
| median | 0.0862  | 0.0573  | 0.0455  | 0.0390  | 0.0345   |
| 75%    | 0.0877  | 0.0580  | 0.0459  | 0.0392  | 0.0346   |

Table 5: Simulation Study II: The average loss for configuration recovery when $J$ increases from 200 to 1000. For each $J$, the table shows the 25%, 50% and 75% quantiles of the loss based on 100 independent experiments.

|        | J = 200 | J = 400 | J = 600 | J = 800 | J = 1000 |
|--------|---------|---------|---------|---------|----------|
| 25%    | 106.3   | 264.0   | 639.3   | 1294.0  | 2302.5   |
| median | 110.0   | 286.5   | 698.8   | 1407.7  | 2480.8   |
| 75%    | 112.9   | 308.9   | 793.8   | 1551.0  | 2841.1   |

Table 6: Simulation Study II: The computation time of optimization (7) when $J$ increases from 200 to 1000. For each $J$, the 25%, 50% and 75% quantiles of the computation time from 100 independent experiments are shown.
5. Real Examples

5.1 Example I: Movie Data

Background. We apply MDU to a movie rating dataset from the well-known MovieLens project (see e.g., Harper and Konstan, 2016). The dataset analyzed in this paper is a subset of a benchmark MovieLens dataset collected during a seven-month period from September 1997 through April 1998.³ This subset contains 943 users and 338 movies, obtained by selecting movies that have been rated by at least 100 users. Unlike many analyses of MovieLens data that focus on the rating scores, we consider unfolding the rating behavior itself (i.e., rated/not rated), which may also reveal the users' preference patterns. More precisely, we let $Y_{ij} = 1$ if movie $j$ has been rated by user $i$ and $Y_{ij} = 0$ otherwise.

3. The dataset can be downloaded from https://grouplens.org/datasets/movielens/100k/

Analysis. For visualization purposes, we unfold the data onto a two-dimensional space. To apply the MDU model introduced in this paper, we need to specify the link function $f$. We assume $f$ takes the logistic form $f(x) = 2/(1 + \exp(x + \delta))$, where $\delta$ is a pre-specified small positive constant. For any $\delta > 0$, it is easy to check that the regularity condition A4 is satisfied. The results presented below are based on the choice $\delta = 0.1$, but we point out that other choices of $\delta$ ($\delta = 0.05, 0.15, 0.2$) have also been tried, all leading to very similar results. The constraint constant $M$ is set to 3.5 when applying Algorithm 1. After obtaining the estimate, we transform the estimated ideal points by an isometry mapping, so that the x-axis corresponds to the dimension along which the estimated movie ideal points have the highest variance. As will be described in the sequel, under this isometry mapping of the estimated ideal points, both the x- and y-axes receive good interpretations.

Results. The results from the MDU analysis are presented in Figures 4 through 6. Figure 4 jointly visualizes the estimated movie and user points. As we can see, the movies and the users tend to form two giant clusters that only slightly overlap.

Figure 4: Analysis of movie rating data: Simultaneous visualization of the estimated movie and user points.

We first investigate the movie points. First, the y-axis of the space largely indicates, if not perfectly, the popularity of the movies. Movies with a smaller $\hat a_{j2}$ value tend to be rated more frequently. Roughly speaking, the shorter the average distance from a movie to the user points, the more often the movie is rated. In fact, the Kendall's tau rank correlation between the $\hat a_{j2}$s and the numbers of ratings received by the movies is 0.66. This phenomenon is further reflected by panel (a) of Figure 5, where movies are stratified by the numbers of ratings they received into four categories. These four categories tend to be ordered along the y-axis. We list four movies as examples, as indicated in panel (a) of Figure 5. From the top to the bottom, they are Batman Forever (1995), GoldenEye (1995), Get Shorty (1995) and The Godfather (1972), respectively. Based on our interpretation of the y-axis, these four movies are ordered from the least popular to the most popular.

Second, the x-axis of the space seems to indicate the release time of the movies. The Kendall's tau correlation between the $\hat a_{j1}$s and the release dates of the movies is -0.70.
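Rank correlations like these are easy to reproduce with SciPy; a sketch with hypothetical stand-in data (in an actual analysis, `a_hat` would hold the estimated movie points and `n_ratings` the per-movie rating counts):

```python
import numpy as np
from scipy.stats import kendalltau

# Hypothetical stand-ins for the estimated movie points and rating counts
rng = np.random.default_rng(0)
a_hat = rng.normal(size=(338, 2))
n_ratings = rng.integers(100, 600, size=338)

tau, _ = kendalltau(a_hat[:, 1], n_ratings)   # y-coordinate vs. popularity
print(f"Kendall's tau = {tau:.2f}")
```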
This is further shown in panel (b) of Figure 5, where the movies are stratified into three categories, namely "before 1995", "1995-1996", and "1997-1998". According to this figure, the clustering pattern of the movies can be largely explained by the three categories based on the movie release dates. From the right to the left of the space, the points correspond to movies from the relatively older ones to the relatively more recent ones. For example, the three movies indicated in panel (b) of Figure 5 are, from right to left, Citizen Kane (1941), Twelve Monkeys (1995) and The Devil's Own (1997), respectively.

Figure 5: Analysis of movie rating data. Panel (a): Visualization of movie points, with movies stratified into four equal-size categories based on the numbers of ratings. Movies with at most 127, 128-169, 170-229, and at least 230 ratings are indicated by black, red, green and blue points, respectively. Yellow points represent example movies. Panel (b): Visualization of movie points, with movies stratified into three categories based on their release time. Movies released in 1997-1998, 1995-1996, and before 1995 are indicated by green, red and black points, respectively. Purple points represent example movies.

The interpretation of the latent space based on the movies facilitates the interpretation of the user points. First, the y-axis corresponds to the users' activeness. Roughly speaking, the shorter the average distance from a user point to the movie points, the more active the user is. The Kendall's tau rank correlation between the $\hat\theta_{i2}$s and the numbers of ratings given by the users is 0.73. This is further shown via Figure 6, where users are classified into four equal-size groups depending on the number of movies they rated. These groups of users, from the most active one to the least active one, lie from the top to the bottom. Second, based on the alignment of movies along the x-axis, the user points from right to left may be interpreted as ranging from the ones who tend to more frequently rate relatively older movies to the ones who tend to more frequently rate relatively more recent ones.

Figure 6: Analysis of movie rating data: Visualization of user points, with users classified into four equal-size categories based on the numbers of ratings. Users who rated at most 24, 25-47, 48-103, and at least 104 movies are indicated by black, red, green and blue points, respectively.

5.2 Example II: Senate Roll Call Voting Data

Background. We now analyze a senate roll call voting dataset from the 108th congress. This dataset contains the voting records of 100 senators on 675 roll calls in the years 2003 and 2004.⁴ Among the 100 senators, there are 48 from the Democratic party, 51 from the Republican party, and one independent politician. For each roll call $j$, the vote of senator $i$ is recorded in three ways, "Yea", "Nay" and "Not Voting", treated as $Y_{ij} = 1$, $Y_{ij} = 0$, and missing, respectively.

4. The dataset can be downloaded from https://legacy.voteview.com/dwnl.htm.

Analysis. A similar analysis to the previous one is conducted. Specifically, we unfold the data into a two-dimensional space. The same link function $f$ and constraint constant $M$ are adopted as in the analysis of the movie data. After obtaining the estimate, we transform the estimated ideal points by an isometry mapping, so that the x-axis corresponds to the dimension along which the estimated senator ideal points have the highest variance.
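The post-hoc isometry used in both analyses (rotating so that the x-axis is the highest-variance direction of one set of points) can be computed from an SVD. A sketch (our own illustration):

```python
import numpy as np

def rotate_to_principal_axes(all_pts, ref_pts):
    """Center at the mean of the reference set and rotate so that the x-axis
    is the highest-variance direction of ref_pts. Centering followed by an
    orthogonal map is an isometry, so all distances are unchanged."""
    mu = ref_pts.mean(0)
    _, _, Vt = np.linalg.svd(ref_pts - mu, full_matrices=False)
    return (all_pts - mu) @ Vt.T

# e.g., rotate senators and roll calls using the senator points as reference:
# pts = rotate_to_principal_axes(np.vstack([theta_hat, a_hat]), theta_hat)
```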
Results. The results are presented in Figures 7 through 9. In Figure 7, the ideal points of both roll calls and senators are visualized simultaneously. As we can see, most of the roll calls and all the senators tend to lie around a one-dimensional line. This visualization is still valid, in the sense that even when the true latent dimension is one, unfolding the data in a two-dimensional space is still consistent according to Theorem 14. This phenomenon of degeneration is quite consistent with the overall unidimensional pattern in congressional voting data throughout history. It has been well recognized in the political science literature (Poole et al., 1991; Poole and Rosenthal, 1991) that senate voting behavior is essentially unidimensional, though slightly different latent space models are used in that literature. For example, Poole et al. (1991) concluded that to the extent that congressional voting can be described by a spatial model, a unidimensional model is largely (albeit not entirely) sufficient.

Figure 7: Analysis of senator roll call data: Simultaneous visualization of the estimated senator and roll call ideal points.

We first interpret the senators. In Figure 8, all the senator points are visualized with their party membership indicated by different point types. In Table 7, we rank the senators based on their values of $\hat\theta_{i1}$, i.e., their positions along the x-axis. According to this table, the Democrats tend to lie on the left and the Republicans tend to lie on the right. In fact, this ranking is largely consistent with National Journal's liberalness ranking of the senators in 2003. National Journal's ranking result, which is replicated in Clinton et al. (2004b), is obtained by unfolding the senators' votes on 62 key roll calls using a model given in Clinton et al. (2004a). The Kendall's tau rank correlation between the result in Table 7 and that given by National Journal is 0.79. In fact, Senator John Kerry is ranked the most liberal by both our model and National Journal, and Senator Craig L. Thomas, who is the most conservative senator according to the ranking of National Journal, is the third most conservative senator according to our model.

From Figure 8 and Table 7, it is also worth noting that there is a Democrat whose estimated ideal point is mixed together with those of the Republicans. This senator is Zell Miller from the state of Georgia. He is a conservative Democrat and, in fact, he supported Republican President George W. Bush against the Democratic nominee John Kerry in the 2004 presidential election. In this congress, there is an independent senator, Jim Jeffords from the state of Vermont, who does not belong to either of the two major parties. As we can see from both Figure 8 and Table 7, his ideal point lies on the left, mixed with many ideal points of the Democrats. This is also consistent with Senator Jim Jeffords' political standing. In fact, he left the Republican party to become an independent in 2001 and began caucusing with the Democrats.
| Rank | Name | State | Rank | Name | State | Rank | Name | State |
|---|---|---|---|---|---|---|---|---|
| 1 | Kerry | D-MA | 35 | Johnson | D-SD | 69 | Grassley | R-IO |
| 2 | Sarbanes | D-MD | 36 | Lieberman | D-CT | 70 | Bond | R-MO |
| 3 | Reed | D-RH | 37 | Bingaman | D-NM | 71 | Roberts | R-KA |
| 4 | Harkin | D-IO | 38 | Nelson | D-FL | 72 | Gregg | R-NH |
| 5 | Graham | D-FL | 39 | Dorgan | D-ND | 73 | Allen | R-VI |
| 6 | Lautenberg | D-NJ | 40 | Conrad | D-ND | 74 | Domenici | R-NM |
| 7 | Edwards | D-NC | 41 | Carper | D-DE | 75 | Bennett | R-UT |
| 8 | Kennedy | D-MA | 42 | Pryor | D-AR | 76 | Dole | R-NC |
| 9 | Durbin | D-IL | 43 | Bayh | D-IN | 77 | Frist | R-TN |
| 10 | Levin | D-MI | 44 | Lincoln | D-AR | 78 | Brownback | R-KA |
| 11 | Akaka | D-HA | 45 | Landrieu | D-LO | 79 | Hatch | R-UT |
| 12 | Byrd | D-WE | 46 | Baucus | D-MT | 80 | Cochran | R-MS |
| 13 | Boxer | D-CA | 47 | Breaux | D-LO | 81 | Graham | R-SC |
| 14 | Corzine | D-NJ | 48 | Nelson | D-NE | 82 | Alexander | R-TN |
| 15 | Clinton | D-NY | 49 | Chafee | R-RH | 83 | Lott | R-MS |
| 16 | Leahy | D-VE | 50 | Snowe | R-ME | 84 | Chambliss | R-GE |
| 17 | Dodd | D-CT | 51 | Collins | R-ME | 85 | Burns | R-MT |
| 18 | Stabenow | D-MI | 52 | Specter | R-PE | 86 | Bunning | R-KE |
| 19 | Mikulski | D-MD | 53 | Mccain | R-AZ | 87 | Crapo | R-ID |
| 20 | Feingold | D-WI | 54 | Dewine | R-OH | 88 | Mcconnell | R-KE |
| 21 | Rockefeller | D-WE | 55 | Campbell | R-CO | 89 | Ensign | R-NV |
| 22 | Hollings | D-SC | 56 | Smith | R-OR | 90 | Cornyn | R-TX |
| 23 | Kohl | D-WI | 57 | Coleman | R-MN | 91 | Sununu | R-NH |
| 24 | Inouye | D-HA | 58 | Warner | R-VI | 92 | Santorum | R-PE |
| 25 | Schumer | D-NY | 59 | Murkowski | R-AK | 93 | Craig | R-ID |
| 26 | Cantwell | D-WA | 60 | Voinovich | R-OH | 94 | Inhofe | R-OK |
| 27 | Dayton | D-MN | 61 | Hutchison | R-TX | 95 | Allard | R-CO |
| 28 | Murray | D-WA | 62 | Lugar | R-IN | 96 | Enzi | R-WY |
| 29 | Wyden | D-OR | 63 | Miller | D-GE | 97 | Sessions | R-AL |
| 30 | Daschle | D-SD | 64 | Fitzgerald | R-IL | 98 | Thomas | R-WY |
| 31 | Biden | D-DE | 65 | Talent | R-MO | 99 | Kyl | R-AZ |
| 32 | Feinstein | D-CA | 66 | Hagel | R-NE | 100 | Nickles | R-OK |
| 33 | Jeffords | I-VE | 67 | Stevens | R-AK | | | |
| 34 | Reid | D-NV | 68 | Shelby | R-AL | | | |

Table 7: Analysis of senator roll call data: Ranking of senators based on $\hat\theta_{i1}$.

Figure 8: Analysis of senator roll call data: Visualization of senator points, where senators are classified by their party membership. Specifically, the Democrats, the Republicans, and an independent politician are indicated by blue, red, and green, respectively.

We now investigate the roll calls. The value of $\hat a_{j1}$, i.e., a roll call's coordinate on the x-axis, seems to represent the roll call's liberalness-conservativeness. The more liberal roll calls lie on the left and the more conservative ones lie on the right. This interpretation is further confirmed by the voting records for the roll calls. In particular, for each roll call, we calculate the proportion of Republicans among the senators who voted "Yea". A larger value of this proportion indicates that the roll call is more conservative. As we can see from panel (a) of Figure 9, for roll calls from the left to the right, this proportion increases. In fact, the Kendall's tau rank correlation between the $\hat a_{j1}$s and the proportions of "Yea" votes from Republicans is as high as 0.88.

We present the contents of three roll calls as representative examples. As indicated in panel (a) of Figure 9, these roll calls have substantially different coordinates along the x-axis. From left to right, they are (1) "To improve the availability of contraceptives for women", (2) "Confirmation Thomas J. Ridge, of Pennsylvania, to be Secretary of Homeland Security", and (3) "To provide financial security to family farm and small business owners by ending the unfair practice of taxing someone at death".

Although most of the roll calls lie near the x-axis (i.e., $\hat a_{j2} \approx 0$), there are still quite a few roll calls that spread out along the y-axis.
It seems that the voting on such roll calls is heterogeneous within both parties. Specifically, we measure the heterogeneity of voting within each party by a cross entropy measure, defined as

\[
\mathrm{CE}_j^{(i)} = -p_{jy}^{(i)} \log p_{jy}^{(i)} - p_{jn}^{(i)} \log p_{jn}^{(i)} - p_{jm}^{(i)} \log p_{jm}^{(i)},
\]

where $i = 1, 2$ indicates Democrat and Republican, respectively, and $p_{jy}^{(i)}$, $p_{jn}^{(i)}$, and $p_{jm}^{(i)}$ denote the proportions of "Yea", "Nay", and "Not Voting" within the party for the $j$th roll call. Cross entropy is a commonly used measure of heterogeneity (Chapter 9, Friedman et al., 2001). The larger the cross entropy, the more heterogeneous the voting behavior within a party. In panel (b) of Figure 9, we present the box plots of $\min\{\mathrm{CE}_j^{(1)}, \mathrm{CE}_j^{(2)}\}$, for roll calls lying near the x-axis ($|\hat a_{j2}| \leq 0.05$) and for those spreading out along the y-axis ($|\hat a_{j2}| > 0.05$). According to panel (b) of Figure 9, the roll calls in the latter group ($|\hat a_{j2}| > 0.05$) tend to have a larger value of $\min\{\mathrm{CE}_j^{(1)}, \mathrm{CE}_j^{(2)}\}$, implying that the voting tends to be more heterogeneous within both parties for these roll calls. The latter group contains roll calls such as "To provide for the distribution of funds under the infrastructure performance and maintenance program", "To enhance the role of Congress in the oversight of the intelligence and intelligence-related activities of the United States Government", and "To strike provisions relating to energy tax incentives". Many of such roll calls may be explained by constituency-specific factors.

Figure 9: Analysis of senator roll call data. Panel (a): Visualization of roll call points, where roll calls are classified by the proportion of "Yea" votes from Republicans. Specifically, roll calls with proportions less than 0.068, 0.068-0.52, 0.52-0.73, and larger than 0.73 are indicated by black, red, green and blue points, respectively. The yellow solid points are example roll calls discussed in the text. Panel (b): Box plots of $\min\{\mathrm{CE}_j^{(1)}, \mathrm{CE}_j^{(2)}\}$, for roll calls lying near the x-axis ($|\hat a_{j2}| \leq 0.05$) on the left and for those spreading out along the y-axis ($|\hat a_{j2}| > 0.05$) on the right.
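The within-party heterogeneity measure takes only a few lines of code. A sketch (our own, with a hypothetical vote encoding):

```python
import numpy as np

def cross_entropy(votes):
    """Within-party heterogeneity CE for one roll call. `votes` holds one
    entry per party member: 'y' (Yea), 'n' (Nay), or 'm' (Not Voting)."""
    votes = np.asarray(votes)
    props = np.array([(votes == c).mean() for c in ("y", "n", "m")])
    props = props[props > 0]              # treat 0 * log(0) as 0
    return -np.sum(props * np.log(props))

# The quantity plotted in panel (b) of Figure 9 for roll call j:
# min(cross_entropy(dem_votes_j), cross_entropy(rep_votes_j))
```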
6. Concluding Remarks

In this paper, we provide a statistical framework for studying unfolding-model-based visualization. An estimator, together with an algorithm for its computation, is proposed, whose performance is examined by simulation studies. Under reasonable conditions, we provide asymptotic results for the recovery of the ideal-point configuration. The proposed method is applied to two datasets, one on movie rating and the other on senator voting, for which interpretable results are obtained.

The ideal points obtained from the proposed method can be used in further analysis. For example, one can use the estimated person points as covariates in regression analysis. For another example, one may further conduct cluster analysis on the respondents and items, for example, by applying the K-means algorithm (MacQueen, 1967). In fact, as discussed in the supplementary material, there is a connection between our unfolding model and the stochastic co-blockmodel (Choi and Wolfe, 2014; Rohe et al., 2016) for bi-cluster analysis. When data follow a stochastic co-blockmodel, our consistency result for the unfolding model further guarantees the consistency of bi-cluster analysis.

The current analysis may be extended along multiple directions. First, the current analysis keeps the latent dimension K fixed. In fact, the theoretical results established in this paper can be generalized to a setting where K also diverges, a more appropriate setting for data of a very large scale. Second, it is possible to make statistical inference about the person and item ideal points, such as testing whether a person point is closer to one item point than to another. Making statistical inference under our model is closely related to statistical inference for low-rank matrix completion (see e.g., Chen et al., 2019; Xia and Yuan, 2019), but the non-linear link function in our model brings more challenges, and thus methods and theory remain to be developed. Third, although we focus on binary data in this paper, the proposed modeling framework, theory, and computational algorithm can be extended to other types of data, such as ratings and rankings. Finally, it may also be of interest to extend the current framework to the modeling and analysis of large-scale preferential choice data with informatively missing data entries.

Acknowledgement

The authors thank the editor and reviewers for their supportive and insightful comments. The authors also thank Prof. Ming Yuan for a fruitful discussion and Prof. Abdo Alfakih for suggesting references on Euclidean distance matrix completion. We would like to acknowledge support for this project from the National Academy of Education/Spencer Postdoctoral Fellowship and NSF grants DMS-2015417, SES-1826540, and IIS-1633360.

Appendix A. Bi-Cluster Analysis

The applications of multidimensional scaling, including multidimensional unfolding as a special case, are often followed by cluster analysis (e.g., Kruskal and Wish, 1978; Borg and Groenen, 2005) for better understanding and interpretation of the data visualization. In our context, it is often of interest to cluster the respondents and the items, respectively. This task is known as bi-clustering or co-clustering (Hartigan, 1972; Dhillon, 2001), which is often studied statistically under the stochastic co-blockmodel (Choi and Wolfe, 2014; Rohe et al., 2016), an extension of the widely used stochastic blockmodel (Holland et al., 1983).

Following multidimensional unfolding, it is natural to bi-cluster the respondents and the items based on the estimated ideal points, using the Euclidean distance as a natural measure of dissimilarity. In particular, we use the K-means algorithm (MacQueen, 1967) to cluster the respondents and the items into $k_1$ and $k_2$ clusters, respectively, for some pre-specified numbers of clusters $k_1$ and $k_2$. This two-step procedure for bi-cluster analysis is described in Algorithm 2, with a small computational sketch given after it.

Algorithm 2 (Two-step procedure for bi-cluster analysis)
Step 1: Apply Algorithm 1 and obtain estimates $\{\hat\theta_1, ..., \hat\theta_N, \hat a_1, ..., \hat a_J\}$.
Step 2: Apply the K-means algorithm to $\{\hat\theta_1, ..., \hat\theta_N\}$ and $\{\hat a_1, ..., \hat a_J\}$ with $k_1$ and $k_2$ clusters, respectively.
Output: The cluster memberships of respondents, $\hat\vartheta_i \in \{1, ..., k_1\}$, and of items, $\hat\upsilon_j \in \{1, ..., k_2\}$ ($i = 1, ..., N$; $j = 1, ..., J$).
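A minimal sketch of Step 2 of Algorithm 2, assuming the estimated ideal points from Algorithm 1 are already available as arrays `theta_hat` ($N \times K$) and `a_hat` ($J \times K$); the variable names and the use of scikit-learn are our choices, not the paper's.

```python
import numpy as np
from sklearn.cluster import KMeans

def bicluster(theta_hat, a_hat, k1, k2, seed=0):
    # K-means on respondents and on items separately, using the Euclidean
    # distance between estimated ideal points as the dissimilarity.
    resp = KMeans(n_clusters=k1, n_init=10, random_state=seed).fit(theta_hat)
    item = KMeans(n_clusters=k2, n_init=10, random_state=seed).fit(a_hat)
    return resp.labels_, item.labels_   # memberships in {0,...,k1-1} / {0,...,k2-1}

rng = np.random.default_rng(0)
theta_hat = rng.normal(size=(200, 2))   # placeholder for Step 1 output
a_hat = rng.normal(size=(80, 2))
resp_labels, item_labels = bicluster(theta_hat, a_hat, k1=3, k2=4)
```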
We provide a connection between the multidimensional unfolding model studied in this paper and the stochastic co-blockmodel. Consider a special case under the multidimensional unfolding model, where there are finitely many possible locations for the respondent ideal points and also for the item ideal points, independent of N and J. We denote the possible locations for the respondent ideal points as $\{b_1^*, ..., b_{k_1}^*\}$ and those for the item ideal points as $\{c_1^*, ..., c_{k_2}^*\}$. Under this setting, there exist $k_1$ respondent latent classes and $k_2$ item latent classes, regarding two respondents/items as from the same latent class when they have the same location. We denote by $\vartheta_i^* \in \{1, ..., k_1\}$ and $\upsilon_j^* \in \{1, ..., k_2\}$ the true latent class memberships of respondent $i$ and item $j$, respectively. In this sense, the model becomes a stochastic co-blockmodel, for which the distribution of $Y_{ij}$ is determined only by the latent class memberships of respondent $i$ and item $j$, and the $Y_{ij}$'s are conditionally independent given all the latent memberships of the respondents and items. In what follows, we show that the proportions of misclassified respondents and items converge to 0 in probability, when both N and J grow to infinity, if the K-means algorithms in Algorithm 2 have converged to the global optima.

Theorem 23 Suppose A0, A3, and A4 are satisfied, and $K^+ \ge K$. Further suppose the multidimensional unfolding model degenerates to a stochastic co-blockmodel, satisfying $\theta_i^* \in \{b_1^*, ..., b_{k_1}^*\}$ and $a_j^* \in \{c_1^*, ..., c_{k_2}^*\}$. If both K-means algorithms in Algorithm 2 converge to the global optima, then the clustering result satisfies

$$\min\left\{\max_{\zeta \in B_{k_1}} \frac{\sum_{i=1}^N 1_{\{\hat\vartheta_i = \zeta(\vartheta_i^*)\}}}{N},\ \max_{\zeta \in B_{k_2}} \frac{\sum_{j=1}^J 1_{\{\hat\upsilon_j = \zeta(\upsilon_j^*)\}}}{J}\right\} \to 1$$

in probability as both N and J grow to infinity, where $B_k$ denotes the set of all permutations on $\{1, ..., k\}$, for $k = k_1, k_2$.

Remark 24 To handle the label-switching indeterminacy in clustering, in the loss function (12) we find permutations that best match the true latent class memberships and their estimates, for both the respondents and the items.

Appendix B. Proof of Theoretical Results

B.1 Definitions and Notations

In this appendix, we use $c, C, C_1, C_2$ to represent constants that do not depend on $N, J$ and whose values may vary with the context. With a slight abuse of notation, we use $A_{N,J}$ to denote the specified events, which may differ across proofs. For $x \in \mathbb{R}^K$, we use $B_x^K(C)$ to denote the closed ball in $\mathbb{R}^K$ centered at $x$ with radius $C$. Unless otherwise specified, all balls in the appendix are assumed to be closed. For a set $G \subset \mathbb{R}^K$, let $\mathrm{int}(G)$ denote the set of all its interior points. For a positive integer $n$, we denote $[n] := \{1, ..., n\}$. We start with some notions that will be used in the proofs of the theorems, propositions, and lemmas.

Definition 25 For points $x_i, x_i' \in \mathbb{R}^K$, $i = 1, ..., n$, we write $(x_1, ..., x_n) \sim (x_1', ..., x_n')$ if there exists an isometry $F \in A_K$ such that $x_i' = F(x_i)$, $i = 1, ..., n$.

Remark 26 It is easy to show that $\sim$ is an equivalence relation.

Definition 27 (Configuration) We define an $n$-point configuration as an equivalence class. That is, we define a configuration $[x_1, ..., x_n] := \{(x_1', ..., x_n') : (x_1', ..., x_n') \sim (x_1, ..., x_n)\}$ as the equivalence class of $(x_1, ..., x_n)$.

Remark 28 By the property of isometry mappings, it is easy to see that all the elements in the same configuration have the same distance matrix.

We now consider the space of all $n$-point configurations in $\mathbb{R}^K$, denoted by $H_{n,K} := \{[x_1, ..., x_n] : x_i \in \mathbb{R}^K, i = 1, ..., n\}$. For two configurations $\tau_1 = [x_1, ..., x_n], \tau_2 = [y_1, ..., y_n] \in H_{n,K}$, we define

$$d(\tau_1, \tau_2) := \inf_{F \in A_K} \Big(\sum_{1 \le i \le n} \|F(x_i) - y_i\|^2\Big)^{1/2}.$$

First, we note that $d(\cdot, \cdot)$ is a well-defined mapping from $H_{n,K} \times H_{n,K}$ to $\mathbb{R}$. That is, for any $(x_1', ..., x_n') \in [x_1, ..., x_n]$ and $(y_1', ..., y_n') \in [y_1, ..., y_n]$,

$$\inf_{F \in A_K} \Big(\sum_{1 \le i \le n} \|F(x_i) - y_i\|^2\Big)^{1/2} = \inf_{F \in A_K} \Big(\sum_{1 \le i \le n} \|F(x_i') - y_i'\|^2\Big)^{1/2}.$$
Second, we notice that $d(\cdot, \cdot)$ is a metric on $H_{n,K}$, as summarized in Lemma 29 below.

Lemma 29 $d(\cdot, \cdot)$ is a metric on $H_{n,K}$.

Remark 30 For $[x_1, ..., x_n] \in H_{n,K}$, we have $[(x_1^\top, 0)^\top, ..., (x_n^\top, 0)^\top] \in H_{n,K+1}$, in which sense we can say $H_{n,K} \subset H_{n,K+1}$. Thus $H_{n,K_1} \subset H_{n,K_2}$ if $K_1 \le K_2$. For $\tau_1 = [x_1, ..., x_n] \in H_{n,K_1}$ and $\tau_2 = [y_1, ..., y_n] \in H_{n,K_2}$, the distance $d(\tau_1, \tau_2)$ is defined in the same way by viewing both $\tau_1$ and $\tau_2$ as elements of $H_{n,\max\{K_1,K_2\}}$.

We further denote by $P_{a,b,K}$ the set of $a \times b$ partial distance matrices of configurations in $\mathbb{R}^K$:

$$P_{a,b,K} := \big\{(\|x_i - y_j\|^2)_{a \times b} : [x_1, ..., x_a, y_1, ..., y_b] \in H_{a+b,K}\big\}.$$

It is easy to check that $P_{a,b,K} \subset P_{a,b,K+1}$. For $A_1, ..., A_n \subset \mathbb{R}^K$, denote by $[A_1, ..., A_n]$ the subset of $H_{n,K}$:

$$[A_1, ..., A_n] := \{[x_1, ..., x_n] : x_i \in A_i, i = 1, ..., n\}.$$

For $\mathcal{A}, \mathcal{B} \subset H_{n,K}$, the distance between $\mathcal{A}$ and $\mathcal{B}$ is defined as

$$d(\mathcal{A}, \mathcal{B}) := \inf_{\tau_1 \in \mathcal{A}, \tau_2 \in \mathcal{B}} d(\tau_1, \tau_2). \quad (13)$$

We further denote $H_{n,K,C} := \{[x_1, ..., x_n] \in H_{n,K} : \|x_i\| \le C\}$, a compact subset of $H_{n,K}$, and

$$P_{a,b,K,C} := \big\{(\|x_i - y_j\|^2)_{a \times b} : [x_1, ..., x_a, y_1, ..., y_b] \in H_{a+b,K,C}\big\}, \quad (14)$$

a compact subset of $P_{a,b,K}$. We consider a mapping defined as follows:

$$\Phi_{a,b,K} : \mathbb{R}^{(a+b) \times K} \to P_{a,b,K}, \quad (x_1, ..., x_{a+b}) \mapsto D,$$

where $D$ is the $a \times b$ partial distance matrix of $\{(x_1, ..., x_a), (x_{a+1}, ..., x_{a+b})\}$. It is not difficult to check that $\Phi_{a,b,K}$ is invariant with respect to isometries. Then, for $\tau = [x_1, ..., x_{a+b}]$, we denote $\Phi_{a,b,K}(\tau) := \Phi_{a,b,K}(X)$, where $X = (x_1, ..., x_{a+b})$.

Having introduced the notions above, we give the following lemma, which is crucial to the proof of Theorem 6. It essentially shows that if a partial distance matrix $D' \in P_{k_1,k_2,K^+,M}$ approximates another partial distance matrix $D \in P_{k_1,k_2,K,M}$ whose configuration $\tau$ contains a collection of anchor points, then any configuration $\tau'$ of $D'$ also approximates $\tau$.

Lemma 31 For compact subsets $B_1, ..., B_{k_1}, C_1, ..., C_{k_2} \subset B_0^K(M)$, let $\mathcal{B} = [B_1, ..., B_{k_1}, C_1, ..., C_{k_2}]$. Suppose that for any $(x_1, ..., x_{k_1+k_2}) \in B_1 \times \cdots \times B_{k_1} \times C_1 \times \cdots \times C_{k_2}$, the sets $\{x_1, ..., x_{k_1}\}$ and $\{x_{k_1+1}, ..., x_{k_1+k_2}\}$ are a collection of anchor points in $\mathbb{R}^K$. Then, for any $\epsilon_c > 0$, there exists $\epsilon_d > 0$ such that for any $\tau' \in H_{k_1+k_2,K^+,M}$ and $\tau \in \mathcal{B}$ satisfying $\|\Phi_{k_1,k_2,K^+}(\tau') - \Phi_{k_1,k_2,K^+}(\tau)\|_F < \epsilon_d$, we have $d(\tau', \tau) < \epsilon_c$.

We end this section with the following lemma, which will also be used in the proof of Theorem 6.

Lemma 32 Suppose $\{b_1^*, ..., b_{k_1}^*\}, \{c_1^*, ..., c_{k_2}^*\} \subset B_0^K(C)$ are a collection of anchor points in $\mathbb{R}^K$. Then, for any $x \in B_0^K(C)$, the sets $\{x, b_1^*, ..., b_{k_1}^*\}, \{c_1^*, ..., c_{k_2}^*\}$ are also a collection of anchor points in $\mathbb{R}^K$.

B.2 Proof of Theorems

Proof [Proof of Theorem 6] We first show the proof of (5). For the $\epsilon$ given in condition A3, there exist a constant $p_\epsilon \in (0, 1)$ and balls of radius $\epsilon$ in $\mathbb{R}^{K^+}$, denoted by $B_1(\epsilon), ..., B_{k_1}(\epsilon), G_1(\epsilon), ..., G_{k_2}(\epsilon)$, such that for $N, J$ large enough,

$$\frac{\sum_{l=1}^N 1_{\{\theta_l^* \in B_{b_i^*}(\epsilon),\ \tilde\theta_l \in B_i(\epsilon)\}}}{N} > p_\epsilon,\ i = 1, ..., k_1, \quad \frac{\sum_{l=1}^J 1_{\{a_l^* \in B_{c_i^*}(\epsilon),\ \tilde a_l \in G_i(\epsilon)\}}}{J} > p_\epsilon,\ i = 1, ..., k_2.$$

This comes straightforwardly from condition A0 and requirement (2) of anchor points in condition A3. Note that the centers of $B_k(\epsilon)$ and $G_l(\epsilon)$ may vary with $N, J$. We also use $B_k^*(\epsilon)$ and $G_l^*(\epsilon)$ to denote $B_{b_k^*}(\epsilon)$ and $B_{c_l^*}(\epsilon)$, respectively. We first focus on the set of person points

$$I_1(\epsilon) := \bigcup_{k=1}^{k_1} \{i \in [N] : \theta_i^* \in B_k^*(\epsilon),\ \tilde\theta_i \in B_k(\epsilon)\}$$

and the set of item points

$$I_2(\epsilon) := \bigcup_{l=1}^{k_2} \{j \in [J] : a_j^* \in G_l^*(\epsilon),\ \tilde a_j \in G_l(\epsilon)\}.$$

Let $\theta_i^+ = ((\theta_i^*)^\top, 0^\top)^\top$ and $a_j^+ = ((a_j^*)^\top, 0^\top)^\top \in \mathbb{R}^{K^+}$. We will show that there exists an isometry mapping $F_{N,J} \in A_{K^+}$ under which $F_{N,J}(\tilde\theta_i) \approx \theta_i^+$ and $F_{N,J}(\tilde a_j) \approx a_j^+$, for all $i \in I_1(\epsilon)$ and $j \in I_2(\epsilon)$. This is formalized in the following lemma.
Lemma 33 For $N, J$ large enough, there exists an isometry $F_{N,J} \in A_{K^+}$ such that $\|F_{N,J}(x)\| \le 4M$ for all $x \in B_0^{K^+}(M)$, and, for all $i \in I_1(\epsilon)$ and all $j \in I_2(\epsilon)$,

$$\|F_{N,J}(\tilde\theta_i) - \theta_i^+\| \le 5\epsilon \quad \text{and} \quad \|F_{N,J}(\tilde a_j) - a_j^+\| \le 5\epsilon.$$

We then show that for most of the person points $i \notin I_1(\epsilon)$ and most of the item points $j \notin I_2(\epsilon)$, we still have $F_{N,J}(\tilde\theta_i) \approx \theta_i^+$ and $F_{N,J}(\tilde a_j) \approx a_j^+$, under the same isometry mapping $F_{N,J}$ as in Lemma 33. This is formalized in Lemma 34 below.

Lemma 34 For $N, J$ large enough, there exists a constant $\kappa > 0$ such that, for the isometry mapping $F_{N,J}$ defined in Lemma 33, the proportions

$$\lambda_{1,N,J} := \frac{\sum_{i=1}^N 1_{\{\|F_{N,J}(\tilde\theta_i) - \theta_i^+\| > \kappa\epsilon\}}}{N} \quad \text{and} \quad \lambda_{2,N,J} := \frac{\sum_{j=1}^J 1_{\{\|F_{N,J}(\tilde a_j) - a_j^+\| > \kappa\epsilon\}}}{J}$$

satisfy

$$\lambda_{k,N,J} \to 0, \quad (15)$$

for $k = 1, 2$, as $N, J$ grow to infinity.

Since, by Lemma 33, $F_{N,J}$ maps $B_0^{K^+}(M)$ into $B_0^{K^+}(4M)$, we have, for all $\tilde\theta_i$ and all $\tilde a_j$, $\|F_{N,J}(\tilde\theta_i) - \theta_i^+\| \le 5M$ and $\|F_{N,J}(\tilde a_j) - a_j^+\| \le 5M$. Combining this with Lemma 34, we have

$$\inf_{F \in A_{K^+}} \left(\frac{\sum_{i=1}^N \|F(\tilde\theta_i) - \theta_i^+\|^2}{N} + \frac{\sum_{j=1}^J \|F(\tilde a_j) - a_j^+\|^2}{J}\right) \le \frac{\sum_{i=1}^N \|F_{N,J}(\tilde\theta_i) - \theta_i^+\|^2}{N} + \frac{\sum_{j=1}^J \|F_{N,J}(\tilde a_j) - a_j^+\|^2}{J} \le 25M^2\lambda_{1,N,J} + \kappa^2\epsilon^2 + 25M^2\lambda_{2,N,J} + \kappa^2\epsilon^2 = 25M^2(\lambda_{1,N,J} + \lambda_{2,N,J}) + 2\kappa^2\epsilon^2.$$

By (15), (5) holds. Then (6) holds because $\epsilon$ can be made arbitrarily small. We complete the proof.

Proof [Proof of Theorem 12] Combining Theorem 6 and Proposition 11, we have the result.

Proof [Proof of Theorem 14] Theorem 14 is a special case of Proposition 22. See the proof of Proposition 22.

Proof [Proof of Theorem 23] For simplicity of writing, we suppose $K^+ = K$ in this proof. We only prove the result for the respondents; the proof for the items is the same. Under the conditions of Theorem 23, the result of Theorem 12 is satisfied, and, with a slight change in the proof, we can get

$$\inf_{F \in A_K} \frac{\sum_{i=1}^N \|\hat\theta_i - F(b_{\vartheta_i^*}^*)\|^2}{N} = o_p(1).$$

Consequently, there exists an isometry $F_{N,J}^*$ such that

$$\frac{\sum_{i=1}^N \|\hat\theta_i - F_{N,J}^*(b_{\vartheta_i^*}^*)\|^2}{N} = o_p(1), \quad (17)$$

noting that $b_{\vartheta_i^*}^* = \theta_i^*$.

Lemma 35 Under the same conditions as Theorem 23, suppose that

$$\frac{\sum_{i=1}^N \|\hat\theta_i - F_{N,J}^*(b_{\vartheta_i^*}^*)\|^2}{N} = o_p(1).$$

Then we have

$$1 - \max_{\zeta \in B_{k_1}} \frac{\sum_{i=1}^N 1_{\{\vartheta_i^* = \zeta(\hat\vartheta_i)\}}}{N} = o_p(1).$$

With Lemma 35, we complete the proof for the respondents.

B.3 Proof of Propositions

Proof [Proof of Proposition 3] It suffices to prove the case where $\sum_{i=1}^n x_i = 0$ and $\sum_{i=1}^n y_i = 0$. Denote $D = (d_{ij})_{n \times n}$, where $d_{ij} = \|x_i - x_j\|^2 = \|y_i - y_j\|^2$, and let $B = (b_{ij})_{n \times n} = -\frac{1}{2} J D J$, where $J = I_n - 1_n 1_n^\top / n$. Then $B$ is the inner product matrix of both $\{x_1, ..., x_n\}$ and $\{y_1, ..., y_n\}$. That is, $b_{ij} = x_i^\top x_j = y_i^\top y_j$ for $1 \le i, j \le n$. We refer readers to Critchley (1988) for the relation between inner product matrices and distance matrices. So if we denote $P_1 = (x_1, ..., x_n)^\top$ and $P_2 = (y_1, ..., y_n)^\top$, then we have $P_1 P_1^\top = P_2 P_2^\top = B$. Let $P_1^\top = Q_1 R_1$ and $P_2^\top = Q_2 R_2$ be the QR decompositions (see Cheney and Kincaid, 2009) of $P_1^\top$ and $P_2^\top$, where $Q_1, Q_2$ are $k \times k$ orthogonal matrices and $R_1, R_2$ are $k \times n$ upper-triangular matrices with non-negative diagonal entries. Since $x_i^\top x_j = y_i^\top y_j$ for $1 \le i, j \le n$, it is not difficult to check that $R_1 = R_2$. If we define $O = Q_2 Q_1^\top$, then

$$O P_1^\top = O Q_1 R_1 = Q_2 Q_1^\top Q_1 R_1 = Q_2 R_1 = Q_2 R_2 = P_2^\top,$$

which means $O x_i = y_i$ for $1 \le i \le n$. We complete the proof.
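A small numerical illustration of the double-centering step used in the proof of Proposition 3: for centered points, the inner product matrix $B = -\frac{1}{2} J D J$ recovers the configuration from the squared distance matrix $D$ up to an orthogonal transformation. This is a sketch under our own naming; the Procrustes step at the end is one convenient way to exhibit the orthogonal map $O$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 2
X = rng.normal(size=(n, k))
X -= X.mean(axis=0)                                  # center, as assumed in the proof
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distance matrix

J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ D @ J                                 # inner product matrix, B = X X^T
w, V = np.linalg.eigh(B)                             # eigenvalues in ascending order
Y = V[:, -k:] * np.sqrt(np.clip(w[-k:], 0, None))    # one configuration with Y Y^T = B

# Y matches X up to an orthogonal map O, found here by orthogonal Procrustes
U, _, Vt = np.linalg.svd(Y.T @ X)
O = U @ Vt
print(np.allclose(Y @ O, X, atol=1e-6))              # True: O maps Y onto X
```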
Proof [Proof of Proposition 8] We first introduce the following lemma.

Lemma 36 There exists a collection of anchor points $\{b_1^*, ..., b_{k_1}^*\}, \{c_1^*, ..., c_{k_2}^*\} \subset \mathrm{int}(G)$, where $G$ is the ball defined in Proposition 8.

We fix such a collection of anchor points. For any $\epsilon > 0$, we denote by $B_k^*(\epsilon)$ and $G_l^*(\epsilon)$, for $1 \le k \le k_1$ and $1 \le l \le k_2$, balls of radius $\epsilon$ centered at $b_k^*$ and $c_l^*$, respectively. For sufficiently small $\epsilon > 0$, it is easy to see that for any $b_1 \in B_1^*(\epsilon), ..., b_{k_1} \in B_{k_1}^*(\epsilon)$ and $c_1 \in G_1^*(\epsilon), ..., c_{k_2} \in G_{k_2}^*(\epsilon)$, the sets $\{b_1, ..., b_{k_1}\}, \{c_1, ..., c_{k_2}\}$ are a collection of anchor points in $\mathbb{R}^K$. Therefore, requirement (1) of A3 holds. We define

$$\beta_\epsilon := \frac{1}{2} \min_{1 \le k \le k_1,\ 1 \le l \le k_2} \{P_1(B_k^*(\epsilon)), P_2(G_l^*(\epsilon))\}$$

and use $A_{N,J}$ to denote the event

$$\frac{1}{N}\sum_{i=1}^N 1_{\{\theta_i^* \in B_k^*(\epsilon)\}} \ge P_1(B_k^*(\epsilon)) - \beta_\epsilon,\ k = 1, ..., k_1, \quad \frac{1}{J}\sum_{j=1}^J 1_{\{a_j^* \in G_l^*(\epsilon)\}} \ge P_2(G_l^*(\epsilon)) - \beta_\epsilon,\ l = 1, ..., k_2, \quad (18)$$

where $P_1(B_k^*(\epsilon))$ and $P_2(G_l^*(\epsilon))$ denote the probability measures of $B_k^*(\epsilon)$ and $G_l^*(\epsilon)$ with respect to $P_1$ and $P_2$, respectively. By Hoeffding's inequality, we have

$$\Pr((18)\ \text{holds}) \ge 1 - 2k_1 \exp(-\tfrac{1}{2}N\beta_\epsilon^2) - 2k_2 \exp(-\tfrac{1}{2}J\beta_\epsilon^2). \quad (19)$$

So $\Pr(A_{N,J}) \to 1$ as $N, J$ grow. On $A_{N,J}$, we have

$$\frac{1}{N}\sum_{i=1}^N 1_{\{\theta_i^* \in B_k^*(\epsilon)\}} \ge \beta_\epsilon,\ 1 \le k \le k_1, \quad \frac{1}{J}\sum_{j=1}^J 1_{\{a_j^* \in G_l^*(\epsilon)\}} \ge \beta_\epsilon,\ 1 \le l \le k_2; \quad (20)$$

that is, (20) holds on $A_{N,J}$. Then requirement (2) of A3 holds almost surely.

Proof [Proof of Proposition 11] Proposition 11 is a special case of Proposition 20. See the proof of Proposition 20.

Proof [Proof of Proposition 20] The proof of Proposition 20 is similar to that of Theorem 1 of Davenport et al. (2014). We only state the main steps. We denote by $D$ the partial distance matrix of $(\theta_1, ..., \theta_N)$ and $(a_1, ..., a_J)$ (to simplify the notation, we omit the subscripts $N$ and $J$ of $D$). Since the likelihood function depends on $(\theta_1, ..., \theta_N)$ and $(a_1, ..., a_J)$ only through their partial distance matrix, we re-parameterize the likelihood function by $D$. We denote $l_{\Omega,Y}(D) = \log L_\Omega(\theta_1, ..., \theta_N, a_1, ..., a_J)$, where the subscripts $\Omega = (\omega_{ij})_{N \times J}$ and $Y = (Y_{ij})_{N \times J}$ indicate the random variables in the likelihood function and $D$ contains the parameters. Let

$$\bar l_{\Omega,Y}(D) = l_{\Omega,Y}(D) - l_{\Omega,Y}(0), \quad (21)$$

where $0$ represents the $N \times J$ matrix whose elements are all 0, and let

$$\mathcal{G} = \big\{D \in \mathbb{R}^{N \times J} : \|D\|_* \le 4M^2\sqrt{(K^+ + 2)NJ}\big\}. \quad (22)$$

Lemma 37 Under the same conditions as Proposition 20, there exist constants $C_1$ and $C_2$ such that

$$\Pr\left(\sup_{D \in \mathcal{G}} |\bar l_{\Omega,Y}(D) - \mathbb{E}\,\bar l_{\Omega,Y}(D)| \ge 4M^2 C_1 L_{4M^2}\sqrt{(K^+ + 2)\big(n(N + J) + NJ\log(NJ)\big)}\right) \le \frac{C_2}{N + J}.$$

Let $\mathcal{H} = \{D : d_{ij} = \|\theta_i - a_j\|^2, \text{ where } \|\theta_i\|, \|a_j\| \le M,\ i = 1, ..., N,\ j = 1, ..., J\}$. It is easy to check that $\mathcal{H} \subset \mathcal{G}$. Consequently,

$$\Pr\left(\sup_{D \in \mathcal{H}} |\bar l_{\Omega,Y}(D) - \mathbb{E}\,\bar l_{\Omega,Y}(D)| \ge 4M^2 C_1 L_{4M^2}\sqrt{(K^+ + 2)\big(n(N + J) + NJ\log(NJ)\big)}\right) \le \Pr\left(\sup_{D \in \mathcal{G}} |\bar l_{\Omega,Y}(D) - \mathbb{E}\,\bar l_{\Omega,Y}(D)| \ge 4M^2 C_1 L_{4M^2}\sqrt{(K^+ + 2)\big(n(N + J) + NJ\log(NJ)\big)}\right).$$

Given the above development, Proposition 20 is implied by the following lemma.

Lemma 38 Under the same conditions as Proposition 20,

$$\frac{1}{NJ}\|D_{N,J}^* - \hat D_{N,J}\|_F^2 \le \frac{16\beta_{4M^2}}{n}\sup_{D \in \mathcal{H}} |\bar l_{\Omega,Y}(D) - \mathbb{E}\,\bar l_{\Omega,Y}(D)|.$$

Therefore, with probability at least $1 - C_2/(N + J)$,

$$\frac{1}{NJ}\|D_{N,J}^* - \hat D_{N,J}\|_F^2 \le \frac{64 C_1 M^2 L_{4M^2}\beta_{4M^2}\sqrt{(K^+ + 2)\big(n(N + J) + NJ\log(NJ)\big)}}{n}.$$

We complete the proof by absorbing $64\sqrt{K^+ + 2}$ into $C_1$.

Proof [Proof of Proposition 22] We use $A_{N,J}$ to denote the event that the result in Proposition 20 holds. By Theorem 6 and Proposition 8, on $A_{N,J}$,

$$\inf_{F \in A_{K^+}} \left(\frac{\sum_{i=1}^N \|\theta_i^+ - F(\hat\theta_i^\Omega)\|^2}{N} + \frac{\sum_{j=1}^J \|a_j^+ - F(\hat a_j^\Omega)\|^2}{J}\right)$$

goes to 0 as $N, J$ grow to infinity. Since $\Pr(A_{N,J}) \to 1$, we complete the proof.

B.4 Proof of Lemmas

Proof [Proof of Lemma 29] Let $\tau_1 = [x_1, ..., x_n]$, $\tau_2 = [y_1, ..., y_n]$, and $\tau_3 = [z_1, ..., z_n]$. Define

$$\tilde d(\tau_1, \tau_2) := \min_{F \in A_K} \max_i \|F(y_i) - x_i\|,$$

and it is easy to check that $\tilde d(\tau_1, \tau_2) \le d(\tau_1, \tau_2) \le \sqrt{n}\,\tilde d(\tau_1, \tau_2)$. So we just need to verify that the function $\tilde d(\cdot, \cdot)$ satisfies the triangle inequality. Let isometries $F_{21}, F_{31}$ satisfy

$$\tilde d(\tau_1, \tau_2) = \max_i \|F_{21}(y_i) - x_i\| = \|F_{21}(y_l) - x_l\|, \quad \tilde d(\tau_1, \tau_3) = \max_i \|F_{31}(z_i) - x_i\| = \|F_{31}(z_m) - x_m\|.$$

Then

$$\tilde d(\tau_2, \tau_3) \le \max_i \|F_{31}(z_i) - F_{21}(y_i)\| \le \max_i \{\|F_{31}(z_i) - x_i\| + \|F_{21}(y_i) - x_i\|\} \le \|F_{21}(y_l) - x_l\| + \|F_{31}(z_m) - x_m\| = \tilde d(\tau_1, \tau_2) + \tilde d(\tau_1, \tau_3).$$

We complete the proof.
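In the spirit of the configuration distance used in Lemma 29, the following sketch aligns one point configuration to another over isometries (orthogonal map plus translation) and evaluates the resulting distance; the orthogonal Procrustes solution via the SVD is our choice of minimizer for the sum-of-squares objective, and the helper name `align` is ours.

```python
import numpy as np

def align(X, Y):
    """Return F(X), where F is the Procrustes isometry mapping X toward Y."""
    xc, yc = X.mean(axis=0), Y.mean(axis=0)
    U, _, Vt = np.linalg.svd((X - xc).T @ (Y - yc))
    O = U @ Vt                                # orthogonal (may include a reflection)
    return (X - xc) @ O + yc

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
angle = 0.7                                   # build Y as an isometric copy of X
R = np.array([[np.cos(angle), -np.sin(angle)], [np.sin(angle), np.cos(angle)]])
Y = X @ R.T + np.array([3.0, -1.0])

FX = align(X, Y)
dist = np.sqrt(((FX - Y) ** 2).sum())         # ~0: the two configurations coincide
print(f"aligned distance = {dist:.2e}")
```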
Proof [Proof of Lemma 31] Suppose otherwise. Then there exist $\epsilon_0 > 0$ and sequences $\{\tau_1^{(n)}\}_{n=1}^\infty \subset H_{k_1+k_2,K^+,M}$ and $\{\tau_2^{(n)}\}_{n=1}^\infty \subset \mathcal{B}$ such that

$$\|\Phi_{k_1,k_2,K^+}(\tau_1^{(n)}) - \Phi_{k_1,k_2,K^+}(\tau_2^{(n)})\|_F < \frac{1}{n} \quad \text{and} \quad d(\tau_1^{(n)}, \tau_2^{(n)}) > \epsilon_0.$$

Since both $H_{k_1+k_2,K^+,M}$ and $\mathcal{B}$ are compact, there exists a subsequence $\{n_k\}_{k=1}^\infty \subset \mathbb{N}_+$ such that $\lim_{k\to\infty} \tau_1^{(n_k)} = \bar\tau \in H_{k_1+k_2,K^+,M}$ and $\lim_{k\to\infty} \tau_2^{(n_k)} = \tau_0 \in \mathcal{B}$. The two configurations $\bar\tau$ and $\tau_0$ have the same partial distance matrix, but $d(\bar\tau, \tau_0) \ge \epsilon_0$. This is a contradiction, because $\tau_0 \in \mathcal{B}$ is the only configuration of its partial distance matrix, by the requirement on $\mathcal{B}$.

Proof [Proof of Lemma 32] For a collection of points $\{x, b_1^*, ..., b_{k_1}^*\}, \{c_1^*, ..., c_{k_2}^*\}$, it is not difficult to verify that condition A2 holds, so we only need to verify A1. To verify A1, it suffices to show that if $\{b_1, ..., b_{k_1}\}, \{c_1, ..., c_{k_2}\}$ is a collection of anchor points, then for any $x \in B_0^K(C)$, $[x, b_1, ..., b_{k_1}, c_1, ..., c_{k_2}]$ is the unique configuration corresponding to its $(k_1 + 1) \times k_2$ partial distance matrix. Suppose that $\tau = [x, b_1, ..., b_{k_1}, c_1, ..., c_{k_2}]$ and $\tau' = [x', b_1', ..., b_{k_1}', c_1', ..., c_{k_2}']$ satisfy $\Phi_{k_1+1,k_2,K}(\tau) = \Phi_{k_1+1,k_2,K}(\tau')$. Then $\Phi_{k_1,k_2,K}([b_1, ..., b_{k_1}, c_1, ..., c_{k_2}]) = \Phi_{k_1,k_2,K}([b_1', ..., b_{k_1}', c_1', ..., c_{k_2}'])$. Since $\{b_1, ..., b_{k_1}\}, \{c_1, ..., c_{k_2}\}$ are a collection of anchor points, it follows that $[b_1, ..., b_{k_1}, c_1, ..., c_{k_2}] = [b_1', ..., b_{k_1}', c_1', ..., c_{k_2}']$. Without loss of generality, we suppose $b_l = b_l'$ and $c_m = c_m'$. Then the two configurations $[x, c_1, ..., c_{k_2}]$ and $[x', c_1, ..., c_{k_2}]$ have the same complete distance matrix, which further implies that $[x, c_1, ..., c_{k_2}] = [x', c_1, ..., c_{k_2}]$. Since $c_1, ..., c_{k_2}$ affinely span $\mathbb{R}^K$, it is not difficult to see that $x = x'$. Then we get $\tau = \tau'$, and A1 is verified.

Proof [Proof of Lemma 33] We define

$$S_{N,J}^*(\epsilon) = [B_1^*(\epsilon), ..., B_{k_1}^*(\epsilon), G_1^*(\epsilon), ..., G_{k_2}^*(\epsilon)] \subset H_{k_1+k_2,K,M}, \quad \tilde S_{N,J}(\epsilon) = [B_1(\epsilon), ..., B_{k_1}(\epsilon), G_1(\epsilon), ..., G_{k_2}(\epsilon)] \subset H_{k_1+k_2,K^+,M},$$

where $B_k^*(\epsilon), B_k(\epsilon), G_l^*(\epsilon), G_l(\epsilon)$ are defined in the proof of Theorem 6. Let

$$\sigma_{N,J} := d(\tilde S_{N,J}(\epsilon), S_{N,J}^*(\epsilon)). \quad (23)$$

By (13) and the triangle inequality, there exists an isometry $F_{N,J} \in A_{K^+}$ such that for all $x_k^* \in B_k^*(\epsilon)$, $y_l^* \in G_l^*(\epsilon)$, $\tilde x_k \in B_k(\epsilon)$, $\tilde y_l \in G_l(\epsilon)$,

$$\|F_{N,J}(\tilde x_k) - x_k^{*+}\| \le 4\epsilon + \sigma_{N,J},\ 1 \le k \le k_1, \quad \|F_{N,J}(\tilde y_l) - y_l^{*+}\| \le 4\epsilon + \sigma_{N,J},\ 1 \le l \le k_2. \quad (24)$$

In what follows, we show that $\sigma_{N,J} \le \epsilon$ for $N, J$ large enough. We first define

$$\gamma_{N,J} := \inf\{\|\Phi_{k_1,k_2,K^+}(\tilde\tau) - \Phi_{k_1,k_2,K^+}(\tau^*)\|_F : \tilde\tau \in \tilde S_{N,J}(\epsilon),\ \tau^* \in S_{N,J}^*(\epsilon)\}, \quad (25)$$

and we have

$$\gamma_{N,J}^2 (p_\epsilon N)(p_\epsilon J) \le \|\tilde D_{N,J} - D_{N,J}^*\|_F^2 = o(NJ),$$

which leads to

$$\gamma_{N,J} = o(1). \quad (26)$$

By (25), there exist $\tilde\tau \in \tilde S_{N,J}(\epsilon)$ and $\tau^* \in S_{N,J}^*(\epsilon)$ such that $\|\Phi_{k_1,k_2,K^+}(\tilde\tau) - \Phi_{k_1,k_2,K^+}(\tau^*)\|_F \le 2\gamma_{N,J}$. Then by (23), we have

$$\sigma_{N,J} = d(\tilde S_{N,J}(\epsilon), S_{N,J}^*(\epsilon)) \le d(\tau^*, \tilde\tau). \quad (27)$$

As shown at the beginning of the proof of Theorem 6, and according to Definition 4, $\tau^*$ is the unique configuration corresponding to its $k_1 \times k_2$ partial distance matrix. Since $\tilde\tau \in \tilde S_{N,J}(\epsilon) \subset H_{k_1+k_2,K^+,M}$, we know by Lemma 31 that $d(\tau^*, \tilde\tau) \to 0$ as $N, J$ grow to infinity, and thus

$$d(\tau^*, \tilde\tau) < \epsilon \quad (28)$$

for $N, J$ large enough. Finally, since $B_k^*(\epsilon), G_l^*(\epsilon) \subset B_0^K(M)$ and $B_k(\epsilon), G_l(\epsilon) \subset B_0^{K^+}(M)$, we have, for $N, J$ large enough, $\|F_{N,J}(x)\| \le 4M$ for $x \in B_0^{K^+}(M)$. To see this, suppose there exists $x \in B_0^{K^+}(M)$ such that $\|F_{N,J}(x)\| > 4M$; then, by simple geometry, $\min_{x \in B_0^{K^+}(M)} \|F_{N,J}(x) - x\| > M$. According to (24) and (28), we would then get $M < \|F_{N,J}(\tilde x_k) - x_k^{*+}\| \le 4\epsilon + \sigma_{N,J} \le 5\epsilon$, which contradicts the fact that $\epsilon < M/5$.

Proof [Proof of Lemma 34] Let $\bar c_1, ..., \bar c_{k_2}$ denote the centers of $G_1^*(\epsilon), ..., G_{k_2}^*(\epsilon)$, and let $\bar c_l^+ = (\bar c_l^\top, 0^\top)^\top \in \mathbb{R}^{K^+}$. We first give the following lemma.
Lemma 39 For any $\tau_1 = [x, x_1, ..., x_{k_2}] \in [B_0^K(M), G_1^*(\epsilon), ..., G_{k_2}^*(\epsilon)]$ and $\tau_2 = [y, y_1, ..., y_{k_2}] \in [B_0^{K^+}(M), B_{\bar c_1^+}(\epsilon), ..., B_{\bar c_{k_2}^+}(\epsilon)]$, we have

$$\|x^+ - y\| \le c\left(d(\tau_1, \tau_2) + \sqrt{\sum_{l=1}^{k_2}\|x_l^+ - y_l\|^2}\right)$$

for a constant $c$ that depends only on the set $\{\bar c_1, ..., \bar c_{k_2}\}$ and $M$.

Define

$$H_1(\epsilon) := \{i \in [N] : \|F_{N,J}(\tilde\theta_i) - \theta_i^+\| > 5\max(c, 1)\sqrt{k_1 + k_2}\,\epsilon\} \quad (29)$$

and

$$H_2(\epsilon) := \{j \in [J] : \|F_{N,J}(\tilde a_j) - a_j^+\| > 5\max(c, 1)\sqrt{k_1 + k_2}\,\epsilon\}, \quad (30)$$

where $c$ is the constant in Lemma 39. We set the constant $\kappa$ in Lemma 34 to be $5\max(c, 1)\sqrt{k_1 + k_2}$, and then we have $|H_1(\epsilon)| = N\lambda_{1,N,J}$ and $|H_2(\epsilon)| = J\lambda_{2,N,J}$. Note that $I_1(\epsilon) \cap H_1(\epsilon) = \emptyset$ and $I_2(\epsilon) \cap H_2(\epsilon) = \emptyset$ for $N, J$ large. We choose $i_1, ..., i_{k_1} \in I_1(\epsilon)$ and $j_1, ..., j_{k_2} \in I_2(\epsilon)$ such that $\theta_{i_k}^* \in B_k^*(\epsilon)$, $\tilde\theta_{i_k} \in B_k(\epsilon)$, $a_{j_l}^* \in G_l^*(\epsilon)$, and $\tilde a_{j_l} \in G_l(\epsilon)$ for $1 \le k \le k_1$ and $1 \le l \le k_2$. For any $i \in H_1(\epsilon)$, we consider the following configurations:

$$\tau^* = [\theta_i^*, \theta_{i_1}^*, ..., \theta_{i_{k_1}}^*, a_{j_1}^*, ..., a_{j_{k_2}}^*] \in H_{k_1+k_2+1,K,M}, \quad \tilde\tau = [\tilde\theta_i, \tilde\theta_{i_1}, ..., \tilde\theta_{i_{k_1}}, \tilde a_{j_1}, ..., \tilde a_{j_{k_2}}] \in H_{k_1+k_2+1,K^+,M},$$

and

$$\tau_1^* = [\theta_i^*, a_{j_1}^*, ..., a_{j_{k_2}}^*] \in [B_0^K(M), G_1^*(\epsilon), ..., G_{k_2}^*(\epsilon)], \quad \tilde\tau_1 = [\tilde\theta_i, \tilde a_{j_1}, ..., \tilde a_{j_{k_2}}] \in [B_0^{K^+}(M), G_1(\epsilon), ..., G_{k_2}(\epsilon)].$$

It is obvious that $d(\tilde\tau, \tau^*) \ge d(\tilde\tau_1, \tau_1^*)$. By Lemma 33, we have

$$\sqrt{\sum_{l=1}^{k_2}\|F_{N,J}(\tilde a_{j_l}) - a_{j_l}^{*+}\|^2} \le 5\sqrt{k_2}\,\epsilon.$$

Combining this with (29) and Lemma 39, we have $d(\tilde\tau_1, \tau_1^*) > 5\sqrt{k_1 + k_2}\,\epsilon$, which leads to

$$d(\tilde\tau, \tau^*) > 5\sqrt{k_1 + k_2}\,\epsilon. \quad (31)$$

According to Lemma 32, $\{\theta_i^*, \theta_{i_1}^*, ..., \theta_{i_{k_1}}^*\}, \{a_{j_1}^*, ..., a_{j_{k_2}}^*\}$ are a collection of anchor points. Let $\tilde D, D^* \in P_{k_1+1,k_2,K^+,M}$ be the partial distance matrices of $\tilde\tau$ and $\tau^*$, respectively. Combining (31) and Lemma 31, there exists a constant $\delta_\epsilon > 0$ such that

$$\|\tilde D - D^*\|_F \ge \delta_\epsilon. \quad (32)$$

For each $i \in H_1(\epsilon)$, we choose $i_1, ..., i_{k_1} \in I_1(\epsilon)$ to form a group $\{i, i_1, ..., i_{k_1}\} \subset [N]$ such that $(\theta_i^*, \theta_{i_1}^*, ..., \theta_{i_{k_1}}^*) \in B_0^K(M) \times B_1^*(\epsilon) \times \cdots \times B_{k_1}^*(\epsilon)$ and $(\tilde\theta_i, \tilde\theta_{i_1}, ..., \tilde\theta_{i_{k_1}}) \in B_0^{K^+}(M) \times B_1(\epsilon) \times \cdots \times B_{k_1}(\epsilon)$. We can find at least $\min\{\lambda_{1,N,J}, p_\epsilon\} N$ such groups that are mutually exclusive. We can also find at least $p_\epsilon J$ mutually exclusive groups $\{j_1, ..., j_{k_2}\} \subset [J]$ such that $(a_{j_1}^*, ..., a_{j_{k_2}}^*) \in G_1^*(\epsilon) \times \cdots \times G_{k_2}^*(\epsilon)$ and $(\tilde a_{j_1}, ..., \tilde a_{j_{k_2}}) \in G_1(\epsilon) \times \cdots \times G_{k_2}(\epsilon)$. By (4) and (32), we have

$$\min\{\lambda_{1,N,J}, p_\epsilon\}\,N\,p_\epsilon J\,\delta_\epsilon^2 \le o(NJ).$$

So $\min\{\lambda_{1,N,J}, p_\epsilon\} = o(1)$, which means $\lambda_{1,N,J} \to 0$ as $N, J$ grow to infinity. A similar result holds for $\lambda_{2,N,J}$, and we do not repeat the argument.

Proof [Proof of Lemma 35] Consider the K-means clustering of the person points in Algorithm 2. We define the loss function for K-means clustering as

$$L(\vartheta_1, ..., \vartheta_N) = \frac{1}{N}\sum_{i=1}^N \|\hat\theta_i - \mu_{\vartheta_i}\|^2,$$

where $\vartheta_i \in \{1, ..., k_1\}$ represents the cluster membership of person $i$ and

$$\mu_k = \frac{\sum_{i=1}^N \hat\theta_i 1_{\{\vartheta_i = k\}}}{\sum_{i=1}^N 1_{\{\vartheta_i = k\}}}$$

denotes the centroid of the $k$th cluster. Under the conditions of Theorem 23, the K-means clustering converges to the global optimum, which implies that

$$L(\hat\vartheta_1, ..., \hat\vartheta_N) = \min_{\vartheta_i \in \{1, ..., k_1\},\ i = 1, ..., N} L(\vartheta_1, ..., \vartheta_N). \quad (33)$$

So for any isometry $F \in A_K$,

$$\frac{1}{N}\sum_{i=1}^N \|\hat\theta_i - \mu_{\hat\vartheta_i}\|^2 \le \frac{1}{N}\sum_{i=1}^N \|\hat\theta_i - F(b_{\vartheta_i^*}^*)\|^2.$$

By the triangle inequality,

$$\left(\frac{1}{N}\sum_{i=1}^N \|\mu_{\hat\vartheta_i} - F(b_{\vartheta_i^*}^*)\|^2\right)^{1/2} \le \left(\frac{1}{N}\sum_{i=1}^N \|\mu_{\hat\vartheta_i} - \hat\theta_i\|^2\right)^{1/2} + \left(\frac{1}{N}\sum_{i=1}^N \|\hat\theta_i - F(b_{\vartheta_i^*}^*)\|^2\right)^{1/2} \le 2\left(\frac{1}{N}\sum_{i=1}^N \|\hat\theta_i - F(b_{\vartheta_i^*}^*)\|^2\right)^{1/2}.$$

Define $d^* = \min_{i \ne j} \|b_i^* - b_j^*\|$ and, for $F \in A_K$, define $A_F := \{1 \le i \le N : \|\mu_{\hat\vartheta_i} - F(b_{\vartheta_i^*}^*)\| < d^*/2\}$, and denote $A_F^c := \{1, ..., N\} \setminus A_F$. Then

$$1 - \frac{1}{N}\sum_{i \in A_{F_{N,J}^*}} 1 = \frac{1}{N}\sum_{i \in A_{F_{N,J}^*}^c} 1 \le \frac{4}{d^{*2}}\cdot\frac{1}{N}\sum_{i \in A_{F_{N,J}^*}^c}\|\mu_{\hat\vartheta_i} - F_{N,J}^*(b_{\vartheta_i^*}^*)\|^2 \le \frac{4}{d^{*2}}\cdot\frac{1}{N}\sum_{i=1}^N\|\mu_{\hat\vartheta_i} - F_{N,J}^*(b_{\vartheta_i^*}^*)\|^2 \le \frac{16}{d^{*2}}\cdot\frac{1}{N}\sum_{i=1}^N\|\hat\theta_i - F_{N,J}^*(b_{\vartheta_i^*}^*)\|^2 = o_p(1). \quad (34)$$

Lemma 40 Under the same conditions as Lemma 35, if there exists $\zeta_1 \in B_{k_1}$ satisfying $\|\mu_{\zeta_1(l)} - F_{N,J}^*(b_l^*)\| < d^*/2$, $l = 1, ..., k_1$, where $\mu_l$ is the centroid of the $l$th cluster, $F_{N,J}^*$ is defined in (17), and $d^*$ is defined above, then there exists $\zeta_2 \in B_{k_1}$ such that $\hat\vartheta_i = \zeta_2(\vartheta_i^*)$ for all $i \in A_{F_{N,J}^*}$.
Let $\Omega_{N,J} := \{\omega : \exists \zeta \in B_{k_1} \text{ s.t. } \|\mu_{\zeta(l)}(\omega) - F_{N,J}^*(b_l^*)\| < d^*/2,\ l = 1, ..., k_1\}$. Notice that $\Omega_{N,J}$ is a subset of the whole probability space. By Lemma 40, for any $\omega \in \Omega_{N,J}$, there exists $\zeta_{N,J} \in B_{k_1}$, corresponding to $\zeta_2$ in Lemma 40, such that

$$\max_{\zeta \in B_{k_1}} \frac{\sum_{i=1}^N 1_{\{\vartheta_i^* = \zeta(\hat\vartheta_i(\omega))\}}}{N} \ge \frac{\sum_{i=1}^N 1_{\{\hat\vartheta_i = \zeta_{N,J}(\vartheta_i^*)\}}}{N} \ge \frac{1}{N}\sum_{i \in A_{F_{N,J}^*}} 1.$$

Lemma 41 Under the same conditions as Lemma 35, we have $\lim_{N,J\to\infty} \Pr(\Omega_{N,J}) = 1$, where $\Omega_{N,J}$ is defined above.

By Lemma 41 and (34), we complete the proof.

Proof [Proof of Lemma 36] Without loss of generality, we suppose that the ball $G \subset \mathbb{R}^K$ is centered at the origin. By Theorem 2.1 of Alfakih (2005), we know there exist $k_1, k_2 \ge K + 1$ and two sets of points $\{b_1^*, ..., b_{k_1}^*\}, \{c_1^*, ..., c_{k_2}^*\} \subset \mathrm{int}(G)$ satisfying condition A2 whose partial distance matrix $D^*$ has a unique configuration. Furthermore, points near the $b_i^*, c_j^*$ also have this property. Specifically, there exists $\epsilon > 0$ such that for $b_i \in B_{b_i^*}^K(\epsilon) \subset G$ and $c_j \in B_{c_j^*}^K(\epsilon) \subset G$, the sets $\{b_1, ..., b_{k_1}\}, \{c_1, ..., c_{k_2}\}$ satisfy condition A2 and their partial distance matrix $D$ has a unique configuration. Then, by Lemma 31, condition A1 holds, and $\{b_1^*, ..., b_{k_1}^*\}, \{c_1^*, ..., c_{k_2}^*\}$ are anchor points in $\mathbb{R}^K$.

Proof [Proof of Lemma 37] The proof of Lemma 37 is similar to that of Lemma A.1 of Davenport et al. (2014).

Proof [Proof of Lemma 38] We have

$$0 \le \bar l_{\Omega,Y}(\hat D_{N,J}) - \bar l_{\Omega,Y}(D_{N,J}^*) = \bar l_{\Omega,Y}(\hat D_{N,J}) - \mathbb{E}\,\bar l_{\Omega,Y}(\hat D_{N,J}) + \mathbb{E}\,\bar l_{\Omega,Y}(\hat D_{N,J}) - \mathbb{E}\,\bar l_{\Omega,Y}(D_{N,J}^*) + \mathbb{E}\,\bar l_{\Omega,Y}(D_{N,J}^*) - \bar l_{\Omega,Y}(D_{N,J}^*) \le \mathbb{E}\,\bar l_{\Omega,Y}(\hat D_{N,J}) - \mathbb{E}\,\bar l_{\Omega,Y}(D_{N,J}^*) + 2\sup_{D \in \mathcal{H}} |\bar l_{\Omega,Y}(D) - \mathbb{E}\,\bar l_{\Omega,Y}(D)|.$$

So

$$\mathbb{E}\,\bar l_{\Omega,Y}(D_{N,J}^*) - \mathbb{E}\,\bar l_{\Omega,Y}(\hat D_{N,J}) \le 2\sup_{D \in \mathcal{H}} |\bar l_{\Omega,Y}(D) - \mathbb{E}\,\bar l_{\Omega,Y}(D)|.$$

Notice that

$$\mathbb{E}\,\bar l_{\Omega,Y}(D_{N,J}^*) - \mathbb{E}\,\bar l_{\Omega,Y}(\hat D_{N,J}) = \frac{n}{NJ}\sum_{i,j}\left[f(d_{ij}^*)\log\frac{f(d_{ij}^*)}{f(\hat d_{ij})} + (1 - f(d_{ij}^*))\log\frac{1 - f(d_{ij}^*)}{1 - f(\hat d_{ij})}\right].$$

For two distributions $P$ and $Q$, let $D_{KL}(P\|Q)$ denote the Kullback-Leibler divergence

$$D_{KL}(P\|Q) := \int p(x)\log\frac{p(x)}{q(x)}\,dx,$$

where $p(x)$ and $q(x)$ are the density functions of $P$ and $Q$, respectively. For $0 < p, q < 1$, we use

$$D_{KL}(p\|q) := p\log\Big(\frac{p}{q}\Big) + (1 - p)\log\Big(\frac{1 - p}{1 - q}\Big)$$

to denote the Kullback-Leibler divergence between two Bernoulli distributions with parameters $p$ and $q$, respectively. For $P, Q \in (0, 1)^{N \times J}$, we define

$$D_{KL}(P\|Q) := \frac{1}{NJ}\sum_{i,j} D_{KL}(P_{ij}\|Q_{ij}).$$

For a partial distance matrix $D_{N,J}$, denote by $f(D_{N,J})$ the matrix $(f(d_{ij}))_{N \times J}$. So, from the above, we know that

$$n\,D_{KL}(f(D_{N,J}^*)\|f(\hat D_{N,J})) \le 2\sup_{D \in \mathcal{H}} |\bar l_{\Omega,Y}(D) - \mathbb{E}\,\bar l_{\Omega,Y}(D)|.$$

Still for $0 < p, q < 1$, let

$$d_H^2(p, q) := (\sqrt{p} - \sqrt{q})^2 + (\sqrt{1 - p} - \sqrt{1 - q})^2$$

denote the squared Hellinger distance between two Bernoulli distributions with parameters $p$ and $q$, respectively. For $P, Q \in (0, 1)^{N \times J}$, we define

$$d_H^2(P, Q) := \frac{1}{NJ}\sum_{i,j} d_H^2(P_{ij}, Q_{ij}).$$

It is easy to check that $d_H^2(p, q) \le D_{KL}(p\|q)$. So

$$d_H^2(f(D_{N,J}^*), f(\hat D_{N,J})) \le \frac{2}{n}\sup_{D \in \mathcal{H}} |\bar l_{\Omega,Y}(D) - \mathbb{E}\,\bar l_{\Omega,Y}(D)|.$$

By Lemma A.2 of Davenport et al. (2014), we have

$$\frac{1}{NJ}\|\hat D_{N,J} - D_{N,J}^*\|_F^2 \le 8\beta_{4M^2}\,d_H^2(f(D_{N,J}^*), f(\hat D_{N,J})) \le \frac{16\beta_{4M^2}}{n}\sup_{D \in \mathcal{H}} |\bar l_{\Omega,Y}(D) - \mathbb{E}\,\bar l_{\Omega,Y}(D)|.$$
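A quick numerical check, under our own naming, of the inequality $d_H^2(p, q) \le D_{KL}(p\|q)$ for Bernoulli distributions, which is the step used above to pass from the Kullback-Leibler bound to the Hellinger bound.

```python
import numpy as np

def kl_bernoulli(p, q):
    # D_KL(p || q) for Bernoulli parameters p, q in (0, 1)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def hellinger_sq_bernoulli(p, q):
    # d_H^2(p, q) = (sqrt(p) - sqrt(q))^2 + (sqrt(1-p) - sqrt(1-q))^2
    return (np.sqrt(p) - np.sqrt(q)) ** 2 + (np.sqrt(1 - p) - np.sqrt(1 - q)) ** 2

rng = np.random.default_rng(0)
p = rng.uniform(1e-3, 1 - 1e-3, size=100_000)
q = rng.uniform(1e-3, 1 - 1e-3, size=100_000)
print(bool(np.all(hellinger_sq_bernoulli(p, q) <= kl_bernoulli(p, q) + 1e-12)))  # True
```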
Proof [Proof of Lemma 39] Denote

$$\eta := d(\tau_1, \tau_2) + \sqrt{\sum_{l=1}^{k_2}\|x_l^+ - y_l\|^2},$$

and then

$$d(\tau_1, \tau_2) \le \eta \quad \text{and} \quad \|x_l^+ - y_l\| \le \eta,\ l = 1, ..., k_2. \quad (35)$$

Therefore, there exist $A \in O_{K^+}$ and $b \in \mathbb{R}^{K^+}$ such that

$$\sqrt{\|Ax^+ + b - y\|^2 + \sum_{l=1}^{k_2}\|Ax_l^+ + b - y_l\|^2} \le \eta,$$

which leads to

$$\|Ax^+ + b - y\| \le \eta \quad (36)$$

and

$$\|Ax_l^+ + b - y_l\| \le \eta,\ l = 1, ..., k_2. \quad (37)$$

Combining (35) and (37), we get $\|Ax_l^+ + b - x_l^+\| \le 2\eta$, $l = 1, ..., k_2$. According to condition A2, $x_1, ..., x_{k_2}$ affinely span $\mathbb{R}^K$. Then there exist $\alpha_1, ..., \alpha_{k_2}$ satisfying $\sum_{l=1}^{k_2}\alpha_l = 1$ such that $x = \sum_{l=1}^{k_2}\alpha_l x_l$. So we have

$$\|Ax^+ + b - x^+\| = \Big\|\sum_{l=1}^{k_2}\alpha_l(Ax_l^+ + b - x_l^+)\Big\| \le \Big(2\sum_{j=1}^{k_2}|\alpha_j|\Big)\eta.$$

Combining this with (36), we have

$$\|x^+ - y\| \le \Big(2\sum_{j=1}^{k_2}|\alpha_j| + 1\Big)\eta.$$

We complete the proof by setting the constant $c$ in Lemma 39 to be

$$c := \max_{x \in B_0^K(M),\ x_l \in G_l^*(\epsilon)}\ \inf_{\sum_l \alpha_l = 1,\ x = \sum_l \alpha_l x_l}\Big(2\sum_l|\alpha_l| + 1\Big).$$

Proof [Proof of Lemma 40] For $i, j \in A_{F_{N,J}^*}$, if $\vartheta_i^* = \vartheta_j^*$, say both equal to $k$, then $\|\mu_{\hat\vartheta_i} - F_{N,J}^*(b_k^*)\| < d^*/2$ and $\|\mu_{\hat\vartheta_j} - F_{N,J}^*(b_k^*)\| < d^*/2$. Given the condition in Lemma 40, it is easy to check that there is only one $\mu_l$ among $\{\mu_1, ..., \mu_{k_1}\}$ satisfying $\|\mu_l - F_{N,J}^*(b_k^*)\| < d^*/2$, so $\hat\vartheta_i = \hat\vartheta_j$. If $\vartheta_i^* \ne \vartheta_j^*$, then $\|\mu_{\hat\vartheta_i} - F_{N,J}^*(b_{\vartheta_i^*}^*)\| < d^*/2$ and $\|\mu_{\hat\vartheta_j} - F_{N,J}^*(b_{\vartheta_j^*}^*)\| < d^*/2$. So

$$\|\mu_{\hat\vartheta_i} - \mu_{\hat\vartheta_j}\| \ge \|F_{N,J}^*(b_{\vartheta_i^*}^*) - F_{N,J}^*(b_{\vartheta_j^*}^*)\| - \|\mu_{\hat\vartheta_i} - F_{N,J}^*(b_{\vartheta_i^*}^*)\| - \|\mu_{\hat\vartheta_j} - F_{N,J}^*(b_{\vartheta_j^*}^*)\| > d^* - \frac{d^*}{2} - \frac{d^*}{2} = 0,$$

which means $\hat\vartheta_i \ne \hat\vartheta_j$. So there exists $\zeta_2$ such that $\hat\vartheta_i = \zeta_2(\vartheta_i^*)$ for $i \in A_{F_{N,J}^*}$.

Proof [Proof of Lemma 41] Let

$$\Gamma_{N,J}^{(\epsilon')} := \Big\{\omega : \frac{1}{N}\sum_{i \in A_{F_{N,J}^*}} 1 \ge 1 - \epsilon'\Big\},$$

which is a subset of the whole probability space. By (34), for any $\epsilon' > 0$, we have $\lim_{N,J\to\infty}\Pr(\Gamma_{N,J}^{(\epsilon')}) = 1$. For any $\omega \notin \Omega_{N,J}$, there exists $l$ such that $\|\mu_m(\omega) - F_{N,J}^*(b_l^*)\| \ge d^*/2$ for all $m = 1, ..., k_1$. So, for $i$ satisfying $\vartheta_i^* = l$, we have $\|\mu_{\hat\vartheta_i}(\omega) - F_{N,J}^*(b_{\vartheta_i^*}^*)\| \ge d^*/2$. According to requirement (2) of condition A3, for sufficiently small $\epsilon'$, if $N, J$ are sufficiently large, then $\omega \notin \Gamma_{N,J}^{(\epsilon')}$, which means $\Gamma_{N,J}^{(\epsilon')} \subset \Omega_{N,J}$. Hence, for sufficiently small $\epsilon'$,

$$\lim_{N,J\to\infty}\Pr(\Omega_{N,J}) \ge \lim_{N,J\to\infty}\Pr(\Gamma_{N,J}^{(\epsilon')}) = 1.$$

We complete the proof.

Figure 10: Analysis of movie rating data: Simultaneous visualization of the estimated movie and user points.

Appendix C. Algorithm-based MDU: Real Data Examples

To compare the proposed method with classical algorithm-based MDU methods, we apply ordinal MDU (Busing et al., 2005) to both real datasets analyzed in the paper. The application is based on the implementation in the R package smacof (de Leeuw and Mair, 2009). For both examples, the latent dimension is set to two, and all the tuning parameters are set to their default values. The results below show that the ordinal MDU approach provides visualization results similar to those of the proposed method, especially for the roll call voting data, due to its unidimensional nature. The results for the movie rating dataset are also similar for the two methods, but the interpretable patterns from the ordinal MDU approach are not as clear as those from the proposed method.

Figures 10 through 12 show the same plots as in Figures 4 through 6 in Section 5.1, respectively, for the movie rating dataset. Figure 10 provides the simultaneous visualization of the movie and user points. Similar to the plot in Figure 4 given by our method, the movies and the users tend to form two large clusters that only slightly overlap. Figure 11 is similar to Figure 5, where the two panels show the same scatter plot of the movie points. In the left panel, the movies are stratified by the numbers of ratings that they received, with the different strata marked by different colors. In the right panel, the movies are stratified by their release time. Recall that the patterns of popularity and release time are captured by the proposed method, as shown in Figure 5. Figure 11 also seems to capture these patterns, though not as clearly as Figure 5.

Figure 11: Analysis of movie rating data.
Panel (a): Visualization of movie points, with movies stratified into four equal-size categories based on the numbers of ratings. Movies with numbers of ratings less than 127, between 128 and 169, between 170 and 229, and more than 230 are indicated by black, red, green, and blue points, respectively. Panel (b): Visualization of movie points, with movies stratified into three categories based on their release time. Movies released in 1997-1998, in 1995-1996, and before 1995 are indicated by green, red, and black points, respectively.

According to panel (a) of Figure 11, the more popular movies tend to be located near the origin, while the less popular movies tend to be located away from the origin. According to panel (b) of Figure 11, the clustering pattern of the movies can be largely explained by the three categories of release dates. From the left to the right of the space, the points correspond to movies from the relatively older ones to the relatively more recent ones.

Figure 12 shows the same plots as in Figure 6, and a similar pattern appears: the shorter the average distance from a user point to the movie points, the more active the user is. In Figure 12, users are classified into four equal-size groups depending on the numbers of movies they rated. These groups of users, from the most active one to the least active one, lie from the top left to the bottom right.

Figure 12: Analysis of movie rating data: Visualization of user points, with users classified into four equal-size categories based on the numbers of ratings. Users who rated at most 24, 25-47, 48-103, and at least 104 movies are indicated by black, red, green, and blue points, respectively.

Figures 13 through 15 show the same plots as in Figures 7 through 9 in Section 5.2, respectively, for the roll call voting dataset. Figure 13 provides the simultaneous visualization of senators and roll calls. Similar to the plot in Figure 7, most of the points tend to lie on a straight line.

Figure 13: Analysis of senator roll call data: Simultaneous visualization of the estimated senator and roll call ideal points.

Figure 14 provides a scatter plot of the senator points. Similar to Figure 8, most of the senator points tend to lie around a straight line, with the Democrats on one side and the Republicans on the other side. Also similar to Figure 8, the independent senator, Jim Jeffords from the state of Vermont, is mixed together with the Democrats, while the Democratic senator, Zell Miller from the state of Georgia, is mixed together with the Republicans.

Figure 14: Analysis of senator roll call data: Visualization of senator points, where senators are classified by their party membership. Specifically, the Democrats, Republicans, and an independent politician are indicated by blue, red, and green points, respectively.

Finally, Figure 15 shows the unfolding results for the roll calls. The pattern in panel (a) of Figure 15 is similar to that of Figure 9, where, from the right to the left, the proportion of "Yea" votes from the Republicans increases. Also similar to Figure 9, although most of the roll calls lie near the x-axis, there are still quite a few of them spreading out along the y-axis. According to panel (b) of Figure 15, based on the cross-entropy measure, the voting behavior on these roll calls tends to be heterogeneous within both parties. This result is similar to that given in panel (b) of Figure 9.
Figure 15: Analysis of senator roll call data. Panel (a): Visualization of roll call points, where roll calls are classified by the proportion of "Yea" votes from Republicans. Specifically, roll calls with proportions less than 0.068, between 0.068 and 0.52, between 0.52 and 0.73, and larger than 0.73 are indicated by black, red, green, and blue points, respectively. Panel (b): Box plots of $\min\{CE_j^{(1)}, CE_j^{(2)}\}$, for roll calls lying near the x-axis ($|\hat a_{j2}| \le 0.05$) on the left and for those spreading out along the y-axis ($|\hat a_{j2}| > 0.05$) on the right.

References

Abdo Y. Alfakih. On the uniqueness of Euclidean distance matrix completions: the case of points in general position. Linear Algebra and its Applications, 397:265-277, 2005.
Francis R. Bach. Consistency of trace norm minimization. Journal of Machine Learning Research, 9:1019-1048, 2008.
Jushan Bai and Serena Ng. Determining the number of factors in approximate factor models. Econometrica, 70:191-221, 2002.
Ryan Bakker and Keith T. Poole. Bayesian metric multidimensional scaling. Political Analysis, 21:125-140, 2013.
David J. Bartholomew, Martin Knott, and Irini Moustaki. Latent variable models and factor analysis: A unified approach. John Wiley & Sons, West Sussex, UK, 2011.
Joseph F. Bennett. Determination of the number of independent parameters of a score matrix from the examination of rank orders. Psychometrika, 21:383-393, 1956.
Joseph F. Bennett and William L. Hays. Multidimensional unfolding: Determining the dimensionality of ranked preference data. Psychometrika, 25:27-43, 1960.
Ingwer Borg and Patrick J. F. Groenen. Modern multidimensional scaling: Theory and applications. Springer, New York, NY, 2005.
Frank M. T. A. Busing, Patrick J. K. Groenen, and Willem J. Heiser. Avoiding degeneracy in multidimensional unfolding by penalizing on the coefficient of variation. Psychometrika, 70:71-98, 2005.
Tony Cai and Wen-Xin Zhou. A max-norm constrained minimization approach to 1-bit matrix completion. Journal of Machine Learning Research, 14:3619-3647, 2013.
Kehui Chen and Jing Lei. Network cross-validation for determining the number of communities in network data. Journal of the American Statistical Association, 113:241-251, 2018.
Lisha Chen and Andreas Buja. Local multidimensional scaling for nonlinear dimension reduction, graph drawing, and proximity analysis. Journal of the American Statistical Association, 104:209-219, 2009.
Yunxiao Chen, Xiaoou Li, and Siliang Zhang. Structured latent factor analysis for large-scale data: Identifiability, estimability, and their implications. Journal of the American Statistical Association, 115:1756-1770, 2020.
Yuxin Chen, Jianqing Fan, Cong Ma, and Yuling Yan. Inference and uncertainty quantification for noisy matrix completion. Proceedings of the National Academy of Sciences, 116:22931-22937, 2019.
Ward Cheney and David Kincaid. Linear algebra: Theory and applications. The Australian Mathematical Society, 110:544-550, 2009.
David Choi and Patrick J. Wolfe. Co-clustering separately exchangeable network data. The Annals of Statistics, 42:29-63, 2014.
Joshua Clinton, Simon Jackman, and Douglas Rivers. The statistical analysis of roll call data. American Political Science Review, 98:355-370, 2004a.
Joshua D. Clinton, Simon Jackman, and Doug Rivers. "The most liberal senator"? Analyzing and interpreting congressional roll calls. PS: Political Science & Politics, 37:805-811, 2004b.
Clyde H. Coombs. A theory of data. Wiley, New York, NY, 1964.
Frank Critchley. On certain linear mappings between inner-product and squared-distance matrices. Linear Algebra and its Applications, 105:91-107, 1988.
Mark A. Davenport, Yaniv Plan, Ewout van den Berg, and Mary Wootters. 1-bit matrix completion. Information and Inference: A Journal of the IMA, 3:189-223, 2014.
Jan de Leeuw and Patrick Mair. Multidimensional scaling using majorization: SMACOF in R. Journal of Statistical Software, 31:1-30, 2009. URL http://www.jstatsoft.org/v31/i03/.
Wayne S. De Sarbo and Donna L. Hoffman. Constructing MDS joint spaces from binary choice data: A multidimensional unfolding threshold model for marketing research. Journal of Marketing Research, 24:40-54, 1987.
Wayne S. De Sarbo, Martin R. Young, and Arvind Rangaswamy. A parametric multidimensional unfolding procedure for incomplete nonmetric preference/choice set data in marketing research. Journal of Marketing Research, 34:499-516, 1997.
Inderjit S. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. In Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining, pages 269-274, 2001.
David Donoho and Victoria Stodden. When does non-negative matrix factorization give a correct decomposition into parts? In S. Thrun, L. K. Saul, and B. Schölkopf, editors, Advances in neural information processing systems, pages 1141-1148. MIT Press, Cambridge, MA, 2004.
Susan E. Embretson and Steven P. Reise. Item response theory. Psychology Press, Hove, UK, 2000.
Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning. Springer, New York, NY, 2001.
Albert Gifi. Nonlinear multivariate analysis. Wiley, New York, NY, 1990.
Michael J. Greenacre and Michael W. Browne. An efficient alternating least-squares algorithm to perform multidimensional unfolding. Psychometrika, 51:241-250, 1986.
F. Maxwell Harper and Joseph A. Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5:1-19, 2016.
John A. Hartigan. Direct clustering of a data matrix. Journal of the American Statistical Association, 67:123-129, 1972.
William L. Hays and Joseph F. Bennett. Multidimensional unfolding: Determining configuration from complete rank order preference data. Psychometrika, 26:221-238, 1961.
Melvin J. Hinich. A new method for statistical multidimensional unfolding. Communications in Statistics - Theory and Methods, 34:2299-2310, 2005.
Ying Ho, Yuho Chung, and Kin-nam Lau. Unfolding large-scale marketing data. International Journal of Research in Marketing, 27:119-132, 2010.
Paul W. Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5:109-137, 1983.
Patrik O. Hoyer. Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research, 5:1457-1469, 2004.
Joseph B. Kruskal. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29:1-27, 1964.
Joseph B. Kruskal and Myron Wish. Multidimensional scaling. Sage, Beverly Hills, CA, 1978.
Brigitte Le Roux and Henry Rouanet. Multiple correspondence analysis. Sage, Newbury Park, CA, 2010.
Tianxi Li, Elizaveta Levina, and Ji Zhu. Network cross-validation by edge sampling. Biometrika, 107:257-276, 2020.
Fan Lu, Sündüz Keleş, Stephen J. Wright, and Grace Wahba. Framework for kernel regularization with application to protein clustering. Proceedings of the National Academy of Sciences, 102:12332-12337, 2005.
James MacQueen. Some methods for classification and analysis of multivariate observations. In Lucien M. Le Cam and Jerzy Neyman, editors, Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, pages 281-297. University of California Press, Berkeley, CA, 1967.
Patrick Mair, Jan de Leeuw, and Marcus Wurzer. Multidimensional unfolding. In Wiley StatsRef: Statistics Reference Online. Wiley, New York, NY, 2015.
Patrick Mair, Ingwer Borg, and Thomas Rusch. Goodness-of-fit assessment in multidimensional scaling and unfolding. Multivariate Behavioral Research, 51:772-789, 2016.
Peter J. Olver. Classical invariant theory. Cambridge University Press, Cambridge, UK, 1999.
Megan H. Papesh and Stephen D. Goldinger. A multidimensional scaling analysis of own- and cross-race face spaces. Cognition, 116:283-288, 2010.
Keith T. Poole. Nonparametric unfolding of binary choice data. Political Analysis, 8:211-237, 2000.
Keith T. Poole. Spatial models of parliamentary voting. Cambridge University Press, Cambridge, UK, 2005.
Keith T. Poole and Howard Rosenthal. Patterns of congressional voting. American Journal of Political Science, 35:228-278, 1991.
Keith T. Poole, Howard Rosenthal, and Kenneth Koford. On dimensionalizing roll call votes in the US Congress. American Political Science Review, 85:955-976, 1991.
Sophia Rabe-Hesketh and Anders Skrondal. Generalized latent variable modeling: Multilevel, longitudinal, and structural equation models. Chapman and Hall/CRC, New York, NY, 2004.
Mark D. Reckase. Multidimensional item response theory. Springer, New York, NY, 2009.
Karl Rohe, Tai Qin, and Bin Yu. Co-clustering directed graphs to discover asymmetries and directional communities. Proceedings of the National Academy of Sciences, 113:12679-12684, 2016.
Fumiko Samejima. Graded response model. In Wim J. van der Linden and Ronald K. Hambleton, editors, Handbook of modern item response theory, pages 85-100. Springer, New York, NY, 1997.
Yoshio Takane, Forrest W. Young, and Jan de Leeuw. Nonmetric individual differences multidimensional scaling: An alternating least squares method with optimal scaling features. Psychometrika, 42:7-67, 1977.
Joshua B. Tenenbaum, Vin De Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319-2323, 2000.
Katrijn Van Deun, Willem J. Heiser, and Luc Delbeke. Multidimensional unfolding by nonmetric multidimensional scaling of Spearman distances in the extended permutation polytope. Multivariate Behavioral Research, 42:103-132, 2007.
Dong Xia and Ming Yuan. Statistical inferences of linear forms for noisy matrix completion. arXiv preprint arXiv:1909.00116, 2019.
Luwan Zhang, Grace Wahba, and Ming Yuan. Distance shrinkage and Euclidean embedding via regularized kernel estimation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78:849-867, 2016.