# Latent Smooth Skeleton Embedding

Li Wang, Department of Mathematics, Statistics and Computer Science, University of Illinois at Chicago, liwang8@uic.edu
Qi Mao, HERE Company, qimao.here@gmail.com
Ivor W. Tsang, Centre for Artificial Intelligence, University of Technology Sydney, Ivor.Tsang@uts.edu.au

Learning a smooth skeleton in a low-dimensional space from noisy data has become important in computer vision and computational biology. Existing methods assume that the manifold constructed from the data is smooth, but they lack the ability to model skeleton structures from noisy data. To overcome this issue, we propose a novel probabilistic structured learning model that learns the density of the latent embedding given high-dimensional data and its neighborhood graph. The embedded points that form a smooth skeleton structure are obtained by maximum a posteriori (MAP) estimation. Our analysis shows that the resulting similarity matrix is sparse and unique, and that the eigenvalues of its associated kernel follow a power law distribution, which leads to embeddings with a smooth skeleton. The model is further extended to learn a sparse similarity matrix when the graph structure is unknown. Extensive experiments demonstrate the effectiveness of the proposed methods on various datasets in comparison with existing methods.

In many fields of science and engineering, one is often confronted with the problem of dimensionality reduction (Burges 2009; Van der Maaten, Postma, and van den Herik 2009). The problem aims to extract low-dimensional structures from high-dimensional datasets, which are generally characterized by far fewer degrees of freedom than the actual number of features. In this paper, we are particularly interested in unveiling a smooth skeleton structure in a latent space from noisy data. Figure 1 illustrates an intuitive example in which synthetic data points are drawn from a smooth circle with noise in two-dimensional space. It is challenging to recover the circle (Figures 1(c) and 1(d)) from the noisy data without any prior knowledge of the structure.

Figure 1: A synthetic example illustrating the motivation for unveiling the smooth skeleton structure from noisy data. The data is drawn from a circle with added noise, and each point is colored for the purpose of illustration. Panels: (a) Original data, (b) MVU, (c) PSL-ℓ2, (d) PSL-ℓ1. Our two proposed methods, PSL-ℓ2 and PSL-ℓ1, are compared with MVU.

Datasets with a smooth skeleton structure have become widely accessible in computer vision (Weinberger and Saul 2006) and computational biology (Curtis et al. 2012). In the study of human cancer, a widely accepted hypothesis is that human cancer is a dynamic disease that develops over an extended period through the accumulation of genetic alterations (Greaves and Maley 2012). The evolution trajectories of tumor persistence, growth, and ultimately metastasis are complex and branching (Greaves and Maley 2012). Massive molecular profile data from excised tumor tissue makes it feasible to uncover the branching architecture of cancer evolution (Mao et al. 2015). However, learning a smooth branching structure embedded in a low-dimensional space from high-dimensional noisy datasets poses a great challenge.
Existing methods mostly rely on distances (or similarities) to model the intrinsic structure of data. They either take a similarity matrix as a prior (Belkin and Niyogi 2001; Schölkopf, Smola, and Müller 1999), learn a similarity measurement from a subset of distances in a local region (Elhamifar and Vidal 2011; Saul and Roweis 2003), or directly learn a kernel matrix from data (Weinberger, Packer, and Saul 2005; Xiao, Sun, and Boyd 2006; Mao and Tsang 2010). These distances become unreliable when the data is noisy. Moreover, such methods lack the ability to model a smooth skeleton from noisy data. As shown in Figure 1, the strict distance preservation in maximum variance unfolding (MVU) (Weinberger, Sha, and Saul 2004) fails to capture the smooth circle from the data (see Figure 1(b)).

We aim to learn a smooth skeleton from noisy data. To achieve this goal, we first express the distances of embedded points as expected distances with respect to an unknown density. We then propose a novel probabilistic structured learning model to learn the density of the latent embedding variables given high-dimensional data. The main contributions of this paper are:

1) By directly modeling the unknown density of the embedding latent variables, the proposed model can be considered a probabilistic version of MVU. The embedded points are obtained by MAP estimation.
2) Our duality analysis shows that the unknown density has an analytic solution in the form of a sparse similarity matrix, or equivalently a regularized Laplacian kernel (Smola and Kondor 2003), and the eigenvalues of the kernel learned by the proposed model follow a power law distribution, which leads to a smooth skeleton of embedded points; the duality view of MVU cannot achieve this.
3) The proposed model possesses a variety of advantageous properties from probabilistic and discriminative viewpoints, including the robust representation of expected distances, easy extension for error tolerance, model selection of neighborhood structures, and the global optimum of the resulting convex optimization problems.
4) We further extend the proposed model to two settings: a neighborhood graph is given a priori but the distances are noisy, and the graph is unknown. An efficient alternating direction method of multipliers (ADMM) is proposed to handle an optimization problem that generalizes both cases.

## Related Work

Let Y = {y_i}_{i=1}^N be a set of N data points with y_i ∈ R^D. MVU aims to find a set of embedded points X = {x_i}_{i=1}^N with x_i ∈ R^d and d < D such that the variance of the embedded points is maximized subject to constraints that preserve distances between nearby data points (Weinberger, Sha, and Saul 2004). MVU consists of three steps. The first step is to compute the k-nearest neighbors N_i of each data point y_i, ∀i. Let φ_{i,j} = ||y_i - y_j||^2 and D_{i,j} = ||x_i - x_j||^2. The second step is to solve the optimization problem

max_X  Σ_{i=1}^N ||x_i||^2   s.t.   Σ_{i=1}^N x_i = 0,   D_{i,j} = φ_{i,j}, ∀i, j ∈ N_i,   (1)

where the first constraint eliminates the translational degree of freedom of the embedded points by centering them at the origin, and the remaining constraints preserve distances between k-nearest neighbors. Instead of optimizing X, MVU reformulates (1) as a semidefinite program by learning a kernel matrix K with (i,j)th element K_{i,j} = ⟨x_i, x_j⟩ under the semidefinite constraint K ⪰ 0 required for a valid kernel (Schölkopf and Smola 2001). We have D_{i,j} = ||x_i - x_j||^2 = K_{i,i} + K_{j,j} - 2K_{i,j}.
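As a concrete illustration of these quantities, the following sketch (not the authors' code; the helper names are hypothetical) computes the squared distances φ_{i,j} over a k-nearest neighbor graph and the kernel-distance identity D_{i,j} = K_{i,i} + K_{j,j} - 2K_{i,j} that underlies the SDP reformulation below.

```python
# A minimal sketch (not the paper's implementation) of the quantities MVU
# constrains: squared distances phi over a k-NN graph, and the identity
# D_ij = K_ii + K_jj - 2 K_ij that lets MVU optimize a kernel K instead of X.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def knn_constraints(Y, k=5):
    """Squared Euclidean distances phi and the k-NN edge list of Y (N x D)."""
    phi = squareform(pdist(Y, metric="sqeuclidean"))   # phi[i, j] = ||y_i - y_j||^2
    order = np.argsort(phi, axis=1)                    # column 0 is the point itself
    edges = [(i, int(j)) for i in range(len(Y)) for j in order[i, 1:k + 1]]
    return phi, edges

def kernel_distances(K):
    """Pairwise squared distances implied by a kernel K: K_ii + K_jj - 2 K_ij."""
    diag = np.diag(K)
    return diag[:, None] + diag[None, :] - 2.0 * K
```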
The resulting problem is given by

max_K  Tr(K)   s.t.   Σ_{i,j} K_{i,j} = 0,   K ⪰ 0,   D_{i,j} = φ_{i,j}, ∀i, j ∈ N_i,

where Σ_{i,j} K_{i,j} = ⟨Σ_{i=1}^N x_i, Σ_{j=1}^N x_j⟩ = 0 is a relaxed form of the centering constraint for ease of kernelization. The final step is to obtain the embedding X by applying KPCA (Schölkopf, Smola, and Müller 1999) to the optimal K.

A duality view of the MVU problem has been studied in (Xiao, Sun, and Boyd 2006). Define E_{i,j} as an N × N matrix with only four nonzero elements: E_{i,j}[i,i] = E_{i,j}[j,j] = 1 and E_{i,j}[i,j] = E_{i,j}[j,i] = -1. The distance-preserving constraints can then be rewritten as Tr(K E_{i,j}) = φ_{i,j}, ∀i, j ∈ N_i. Thus, the dual MVU problem is

min_W  Σ_{i, j∈N_i} w_{i,j} φ_{i,j}   s.t.   λ_{N-1}(L) ≥ 1,   L = Σ_{i, j∈N_i} w_{i,j} E_{i,j},   (2)

where w_{i,j} is the dual variable associated with the distance-preserving constraint on edge (i,j), and λ_{N-1} denotes the second smallest eigenvalue of a symmetric matrix.

## Expected Distance Preserving

Even though MVU has been successfully applied to a number of datasets, two challenging problems remain. 1) Distances computed over noisy data are unreliable for modeling embedded points, and the manifold constructed over these distances may not directly reflect a smooth skeleton structure; it is therefore challenging to learn a smooth skeleton structure from noisy, high-dimensional data. 2) A k-nearest neighbor graph might not be adequate for modeling points whose variances differ across regions (Elhamifar and Vidal 2011); it is important to learn the neighborhood graph automatically so as to better approximate the true structure. We resolve these issues by proposing a novel probabilistic model, which can be viewed as a probabilistic version of MVU.

## Proposed Probabilistic Model

As the embedding X is the variable of interest, we treat it as a random variable. Let X = [x_1, ..., x_N]^T ∈ R^{N×d} be the matrix of embedded points. Define p(X) as the density of embedded points, and assume that the bases of the embedding space are independent, so that p(X) = Π_{k=1}^d p(f_k), where f_k ∈ R^N denotes the kth column of X, i.e., the kth embedding dimension. This assumption is commonly used in spectral methods. We can now reformulate the distance between embedded points x_i and x_j as the expected distance with respect to the density p(X),

E[||x_i - x_j||^2] = Σ_{k=1}^d ∫ (f_{i,k} - f_{j,k})^2 p(f_k) df_k,

where the equality holds because ||x_i - x_j||^2 = Σ_{k=1}^d (f_{i,k} - f_{j,k})^2 and the dimensions are independent. The centering constraint can be reformulated for each reduced dimension as E[Σ_{i=1}^N f_{i,k}] = ∫ 1^T f_k p(f_k) df_k = 0, ∀k. In addition, we assume the prior distribution p_0(X) = Π_{k=1}^d p_0(f_k), where p_0(f_k) is a multivariate normal distribution with zero mean and covariance matrix γ^{-1} I, i.e., f_k ~ N(0, γ^{-1} I).

In order to learn p(X) from data Y = [y_1, ..., y_N] ∈ R^{D×N} given a set of expected-distance constraints, the principle of maximum entropy is used. Given a prior distribution and a neighborhood graph, we solve the following optimization problem:

min_{ {p(f_k) ∈ P_k}_{k=1}^d }   Σ_{k=1}^d ∫ p(f_k) log ( p(f_k) / p_0(f_k) ) df_k   (3)
s.t.   E[||x_i - x_j||^2] = φ_{i,j}, ∀i, j ∈ N_i,

where P = ⊗_{k=1}^d P_k is the Cartesian product of d i.i.d. probability spaces and P_k = { p(f_k) : ∫ p(f_k) df_k = 1, p(f_k) ≥ 0 } is the feasible set of density functions.

Proposition 1. Problem (3) has the analytic solution

p(f_k) = N(0, (L + γI)^{-1}), ∀k,   (4)

where the Laplacian matrix L = diag(W1) - W, the (i,j)th element w_{i,j} of W is the dual variable of the corresponding distance-preserving constraint, and the dual optimization problem is convex:

max_W   (d/2) log det(L + γI) - (1/4) ⟨W, Φ⟩   (5)
s.t.   L + γI ≻ 0,   w_{i,j} = 0, ∀i, j ∉ N_i,

where Φ is the distance matrix whose (i,j)th element is φ_{i,j} if j ∈ N_i and 0 otherwise, and ⟨·, ·⟩ is the standard trace inner product.
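To make Proposition 1 concrete, the sketch below builds the Laplacian L = diag(W1) - W from a symmetric weight matrix, evaluates the dual objective, and forms the posterior covariance (L + γI)^{-1}, i.e., the regularized Laplacian kernel of (4). The 1/4 coefficient follows the reconstruction of (5) above, and the function names are illustrative rather than taken from the paper.

```python
# A minimal sketch, under the reconstruction of (5) above, of the objects in
# Proposition 1: the Laplacian L = diag(W 1) - W, the dual objective
# (d/2) log det(L + gamma I) - (1/4) <W, Phi>, and the posterior covariance
# (L + gamma I)^{-1}.  Function names are illustrative, not from the paper.
import numpy as np

def laplacian(W):
    """Graph Laplacian of a symmetric weight matrix W."""
    return np.diag(W.sum(axis=1)) - W

def psl_dual_objective(W, Phi, d, gamma):
    """Dual objective of (3) as written in (5); larger is better."""
    G = laplacian(W) + gamma * np.eye(len(W))
    sign, logdet = np.linalg.slogdet(G)
    assert sign > 0, "L + gamma*I must be positive definite"
    return 0.5 * d * logdet - 0.25 * np.sum(W * Phi)

def posterior_covariance(W, gamma):
    """Covariance of p(f_k) in (4): a regularized Laplacian kernel."""
    return np.linalg.inv(laplacian(W) + gamma * np.eye(len(W)))
```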
According to Proposition 1, the model has several interesting properties. First, the zero-mean constraint in (3) holds automatically, which can be verified from the posterior distribution having zero mean. Second, our model yields a smooth skeleton structure of the embedding, which is discussed in detail below.

## Analyses from Spectrum and Optimization

The objective function contains the log-determinant of L + γI, which can be equivalently written as log det(L + γI) = Σ_{i=1}^N log(λ_i(L) + γ), where λ_i denotes the ith largest eigenvalue of a symmetric matrix. The log-determinant can thus be related to the negative log-likelihood of a power law distribution over the λ_i, p(λ_i) ∝ λ_i^{-θ}, where θ is the power law exponent and γ is a positive term that keeps λ_i + γ > 0. A power law distribution places large values on a small set of eigenvalues, while the remaining eigenvalues stay small. In the proposed model, θ = d/2. This is critically different from the dual MVU problem, where only the second smallest eigenvalue is maximized (Xiao, Sun, and Boyd 2006). The positive term γ represents the sensitivity threshold for the parameters and is used to smooth out the scale-free property (Liu and Ihler 2011). If γ ≫ λ_i, we have log(λ_i + γ) ≈ (1/γ) λ_i + log(γ), so the log-determinant generalizes an ℓ1 regularizer over the eigenvalues. According to this spectral analysis, the proposed model shapes the whole spectrum of the Laplacian matrix to follow a power law distribution. The difference between MVU and our model will be illustrated in the experiments.

From an optimization perspective, the proposed unfolding framework provides a novel approach to learning a sparse similarity matrix W automatically from a set of pairwise distances, and the similarity matrix is intentionally designed so that the embedded points form a smooth skeleton structure because: (i) the proposed formulation (3) learns a posterior distribution of embedded points, and the expected distance of these points with respect to the posterior has much more flexibility than the deterministic distances used in MVU; (ii) the smoothness of the manifold structure is achieved by imposing the expected constraints over an infinite number of candidate embeddings, where distances can vary flexibly and need not be strictly preserved, in contrast to the strictly enforced deterministic constraints of MVU; (iii) the constraints that are not strictly preserved allow the embedded points to move flexibly and form a smooth skeleton in terms of ⟨W, Φ⟩ in (5); in other words, by minimizing ⟨W, Φ⟩, if two points are close on the given graph, their corresponding embedded points are also close.

## Embedding via MAP Estimation

Once W has been obtained, the posterior distribution of the embedding is explicitly represented as p(X) = Π_{k=1}^d p(f_k), which coincides with the matrix normal distribution (Gupta and Nagar 1999) p(F | Y) = MN_{N,d}(0, U, I), where U = (L + γI)^{-1} is the sample-based covariance matrix and can also be interpreted as a regularized Laplacian kernel with regularization parameter γ (Smola and Kondor 2003). Given the posterior distribution, we obtain the point estimate of X by MAP estimation. From a probabilistic point of view, we can therefore apply KPCA to U to obtain the embedded points, analogously to the final step of MVU. We refer to the proposed model as Probabilistic Structured Learning (PSL).
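As a sketch of this embedding step (illustrative code, not the released implementation), one can treat U = (L + γI)^{-1} as a kernel, center it as in KPCA, and keep the top-d eigenvectors scaled by the square roots of their eigenvalues:

```python
# A minimal sketch of the MAP embedding step: apply KPCA to the regularized
# Laplacian kernel U = (L + gamma I)^{-1}.  Names are illustrative.
import numpy as np

def map_embedding(W, gamma, d):
    """Return an N x d embedding from a learned symmetric similarity matrix W."""
    N = len(W)
    L = np.diag(W.sum(axis=1)) - W
    U = np.linalg.inv(L + gamma * np.eye(N))      # regularized Laplacian kernel
    H = np.eye(N) - np.ones((N, N)) / N           # centering matrix
    Uc = H @ U @ H                                # centered kernel, as in KPCA
    vals, vecs = np.linalg.eigh(Uc)               # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:d]              # top-d components
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))
```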
## Latent Smooth Skeleton Embedding

In addition to constraints that strictly preserve distances, we also consider variants that allow the pairwise distances to deviate. One variant tolerates distance errors on the edges of a given neighborhood graph; the other learns the neighborhood graph from data. Both relaxations can be formulated within maximum entropy density estimation with generalized regularization (Dudík, Phillips, and Schapire 2007). Next, we propose these two variants of PSL for learning a smooth skeleton embedding from noisy high-dimensional data.

### Known Neighborhood Structure

Suppose that a neighborhood graph G is known in advance and reliably captures the underlying structure of the data, but the distances φ_{i,j} are unreliable due to noisy samples or features. In this case, we aim to obtain embedded points that preserve the distances up to a penalty on the violation of each pair of points corresponding to an edge in {(i,j) : i, j ∈ N_i}. By introducing a variable ξ_{i,j} for the distance violation on edge (i,j) and a parameter C > 0, PSL with ℓ2 regularization (PSL-ℓ2) is formulated as

min_{ {p(f_k) ∈ P_k}_{k=1}^d, ξ }   Σ_{k=1}^d ∫ p(f_k) log ( p(f_k) / p_0(f_k) ) df_k + C Σ_{i, j∈N_i} ξ_{i,j}^2   (6)
s.t.   E[||x_i - x_j||^2] - φ_{i,j} = ξ_{i,j}, ∀i, j ∈ N_i.

Proposition 2. Problem (6) has the analytic solution (4), where W is obtained by solving

max_W   (d/2) log det(L + γI) - (1/4) ⟨W, Φ⟩ - (1/(2C)) ||W||_F^2   (7)
s.t.   L + γI ≻ 0,   w_{i,j} = 0, ∀i, j ∉ N_i,

where ||W||_F is the Frobenius norm of W and Φ is the matrix whose (i,j)th element is φ_{i,j} if j ∈ N_i and 0 otherwise. The violation can be recovered from the dual variables as ξ_{i,j} = -w_{i,j}/C.

According to Proposition 2, a large C imposes a small violation of pairwise distances. As C → ∞, problem (6) becomes equivalent to (3). Note that MVU was also extended with a tolerance term similar to that in (6) (Weinberger and Saul 2006), but its motivation was only to overcome the lock-up effect that arises when strictly preserving all distances is infeasible, not to handle noisy data.

### Unknown Neighborhood Structure

If the data does not provide a reliable neighborhood structure as a prior, we propose to learn a sparse graph automatically by imposing sparsity on W. To achieve this goal, we modify the constraints in (6) so that the difference between E[||x_i - x_j||^2] and φ_{i,j} is restricted to the range between -βφ_{i,j} and βφ_{i,j}, i.e., the deviation is bounded by the observed distance scaled by β. In other words, the parameter β can be interpreted as the tolerated deviation of an embedding distance from the associated observed distance that we want to preserve. This variant of the MAP unfolding with the above constraints, called PSL with ℓ1 regularization (PSL-ℓ1), is formulated as

min_{ {p(f_k) ∈ P_k}_{k=1}^d }   Σ_{k=1}^d ∫ p(f_k) log ( p(f_k) / p_0(f_k) ) df_k   (8)
s.t.   | E[||x_i - x_j||^2] - φ_{i,j} | ≤ βφ_{i,j}, ∀i, j.

Proposition 3. Problem (8) has the analytic solution (4), where W is obtained by solving

max_W   (d/2) log det(L + γI) - (1/4) ⟨W, Φ⟩ - β ||Φ ∘ W||_1   (9)
s.t.   L + γI ≻ 0,

where ||Φ ∘ W||_1 = Σ_{i=1}^N Σ_{j=1}^N φ_{i,j} |w_{i,j}| and Φ is the matrix whose (i,j)th element is φ_{i,j}.

Proposition 3 shows that the objective function of (9) leads to a sparse similarity matrix because of the ℓ1 regularization: w_{i,j} approaches 0 when φ_{i,j} is large. This is consistent with the intuition that two data points are dissimilar if they are far apart. Problem (9) is similar to the model proposed for sparse structure learning (Lake and Tenenbaum 2010). The key difference is that the latter model imposes a nonnegativity constraint on W, i.e., W ≥ 0, while our model does not.
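For intuition, the sketch below evaluates the two dual objectives: PSL-ℓ2 adds a Frobenius-norm penalty that softens the distance constraints, while PSL-ℓ1 adds the weighted ℓ1 term β||Φ ∘ W||_1 that drives w_{i,j} toward zero whenever φ_{i,j} is large. The coefficients follow the reconstruction of (7) and (9) above, and the names are illustrative.

```python
# A minimal sketch contrasting the two dual objectives; coefficients follow the
# reconstruction of (7) and (9) above and are not taken from released code.
import numpy as np

def _logdet_term(W, gamma, d):
    G = np.diag(W.sum(axis=1)) - W + gamma * np.eye(len(W))
    return 0.5 * d * np.linalg.slogdet(G)[1]

def psl_l2_dual(W, Phi, d, gamma, C):
    """PSL-l2 dual (7): soft distance violations penalized through ||W||_F^2."""
    return (_logdet_term(W, gamma, d)
            - 0.25 * np.sum(W * Phi)
            - np.sum(W ** 2) / (2.0 * C))

def psl_l1_dual(W, Phi, d, gamma, beta):
    """PSL-l1 dual (9): the weighted l1 term shrinks w_ij when phi_ij is large."""
    return (_logdet_term(W, gamma, d)
            - 0.25 * np.sum(W * Phi)
            - beta * np.sum(Phi * np.abs(W)))
```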
In addition, MEU (Lawrence 2012) estimates the graph structure by imposing an ℓ1 regularizer on the kernel matrix, based on the assumption that the original features are independent and identically distributed. In contrast, our model applies a weighted ℓ1 regularizer to W and is derived from the independence of the latent variables, an assumption commonly used in spectral methods. A very recent work (Mao, Wang, and Tsang 2016) can also learn a sparse similarity matrix through a shrinkage effect, whereas our method explicitly controls the deviation of distances through the absolute-difference constraints.

## Optimization Algorithm

We propose to solve the following optimization problem:

min_{W, G}   ⟨W, A⟩ + Ω(W) - log det(G)   (10)
s.t.   G = diag(W1) - W + γI,   G ≻ 0,   W ∈ S,

where S is the set of symmetric matrices. It is easy to see that problem (10) is a generic form of (5), (7), and (9). ADMM (Boyd et al. 2011) is employed to solve (10). Introducing Z ∈ R^{N×N} as the multiplier of the linear matrix equation and τ > 0 as the penalty parameter for the violation of the linear matrix constraint, the augmented Lagrangian of problem (10) is

L_τ(G, W, Z) = ⟨W, A⟩ + Ω(W) - log det(G) - ⟨Z, G - g_γ(W)⟩ + (τ/2) ||G - g_γ(W)||_F^2,

where g_γ(W) = diag(W1) - W + γI. ADMM iteratively solves the following subproblems until convergence:

G^{(t+1)} = argmin_{G ≻ 0}   (τ/2) ||G - Q^{(t)}||_F^2 - log det(G),   (11)
W^{(t+1)} = argmin_{W ∈ S}   ⟨W, A⟩ + Ω(W) + (τ/2) ||g_γ(W) - P^{(t)}||_F^2,   (12)
Z^{(t+1)} = Z^{(t)} - τ (G^{(t+1)} - g_γ(W^{(t+1)})),   (13)

where Q^{(t)} = g_γ(W^{(t)}) + (1/τ) Z^{(t)} and P^{(t)} = G^{(t+1)} - (1/τ) Z^{(t)}. Next, we discuss methods for solving the above subproblems separately. First, problem (11) has an analytic solution, as shown in the following proposition.

Proposition 4. Let the eigendecomposition of Q^{(t)} be Q^{(t)} = V Λ V^T, where V is an orthogonal matrix whose columns are the eigenvectors of Q^{(t)} and Λ = diag([λ_i]) is a diagonal matrix of the corresponding eigenvalues. The optimal solution of (11) is G^{(t+1)} = V diag([λ̃_i]) V^T, where

λ̃_i = ( λ_i + sqrt(λ_i^2 + 4/τ) ) / 2, ∀i.   (14)

Next, we solve subproblem (12). Before detailing the optimization method, we present the following lemma, which reduces the number of variables to be optimized.

Lemma 1. Given A, problem (12) with Ω defined above has w_{i,i} = 0, ∀i, at the optimum.

According to Lemma 1 and the symmetry of W, the optimization problem (12) can be rewritten as a problem over the variables {w_{i,j} : (i,j) ∈ E_S}, where E_S = {(i,j) : ∀i, j < i, j ∈ N_i}. Let w be the vectorized representation of these variables obtained by a mapping function ϕ(i,j) = ℓ that maps an index pair (i,j) ∈ E_S to the ℓth element of w. As a result, problem (12) can be written as min_w h^{(t)}(w) + Ω(w), where

h^{(t)}(W) = ⟨W, A⟩ + (τ/2) ||g_γ(W) - P^{(t)}||_F^2.   (15)

The following proposition shows that the objective function (15) is a quadratic function of w.

Proposition 5. Let w be the vectorized representation of the variables in W with the mapping function ϕ defined above, and let φ and p be the vectorizations of Φ and P^{(t)} under the same mapping, respectively. Minimizing the objective function (15) with respect to W is equivalent to minimizing the quadratic function

h^{(t)}(w) = (τ/2) w^T B w + τ w^T c,

where B = Σ_i b_i b_i^T + 2I, b_i is a binary vector such that, for j ≠ i, b_i[ϕ(i,j)] = 1 if j < i and b_i[ϕ(j,i)] = 1 if j > i, with all remaining elements of b_i equal to 0, and c = 2p + Σ_i (γ - p_{i,i}) b_i + (1/(dτ)) φ.
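The sketch below illustrates the two subproblem solvers under the reconstructions above (illustrative code, not the authors' implementation): the closed-form eigenvalue update of Proposition 4 for (11), and the quadratic objective and gradient of Proposition 5 for (12), with B = Σ_i b_i b_i^T + 2I kept implicit through the sparse vectors b_i.

```python
# A minimal sketch of the ADMM subproblems, assuming the closed form stated in
# Proposition 4 and the quadratic form of Proposition 5; not the paper's code.
import numpy as np

def update_G(Q, tau):
    """Solve (11): min_{G > 0} (tau/2)||G - Q||_F^2 - log det(G)."""
    lam, V = np.linalg.eigh((Q + Q.T) / 2.0)                # eigendecomposition of Q
    lam_new = 0.5 * (lam + np.sqrt(lam ** 2 + 4.0 / tau))   # per-eigenvalue update (14)
    return (V * lam_new) @ V.T

def h_and_grad(w, b_vectors, c, tau):
    """Quadratic h(w) = (tau/2) w^T B w + tau w^T c and its gradient,
    with B = sum_i b_i b_i^T + 2I applied implicitly via the sparse b_i."""
    Bw = 2.0 * w + sum(b * (b @ w) for b in b_vectors)
    return tau * (0.5 * w @ Bw + w @ c), tau * (Bw + c)
```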
According to Proposition 5, we can equivalently transform the constrained optimization problem (12) into an unconstrained quadratic optimization problem with a smaller number of variables. To take full advantage of the structure of B, we use first-order optimization methods (Byrd et al. 1995; Schmidt 2010) to solve problem (12), since the objective h^{(t)}(w) = τ( (1/2) Σ_{i=1}^N (b_i^T w)^2 + ||w||^2 + w^T c ) and its gradient ∇_w h^{(t)}(w) = τ( Σ_{i=1}^N b_i [b_i^T w] + 2w + c ) can be computed very efficiently: they only involve inner products of vectors, and b_i and w are sparse in general.

According to Propositions 4 and 5, the reformulated problem follows the standard form of ADMM, since the positive definite constraint on G holds automatically due to the log-determinant function and the vectorized w is unconstrained. As a result, the stopping criterion discussed in Section 3.3.1 of (Boyd et al. 2011) is used, and the penalty parameter τ is varied adaptively according to the primal and dual residuals as described in Section 3.4.1 of the same work. The convergence property of ADMM proved in (Boyd et al. 2011) therefore readily applies to the proposed algorithm. Our method has computational complexity O(N^3), which is the same as that of most spectral methods but much lower than that of the semidefinite programming used in MVU.

## Experiments

Experiments were conducted on various datasets to evaluate the proposed methods. The first experiment verifies the embedded points by visualizing them in 2-D or 3-D space, while the second evaluates clustering performance on the embedded points.

### Nonlinear Dimensionality Reduction

We evaluated the proposed methods on datasets in which the high-dimensional points were sampled from a low-dimensional skeleton structure. Two datasets were used in this experiment. One is the teapot data (Weinberger, Sha, and Saul 2004), which consists of 400 color images of a teapot viewed from different angles in the plane (rotated through 360 degrees); each image consists of 76 × 101 RGB pixels, i.e., a 23,028-dimensional vector (Weinberger and Saul 2006). The other is the breast cancer data, which contains the expression levels of over 25,000 gene transcripts obtained from 144 normal breast tissue samples and 1,989 tumor tissue samples (Mao et al. 2015).
All data points are color-coded so that the distribution of the embedded points can be inspected. Landmark MVU (ℓMVU) (Weinberger, Packer, and Saul 2005) was evaluated for computational reasons, with the number of landmarks set to 40. As KPCA and MVU have been compared thoroughly in (Weinberger, Sha, and Saul 2004), KPCA is not reported. Since the ℓ2-regularized model generalizes PSL, which is recovered as C approaches infinity, we set C = 10^3. A very broad, spherical Gaussian density with γ = 10^{-5} is used as a surrogate for an uninformative prior. For PSL-ℓ1, β controls the sparsity of the similarity matrix, so our model does not require a preset number of nearest neighbors; we set β = 10^3 to promote the sparsity of W.

Figure 2: Results of ℓMVU, PSL-ℓ2, and PSL-ℓ1 on Teapot and Breast cancer, including the embedded points in 2-D or 3-D space, the eigenvalues of the learned kernel matrix, and the sparse similarity matrix learned by PSL-ℓ2 and PSL-ℓ1. Eigenvalues are plotted in descending order, and the similarity matrix is rearranged in terms of the encoded colors in ascending order.

Figure 2 shows the results of ℓMVU and our proposed methods on Teapot and Breast cancer. First, we observe that all three methods correctly recover the circular structure of the teapot images, but our proposed methods obtain a skeleton of embedded points that is much smoother than that obtained by ℓMVU. The smoothness of the skeleton structure becomes even clearer on the Breast cancer data, since this data tends to be very noisy. The skeleton structures learned by our proposed methods suggest a linear bifurcating progression path, starting from the normal tissue samples and diverging to either the luminal A or the basal subtype; the linear trajectory through luminal A continues to luminal B and on to the HER2+ subtype. This is consistent with the branching architecture of cancer progression (Greaves and Maley 2012). In contrast, ℓMVU does not produce a clear structure. Moreover, the skeleton structure learned by PSL-ℓ1 is slightly better than that learned by PSL-ℓ2, partially because a learned, varying neighborhood suits the Breast cancer data better than a fixed one. We also examine the distribution of eigenvalues: ℓMVU obtained a kernel with few non-zero eigenvalues, while our methods learned a kernel whose eigenvalues follow a power law distribution.
Our methods obtain a sparse similarity matrix, while ℓMVU does not. These observations are in line with our theoretical analysis and imply that the proposed methods can learn smooth skeleton structures of embedded points from noisy data.

### Clustering with Dimensionality Reduction

We evaluated clustering performance on the embedded points obtained by the proposed methods and by existing dimensionality reduction methods. The datasets used in the experiments are listed in Table 1, where the reduced dimension d is set so as to retain 95% of the energy of each dataset after applying PCA. We compared the proposed methods PSL-ℓ2 and PSL-ℓ1 with PCA, KPCA (Schölkopf, Smola, and Müller 1999), LLE (Saul and Roweis 2003), LE (Belkin and Niyogi 2001), ℓMVU (Weinberger, Packer, and Saul 2005), GPLVM (Lawrence 2005), MEU (Lawrence 2012), and SMCE (Elhamifar and Vidal 2011). For methods that take a k-nearest neighbor graph as input, we either obtained k using the standard neighborhood selection strategy (Van der Maaten, Postma, and van den Herik 2009) or tuned k in the range {5, 10, 15, 20, 30, 50, 100} for the best performance. For methods that use a Gaussian kernel, we set the bandwidth parameter to the estimated standard deviation within neighborhoods (Weinberger, Sha, and Saul 2004). The parameter β in PSL-ℓ1 and the regularization parameter in SMCE were tuned in the range {0.1, 1, 10, 10^2, 10^3}. Other parameters used the default settings suggested in drtoolbox (lvdmaaten.github.io/drtoolbox/), MEU, and SMCE. Kmeans on the original data was used as the baseline. To alleviate the non-convexity of Kmeans, we ran it with 20 random initializations and evaluated the clustering with the best objective value in terms of accuracy and normalized mutual information (NMI) (Nie et al. 2009). All methods used the same reduced dimension, and Kmeans was run on the embedded points with the true number of clusters.

Table 1: Clustering results of 11 methods on seven datasets in terms of accuracy and NMI. The best results are in bold.

| Dataset | COIL20 | Isolet | Pendigits | Satimage | USPS | YALE-B | Letter |
|---|---|---|---|---|---|---|---|
| (N, c) | (1440, 20) | (3119, 2) | (3498, 10) | (4435, 6) | (2007, 10) | (2414, 38) | (5000, 26) |
| (D, d) | (1024, 84) | (617, 165) | (16, 9) | (36, 6) | (256, 32) | (1024, 116) | (16, 12) |

Accuracy:

| Method | COIL20 | Isolet | Pendigits | Satimage | USPS | YALE-B | Letter |
|---|---|---|---|---|---|---|---|
| Kmeans | 0.6674 | 0.5633 | 0.6544 | 0.6685 | 0.6153 | 0.1081 | 0.2632 |
| PCA | 0.6674 | 0.5633 | 0.6527 | 0.6681 | 0.6208 | 0.1110 | 0.2634 |
| KPCA | 0.6694 | 0.5643 | 0.7384 | 0.6728 | 0.6313 | 0.1135 | 0.2752 |
| LLE | 0.3493 | 0.5021 | 0.1215 | 0.6510 | 0.1624 | 0.0667 | 0.3084 |
| LE | 0.2035 | 0.5005 | 0.7607 | 0.6870 | 0.3767 | 0.0597 | 0.0634 |
| ℓMVU | 0.5042 | 0.5989 | 0.5692 | 0.6886 | 0.3896 | 0.0684 | 0.1484 |
| GPLVM | 0.6674 | 0.5630 | 0.6527 | 0.6681 | 0.4709 | 0.0671 | 0.2634 |
| MEU | 0.3590 | 0.5476 | 0.7138 | 0.7398 | 0.6129 | 0.3094 | 0.2518 |
| SMCE | 0.3813 | 0.5162 | 0.8333 | 0.7143 | 0.6009 | 0.2933 | 0.2824 |
| PSL-ℓ2 | 0.7181 | **0.6794** | 0.8208 | 0.7132 | 0.6487 | **0.4138** | 0.2816 |
| PSL-ℓ1 | **0.7319** | 0.5973 | **0.8468** | **0.7454** | **0.7309** | 0.3521 | **0.3320** |

NMI:

| Method | COIL20 | Isolet | Pendigits | Satimage | USPS | YALE-B | Letter |
|---|---|---|---|---|---|---|---|
| Kmeans | 0.7845 | 0.0117 | 0.6669 | 0.6097 | 0.5657 | 0.1694 | 0.3621 |
| PCA | 0.7845 | 0.0117 | 0.6627 | 0.6090 | 0.5664 | 0.1781 | 0.3591 |
| KPCA | 0.7893 | 0.0121 | 0.6846 | 0.6110 | 0.5769 | 0.1742 | 0.3664 |
| LLE | 0.3848 | 0.0004 | 0.0059 | 0.5080 | 0.0106 | 0.0810 | 0.3992 |
| LE | 0.2357 | 0.0003 | 0.7554 | 0.6038 | 0.2943 | 0.0515 | 0.0213 |
| ℓMVU | 0.6494 | 0.0292 | 0.6422 | 0.5818 | 0.3858 | 0.0870 | 0.1379 |
| GPLVM | 0.7845 | 0.0116 | 0.6627 | 0.6090 | 0.3824 | 0.0716 | 0.3591 |
| MEU | 0.5067 | 0.0151 | 0.7635 | 0.6826 | 0.5743 | 0.4174 | 0.3341 |
| SMCE | 0.4920 | 0.0165 | 0.8071 | 0.6515 | 0.6176 | 0.4042 | 0.3890 |
| PSL-ℓ2 | 0.8412 | **0.0952** | 0.7999 | 0.6526 | 0.7130 | **0.6094** | 0.3940 |
| PSL-ℓ1 | **0.8517** | 0.0392 | **0.8230** | **0.6953** | **0.7641** | 0.4796 | **0.4482** |

Table 1 shows the clustering results of the 11 methods on seven datasets in terms of accuracy and NMI. We make several observations. First, Kmeans on the embedded points obtained by half of the existing dimensionality reduction methods is not better than PCA, which was also observed in (Van der Maaten, Postma, and van den Herik 2009). Second, the proposed methods consistently obtain robust results on most of the datasets in terms of both accuracy and NMI. Third, methods that learn a sparse graph structure obtain better results than methods that take a fixed neighborhood graph as input; although PSL-ℓ2 does not learn a graph, its robust distance modeling alleviates this issue, so its clustering performance is comparable to that of PSL-ℓ1 and even better than that of SMCE. Lastly, the proposed methods perform much better than probabilistic models such as GPLVM and MEU, owing to the strategy of directly modeling the posterior distribution of embedded points. This is also verified by the superior performance of PSL over MVU, suggesting that a smooth skeleton may be more useful for clustering than merely unfolding the data by preserving distances.

## Conclusion

In this paper, we proposed a novel probabilistic model for nonlinear dimensionality reduction. Unlike MVU, we model the posterior distribution of the embedded points by preserving the expected pairwise distances encoded in a given neighborhood graph. The duality of this model can be interpreted as maximizing the log-likelihood of a power law distribution over the eigenvalues of a sparse dual matrix, which leads to a smooth skeleton.
Two variants of the model are also developed, for noisy samples and for graph structure learning. The resulting problems are convex and can be solved efficiently by ADMM. Extensive experiments demonstrate that the proposed model achieves a smooth skeleton of embedded points and outperforms various existing methods on seven datasets in terms of clustering performance.

## Acknowledgments

Ivor W. Tsang is supported by the ARC Future Fellowship FT130100746 and ARC grant LP150100671.

## References

Belkin, M., and Niyogi, P. 2001. Laplacian eigenmaps and spectral techniques for embedding and clustering. In NIPS, volume 14, 585-591.
Boyd, S.; Parikh, N.; Chu, E.; Peleato, B.; and Eckstein, J. 2011. Distributed optimization and statistical learning via the alternating direction method of multipliers. FTML 3(1):1-122.
Burges, C. J. C. 2009. Dimension reduction: a guided tour. FTML 2(4):275-365.
Byrd, R. H.; Lu, P.; Nocedal, J.; and Zhu, C. 1995. A limited memory algorithm for bound constrained optimization. SIAM J. Sci. Comput. 16(5):1190-1208.
Curtis, C.; Shah, S. P.; Chin, S.; et al. 2012. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature 486(7403):346-352.
Dudík, M.; Phillips, S. J.; and Schapire, R. E. 2007. Maximum entropy density estimation with generalized regularization and an application to species distribution modeling. JMLR 8(6).
Elhamifar, E., and Vidal, R. 2011. Sparse manifold clustering and embedding. In NIPS, 55-63.
Greaves, M., and Maley, C. C. 2012. Clonal evolution in cancer. Nature 481(7381):306-313.
Gupta, A. K., and Nagar, D. K. 1999. Matrix Variate Distributions, volume 104. CRC Press.
Lake, B., and Tenenbaum, J. 2010. Discovering structure by learning sparse graph. In Proceedings of the 33rd Annual Cognitive Science Conference.
Lawrence, N. D. 2005. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. JMLR 6:1783-1816.
Lawrence, N. D. 2012. A unifying probabilistic perspective for spectral dimensionality reduction: Insights and new models. JMLR 13(1):1609-1638.
Liu, Q., and Ihler, A. T. 2011. Learning scale free networks by reweighted ℓ1 regularization. In AISTATS, 40-48.
Mao, Q., and Tsang, I. W. 2010. Parameter-free spectral kernel learning. In UAI.
Mao, Q.; Yang, L.; Wang, L.; Goodison, S.; and Sun, Y. 2015. SimplePPT: A simple principal tree algorithm. In SDM.
Mao, Q.; Wang, L.; and Tsang, I. W. 2016. A unified probabilistic framework for robust manifold learning and embedding. Machine Learning.
Nie, F.; Xu, D.; Tsang, I. W.; and Zhang, C. 2009. Spectral embedded clustering. In IJCAI, 1181-1186.
Saul, L. K., and Roweis, S. T. 2003. Think globally, fit locally: unsupervised learning of low dimensional manifolds. JMLR 4:119-155.
Schmidt, M. 2010. Graphical Model Structure Learning with ℓ1-Regularization. Ph.D. Dissertation, University of British Columbia, Vancouver.
Schölkopf, B., and Smola, A. 2001. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.
Schölkopf, B.; Smola, A.; and Müller, K. 1999. Kernel principal component analysis. In Advances in Kernel Methods - Support Vector Learning, 327-352.
Smola, A. J., and Kondor, R. 2003. Kernels and regularization on graphs. In COLT, 144-158.
Van der Maaten, L.; Postma, E. O.; and van den Herik, H. J. 2009. Dimensionality reduction: A comparative review. Tilburg University Technical Report, TiCC-TR 2009-005.
Weinberger, K. Q., and Saul, L. K. 2006. Unsupervised learning of image manifolds by semidefinite programming. IJCV 70(1):77-90.
Weinberger, K.; Packer, B.; and Saul, L. 2005. Nonlinear dimensionality reduction by semidefinite programming and kernel matrix factorization. In AISTATS, 381-388.
Weinberger, K.; Sha, F.; and Saul, L. 2004. Learning a kernel matrix for nonlinear dimensionality reduction. In ICML, 106.
Xiao, L.; Sun, J.; and Boyd, S. 2006. A duality view of spectral methods for dimensionality reduction. In ICML, 1041-1048. ACM.