# Co-manifold learning with missing data

Gal Mishne (1), Eric C. Chi (2), Ronald R. Coifman (1)

(1) Department of Mathematics, Yale University, New Haven, CT, USA. (2) Department of Statistics, North Carolina State University, Raleigh, NC, USA. Correspondence to: Gal Mishne. Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Representation learning is typically applied to only one mode of a data matrix, either its rows or its columns. Yet in many applications, there is an underlying geometry to both the rows and the columns. We propose utilizing this coupled structure to perform co-manifold learning: uncovering the underlying geometry of both the rows and the columns of a given matrix, with a focus on the missing data setting. Our unsupervised approach consists of three components. We first solve a family of optimization problems to estimate a complete matrix at multiple scales of smoothness. We then use this collection of smooth matrix estimates to compute pairwise distances on the rows and columns based on a new multi-scale metric that implicitly introduces a coupling between the rows and the columns. Finally, we construct row and column representations from these multi-scale metrics. We demonstrate that our approach outperforms competing methods in both data visualization and clustering.

1. Introduction

Dimension reduction plays a key role in exploratory data analysis, data visualization, clustering and classification. Techniques range from classical PCA and nonlinear manifold learning to deep autoencoders (Tenenbaum et al., 2000; Roweis & Saul, 2000; Belkin & Niyogi, 2003; Coifman & Lafon, 2006; Vincent et al., 2008; Rifai et al., 2011; Kingma & Welling, 2014). These techniques focus on only one mode of the data, often the observations (columns), which are measurements in a high-dimensional feature space (rows), and exploit correlations among the features to reduce the dimension of the feature vectors and obtain the underlying low-dimensional geometry of the observations.

Yet for many data matrices, for example in gene expression studies, recommendation systems, sensor networks, and word-document analysis, correlations exist among both observations and features. In these cases, we seek a method that can exploit the correlations among both the rows and the columns of a data matrix to better learn lower-dimensional representations of both. Bi-clustering methods, which extract distinct bi-clusters along both rows and columns, give a partial solution to performing simultaneous dimension reduction on the rows and columns of a data matrix. In certain settings, however, assuming a bi-clustering model is too restrictive and breaks smooth geometries into artificial disjoint clusters that do not match the actual structure of the data. This can occur when the true geometry is one of overlapping rather than disjoint clusters, for example in word-document analysis (Ahn et al., 2010), or when the underlying structure is not one of clusters at all but rather a smooth manifold (Gavish & Coifman, 2012). Thus, we consider a more general viewpoint: data matrices possess geometric relationships between their rows (features) and columns (observations) such that both modes lie on low-dimensional manifolds. Furthermore, the relationships between the rows may be informed by the relationships between the columns, and vice versa.
Several recent papers (Gavish & Coifman, 2012; Ankenman, 2014; Mishne et al., 2016; Shahid et al., 2016; Mishne et al., 2017; Yair et al., 2017) exploit this coupled relationship to co-organize matrices and infer underlying row and column embeddings. Further complicating the story is that such matrices may suffer from missing values, due to measurement corruptions and limitations. Missing values can sabotage efforts to learn the low-dimensional manifold underlying the data. Specifically, kernel-based methods rely on calculating a similarity matrix between observations, whose eigendecomposition yields a new embedding of the data. As the number of missing entries grows, the distances between points are increasingly distorted, resulting in a poor representation of the data in the low-dimensional space (Gilbert & Sonthalia, 2018). Matrix completion algorithms assume the data is low-rank and fill in the missing values by fitting a global linear subspace to the data. Yet this may fail when the data lies on a nonlinear manifold.

Manifold learning in the missing data scenario has been addressed by a few recent papers. Non-linear Principal Component Analysis (NLPCA) (Scholz et al., 2005) uses an autoencoder neural network, where the middle layer serves to learn a low-dimensional embedding of the data, and the trained autoencoder is used to fill in missing values. Missing Data Recovery through Unsupervised Regression (Carreira-Perpiñán & Lu, 2011) first fills in the missing data with linear matrix completion methods, then calculates a non-linear embedding of the data and incorporates this embedding in an optimization problem to fill in the missing values. Recently, Gilbert & Sonthalia (2018) proposed MR-MISSING, which first calculates an initial distance matrix using only non-missing entries and then uses the increase-only metric repair (IOMR) method to fix the distance matrix so that it is a metric, from which they calculate an embedding. None of these methods consider the co-manifold setting, where the coupled structure of the rows and the columns can be used both to fill in the data and to calculate an embedding.

Figure 1: The three components of our approach: (A) smooth estimates of a matrix with missing entries at multiple scales via co-clustering, (B) a multi-scale metric using the smooth estimates across all scales, yielding an affinity kernel between rows/columns, and (C) nonlinear embeddings of the rows and columns. The multi-scale metric between two columns (red and orange) is a weighted Euclidean distance between those columns at multiple scales, given by solving the co-clustering for increasing values of the cost parameters $\gamma_r$ and $\gamma_c$.

In this paper, we introduce a new method for performing joint dimension reduction on the rows and columns of a data matrix, which we term co-manifold learning, in the missing data setting. We build on two recent lines of work: co-organizing the rows and columns of a data matrix (Gavish & Coifman, 2012; Mishne et al., 2016; 2017) and convex optimization methods for performing co-clustering (Chi et al., 2017; 2018). The former provides a flexible framework for jointly organizing rows and columns but lacks algorithmic convergence guarantees. The latter provides convergence guarantees but does not take full advantage of the multiple scales of the data revealed in the solution.
In the first stage of our approach, rather than inferring bi-clusters at a single scale, we use a multi-scale optimization framework to fill in the data at fine to coarse scales while imposing smoothness on both the rows and the columns. The scales of the solutions are encoded in a pair of joint cost parameters along the rows and columns. Next, we define a new multi-scale metric based on the filled-in matrix across all scales, which we then use to calculate nonlinear embeddings of the rows and columns. Thus our approach yields three results: a collection of smoothed estimates of the matrix, pairwise distances on the rows and columns that better estimate the geometry of the complete data matrix, and corresponding nonlinear embeddings (Fig. 1). We demonstrate in experimental results that our method reveals meaningful representations in coupled datasets with missing entries, whereas other methods are capable of revealing a meaningful representation along only one of the modes.

The paper is organized as follows. We present the optimization framework in Section 2, the new multi-scale metric for co-manifold learning in Section 3, and experimental results in Section 4.

2. Co-clustering an Incomplete Data Matrix

We seek a collection of complete matrix approximations of a partially observed data matrix $X \in \mathbb{R}^{m \times n}$ that have been smoothed along their rows and columns to varying degrees. This collection will serve in computing row and column multi-scale distances that better estimate the pairwise distances of the complete data matrix. Let $[m]$ denote the set of indices $\{1, \ldots, m\}$, let $\Theta \subseteq [m] \times [n]$ be the subset of indices that correspond to observed entries of $X$, and let $P_\Theta$ denote the projection operator of $m \times n$ matrices onto an index set $\Theta$, i.e. $[P_\Theta(X)]_{ij}$ is $x_{ij}$ if $(i,j) \in \Theta$ and is 0 otherwise. We seek a minimizer $U(\gamma_r, \gamma_c)$ of the function

$$\frac{1}{2} \| P_\Theta(X - U) \|_F^2 + \gamma_r J_r(U) + \gamma_c J_c(U). \qquad (1)$$

The quadratic term quantifies how well $U$ approximates $X$ on the observed entries, while the two roughness penalties, $J_r(U)$ and $J_c(U)$, incentivize smoothness across the rows and columns of $U$. The nonnegative parameters $\gamma_r$ and $\gamma_c$ tune the tradeoff between how well $U$ agrees with $X$ over $\Theta$ and how smooth $U$ is with respect to its rows and columns. By tuning $\gamma_r$ and $\gamma_c$, we obtain estimates of $X$ at varying scales of row and column smoothness. We use roughness penalties of the following form:

$$J_r(U) = \sum_{(i,j) \in E_r} \Omega\big(\| U_{i\cdot} - U_{j\cdot} \|_2\big), \qquad J_c(U) = \sum_{(i,j) \in E_c} \Omega\big(\| U_{\cdot i} - U_{\cdot j} \|_2\big), \qquad (2)$$

where $U_{i\cdot}$ ($U_{\cdot i}$) denotes the $i$th row (column) of the matrix $U$. The index sets $E_r$ and $E_c$ denote the edge sets of row and column graphs that encode a preliminary data-driven assessment of the similarities between rows and columns of the data matrix. The function $\Omega$, which maps $[0, \infty)$ into $[0, \infty)$, will be explained shortly.

Variations on the optimization problem of minimizing (1) have been proposed previously in the literature. When there is no missing data, i.e. $\Theta = [m] \times [n]$, and $\Omega$ is a linear mapping, minimizing the objective in (1) produces the convex bi-clustering problem (Chi et al., 2017). Additionally, if either $\gamma_r$ or $\gamma_c$ is zero, then we obtain convex clustering (Pelckmans et al., 2005; Hocking et al., 2011; Lindsten et al., 2011; Chi & Lange, 2015). If we take $\Omega$ to be a nonlinear concave function, problem (1) reduces to an instance of concave penalized regression-based clustering (Pan et al., 2013; Marchetti & Zhou, 2014; Wu et al., 2016).
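To make the objective concrete, the following is a minimal sketch (not the authors' code) of evaluating (1) on a candidate matrix $U$. The specific penalty is only fixed later in (3); here we assume the concave closed form $\Omega(z) = \log(1 + z/\epsilon)$, consistent with Assumption 2.2 below, and the edge sets and cost parameters are placeholders.

```python
# Minimal sketch (not the authors' code) of evaluating objective (1).
# Omega is assumed to be the concave choice Omega(z) = log(1 + z/eps).
import numpy as np

def omega(z, eps=1e-12):
    # Concave, increasing, Omega(0) = 0, finite right derivative 1/eps at 0.
    return np.log1p(z / eps)

def objective(U, X, mask, row_edges, col_edges, gamma_r, gamma_c):
    """0.5 * ||P_Theta(X - U)||_F^2 + gamma_r * J_r(U) + gamma_c * J_c(U)."""
    resid = np.where(mask, X - U, 0.0)                      # P_Theta(X - U)
    fit = 0.5 * np.sum(resid ** 2)
    J_r = sum(omega(np.linalg.norm(U[i, :] - U[j, :])) for i, j in row_edges)
    J_c = sum(omega(np.linalg.norm(U[:, i] - U[:, j])) for i, j in col_edges)
    return fit + gamma_r * J_r + gamma_c * J_c
```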
The convergence properties of our co-clustering procedure will rely on the following two assumptions.

Assumption 2.1 The row and column graphs $E_r$ and $E_c$ are connected, i.e. the row graph is connected if for any pair of rows, indexed by $i$ and $j$ with $i \neq j$, there exists a sequence of indices $i \to k \to \cdots \to l \to j$ such that $(i,k), \ldots, (l,j) \in E_r$. A column graph is connected under analogous conditions.

Assumption 2.2 The function $\Omega : [0, \infty) \to [0, \infty)$ (i) is concave and continuously differentiable on $(0, \infty)$, (ii) vanishes at the origin, i.e. $\Omega(0) = 0$, (iii) is increasing on $[0, \infty)$, and (iv) has a finite right directional derivative at the origin.

For concreteness, in the rest of this paper, we use the function

$$\Omega(z) = \int_0^z \frac{1}{\zeta + \epsilon} \, d\zeta, \qquad (3)$$

where $\epsilon$ is a small positive number, e.g. $10^{-12}$. The key feature of $\Omega$ in (3), which satisfies Assumption 2.2, is that when it is used in the roughness penalties (2), small differences between rows and columns are penalized significantly more than larger differences, resulting in more aggressive smoothing of small noisy variations while leaving intact more significant systematic variations. Specifically, penalties built from functions that satisfy Assumption 2.2 are not differentiable at the origin and therefore incentivize sparsity in the differences between rows and between columns. Consequently, small noisy variations between pairs of rows and columns are eliminated completely for sufficiently large $\gamma_r$ and $\gamma_c$. This is in contrast with commonly used quadratic penalties. For example, replacing $J_r(U)$ and $J_c(U)$ by the quadratic row and column penalties

$$\sum_{(i,j) \in E_r} w_{ij} \| U_{i\cdot} - U_{j\cdot} \|_2^2, \qquad \sum_{(i,j) \in E_c} \tilde{w}_{ij} \| U_{\cdot i} - U_{\cdot j} \|_2^2, \qquad (4)$$

gives a version of matrix completion on graphs (Kalofolias et al., 2014; Rao et al., 2015), where $w_{ij}$ and $\tilde{w}_{ij}$ are fixed row and column weights inferred from the data. Shahid et al. (2016) also use quadratic row and column penalties to perform joint linear dimension reduction on the rows and columns of a data matrix. Penalties like the ones given in (4), unlike the ones considered in Assumption 2.2, smooth out more significant systematic variations more aggressively and shrink, but do not completely eliminate, small noisy variations.

A simple concrete example illuminates the differences between the roughness penalties considered in this paper and commonly used existing penalties. Let $U_{1\cdot} = (1, 0)^T$ and $U_{2\cdot} = (1 + \delta, 0)^T$, and take $\Omega$ to be as defined in (3). Then:

|  | $\delta = 10^{-4}$ | $\delta = 10^{4}$ |
|---|---|---|
| $\lVert U_{1\cdot} - U_{2\cdot} \rVert_2 = \delta$ | $10^{-4}$ | $10^{4}$ |
| $\Omega(\lVert U_{1\cdot} - U_{2\cdot} \rVert_2)$ | $\Omega(10^{-4})$ | $\Omega(10^{4})$ |
| $\lVert U_{1\cdot} - U_{2\cdot} \rVert_2^2 = \delta^2$ | $10^{-8}$ | $10^{8}$ |

Consider a small difference between the first and second rows of $U$, e.g. $\delta = 10^{-4}$. Then small differences are penalized the most by the concave $\Omega$ and the least by the convex quadratic penalty. Suppose instead there is a large difference between the first and second rows of $U$, e.g. $\delta = 10^{4}$. Then large differences are penalized the least by the concave $\Omega$ and the most by the convex quadratic penalty.

The practical consequence of the differences highlighted by this example is that the convex penalties, whether $\Omega$ is linear or quadratic, do not introduce enough smoothing for small differences and introduce too much smoothing for large differences. Indeed, Pan et al. (2013), Marchetti & Zhou (2014), and Wu et al. (2016) showed that superior clustering results could be had, when only $J_r$ or $J_c$ is used, by using a concave $\Omega$.
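The comparison in the table can be reproduced numerically. The short sketch below assumes the reconstructed form of (3), i.e. $\Omega(z) = \int_0^z d\zeta/(\zeta + \epsilon) = \log(1 + z/\epsilon)$ with $\epsilon = 10^{-12}$; the printed values are illustrative only.

```python
# Illustrative comparison of the concave and quadratic penalties for
# U_1 = (1, 0), U_2 = (1 + delta, 0), assuming Omega(z) = log(1 + z/eps).
import numpy as np

def omega(z, eps=1e-12):
    return np.log1p(z / eps)

for delta in (1e-4, 1e4):
    d = np.linalg.norm(np.array([1.0, 0.0]) - np.array([1.0 + delta, 0.0]))
    print(f"delta = {delta:g}: ||.||_2 = {d:g}, "
          f"Omega(||.||_2) = {omega(d):.1f}, ||.||_2^2 = {d ** 2:g}")
# The concave penalty charges small differences relatively heavily and grows
# only slowly for large ones; the quadratic penalty does the opposite.
```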
Chi et al. (2017) also showed empirically that the solution to the convex bi-clustering problem tended to identify too many bi-clusters, and consequently introduced a one-step reweighted convex bi-clustering refinement, which recovered the true bi-clusters more accurately in simulation experiments. The reweighting refinement can be seen as taking a single step in an iterative algorithm for inexactly minimizing (1) when $\Omega$ is concave (Chi et al., 2018; Chi & Steinerberger, 2018). Since our method combines bi-clustering of incomplete data at different scales, i.e. values of $\gamma_r$ and $\gamma_c$, we extend the reweighting refinement introduced by Chi et al. (2017) in Section 2.1.

Our problem formulation (1) is distinct from related problem formulations in the following ways:

1. Rows and columns of $U$ are simultaneously shrunk towards each other as the parameters $\gamma_r$ and $\gamma_c$ increase. Note that this shrinkage procedure is fundamentally different from methods like the clustered dendrogram, which cluster the rows and columns independently, as well as from alternating partition-tree construction procedures (Gavish & Coifman, 2012; Mishne et al., 2016).

2. Our ultimate goal is not to perform matrix completion (though this is a by-product of our approach) but rather to perform joint row and column dimension reduction.

3. Our work generalizes both (Shahid et al., 2016) and (Chi et al., 2017) in that we seek the flexibility of performing nonlinear dimension reduction on the rows and columns of the data matrix, i.e. a more general manifold organization than a co-clustered structure.

4. Instead of determining an optimal single scale of the solution as in (Shahid et al., 2016; Chi et al., 2017), we recognize that the multiple scales of the different solutions can be aggregated to better estimate the underlying geometry, similar to the tree-based Earth mover's distance in (Ankenman, 2014; Mishne et al., 2017).

2.1. Co-Clustering Algorithm

The objective function (1) is a difference of convex functions when $\Omega$ is concave as in Assumption 2.2, and consequently can be inexactly minimized via a difference-of-convex (DC) algorithm (Horst & Thoai, 1999). The DC algorithm minimizes a sequence of convex surrogate functions. The surrogate functions are coupled in that each surrogate function depends on the minimizer of the previous surrogate function in the sequence. The $(t+1)$th iterate of the DC algorithm is given by $U_{t+1} = \arg\min_U g(U \mid U_t)$, where

$$g(U \mid U_t) = \frac{1}{2} \| \tilde{X} - U \|_F^2 + \gamma_r \sum_{(i,j) \in E_r} w_{r,ij} \| U_{i\cdot} - U_{j\cdot} \|_2 + \gamma_c \sum_{(i,j) \in E_c} w_{c,ij} \| U_{\cdot i} - U_{\cdot j} \|_2,$$

$\tilde{X}$ denotes $X$ with its missing values replaced by the corresponding values of $U_t$, and $w_{r,ij}$ and $w_{c,ij}$ are weights that depend on $U_t$, i.e.

$$w_{r,ij} = \Omega'\big(\| U_{t,i\cdot} - U_{t,j\cdot} \|_2\big), \qquad w_{c,ij} = \Omega'\big(\| U_{t,\cdot i} - U_{t,\cdot j} \|_2\big), \qquad (5)$$

where $\Omega'$ denotes the right directional derivative of $\Omega$. Minimizing $g(U \mid U_t)$ is equivalent to minimizing the objective function of the convex bi-clustering problem, for which efficient algorithms have been introduced (Chi et al., 2017). Thus, in the $(t+1)$th iteration, our DC algorithm solves a convex bi-clustering problem where the missing values in $X$ have been replaced with the values of $U_t$ and the weights $w_{r,ij}$ and $w_{c,ij}$ have been computed based on $U_t$ according to (5). Note that the weights are continuously updated throughout the optimization, as opposed to the fixed weights in (Chi et al., 2017). This introduces a notion of the scale of the solution into the weights.
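A minimal sketch of the reweighting step (5): given the current iterate $U_t$ and the row and column edge sets, recompute the weights via the right derivative of $\Omega$. The closed form $\Omega'(z) = 1/(z + \epsilon)$ assumes the reconstructed penalty (3); the function and variable names are illustrative, not taken from the authors' code.

```python
# Minimal sketch of the reweighting step (5); names are illustrative.
# Omega'(z) = 1/(z + eps) assumes the reconstructed penalty (3).
import numpy as np

EPS = 1e-12

def omega_prime(z, eps=EPS):
    # Right directional derivative of the concave penalty.
    return 1.0 / (z + eps)

def update_weights(U_t, row_edges, col_edges, eps=EPS):
    """Recompute w_{r,ij} and w_{c,ij} from the current iterate U_t."""
    w_r = {(i, j): omega_prime(np.linalg.norm(U_t[i, :] - U_t[j, :]), eps)
           for (i, j) in row_edges}
    w_c = {(i, j): omega_prime(np.linalg.norm(U_t[:, i] - U_t[:, j]), eps)
           for (i, j) in col_edges}
    return w_r, w_c
```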
Algorithm 1 CO-CLUSTER-MISSING$(P_\Theta(X), \gamma_r, \gamma_c)$
1: Initialize $U_0$, $w_{r,ij}$, and $w_{c,ij}$
2: repeat
3: $\tilde{X} \leftarrow P_\Theta(X) + P_{\Theta^c}(U_{t-1})$
4: $\{U_t, n_r, n_c\} \leftarrow$ CVX-BCLUST$(\tilde{X}, \gamma_r, \gamma_c, w_r, w_c)$
5: $w_{r,ij} \leftarrow \Omega'(\| U_{t,i\cdot} - U_{t,j\cdot} \|_2)$ for all $(i,j) \in E_r$
6: $w_{c,ij} \leftarrow \Omega'(\| U_{t,\cdot i} - U_{t,\cdot j} \|_2)$ for all $(i,j) \in E_c$
7: until convergence
8: Return $\{U(\gamma_r, \gamma_c) = U_t, \tilde{X}, n_r, n_c\}$

Algorithm 1 summarizes our DC algorithm, CO-CLUSTER-MISSING, which returns a smooth output matrix $U(\gamma_r, \gamma_c)$, a filled-in matrix $\tilde{X} = P_\Theta(X) + P_{\Theta^c}(U(\gamma_r, \gamma_c))$, as well as $n_r$ and $n_c$, the number of distinct rows and columns in $U(\gamma_r, \gamma_c)$, respectively. CVX-BCLUST denotes the solution to the convex bi-clustering problem. Details on the convex bi-clustering algorithm, such as its computational complexity, are presented in the supplement. The CO-CLUSTER-MISSING algorithm has the following convergence guarantee.

Proposition 1 Under Assumption 2.1 and Assumption 2.2, the sequence $U_t$ generated by Algorithm 1 has at least one limit point, and all limit points are stationary points of (1).

The proof of Proposition 1 is presented in the supplement.

2.2. Co-clustering at multiple scales

Initializing Algorithm 1 is very important since the objective function in (1) is not convex. The matrix $U_0$ is initialized to be the mean of all non-missing values. The connectivity graphs $E_r$ and $E_c$ are initialized at the beginning using k-nearest-neighbor graphs and remain fixed throughout all considered scales. If we observed the complete matrix, employing a sparse Gaussian kernel would be a natural way to quantify the local similarity between pairs of rows and pairs of columns. The challenge is that we do not have the complete data matrix $X$ but only the partially observed $P_\Theta(X)$. Therefore, we rely only on the observed values to calculate the k-nearest-neighbor graph, based on the distance used by Ram et al. (2013) in an image inpainting problem.

Solving the co-clustering problem in Algorithm 1 at a single scale yields the smooth estimate $U$, the filled-in data matrix $\tilde{X}$, and $n_r$ and $n_c$, the number of distinct row and column clusters, respectively, identified at that scale through rows and columns merging in $U$. To obtain a collection of estimates at multiple scales, we solve the optimization problem for pairs of values of $\gamma_r, \gamma_c$ set on a logarithmic scale (Chi & Steinerberger, 2018) until we have converged to a single global bi-cluster, i.e. $n_r = n_c = 1$. We start with small values $\gamma_r = 2^{l_0}$ and $\gamma_c = 2^{k_0}$, where $l_0, k_0 < 0$. We compute a co-clustering via Algorithm 1 to obtain the smooth estimate $U^{(l_0,k_0)} = U(2^{l_0}, 2^{k_0})$, used to fill in the data matrix $\tilde{X}^{(l_0,k_0)}$. Keeping $\gamma_r$ fixed, we continue increasing $\gamma_c$ by powers of 2 and applying the bi-clustering until the algorithm converges to one cluster along the columns ($n_c = 1$). We then increase $\gamma_r$ by a power of 2 and reset $\gamma_c = 2^{k_0}$. We repeat this procedure at increasing scales of $\gamma_r = 2^l$, $\gamma_c = 2^k$, until we have converged to a single global bi-cluster. This multi-scale procedure yields a collection of filled-in matrices at all scales, $\{\tilde{X}^{(l,k)}\}_{l,k}$. Note that $l$ and $k$ denote the powers of 2 taken for specific row and column cost parameters in the solution. This is intended as a compact notation that associates a pair of parameters $(\gamma_r, \gamma_c)$ with their solution $U^{(l,k)}$ and filled-in estimate $\tilde{X}^{(l,k)}$. Algorithm 2 summarizes this sweep together with the affinity and embedding steps described in Section 3.

Algorithm 2 Co-manifold learning on an incomplete data matrix
1: Initialize $E_r$, $E_c$
2: Set $d(X_{\cdot i}, X_{\cdot j}) = 0$ and $d(X_{i\cdot}, X_{j\cdot}) = 0$
3: Set $n_r = m$, $n_c = n$, $k = k_0$, and $l = l_0$
4: while $n_r > 1$ do
5: while $n_c > 1$ do
6: $\{U^{(l,k)}, \tilde{X}^{(l,k)}, n_r, n_c\} \leftarrow$ CO-CLUSTER-MISSING$(P_\Theta(X), \gamma_r = 2^l, \gamma_c = 2^k)$
7: Update row and column distances: $d(X_{i\cdot}, X_{j\cdot}) \mathrel{+}= d(\tilde{X}^{(l,k)}_{i\cdot}, \tilde{X}^{(l,k)}_{j\cdot})$ and $d(X_{\cdot i}, X_{\cdot j}) \mathrel{+}= d(\tilde{X}^{(l,k)}_{\cdot i}, \tilde{X}^{(l,k)}_{\cdot j})$
8: $k \leftarrow k + 1$
9: end while
10: $l \leftarrow l + 1$; reset $k \leftarrow k_0$ and $n_c \leftarrow n$
11: end while
12: Calculate affinities $A_r(X_{i\cdot}, X_{j\cdot})$ and $A_c(X_{\cdot i}, X_{\cdot j})$
13: Calculate embeddings $\Psi_r$, $\Psi_c$
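The sweep over scales can be sketched as follows. This is not the authors' implementation: `co_cluster_missing` below is a toy stand-in for Algorithm 1 (it only fills missing entries with the observed mean and fakes the merge counts), so that the loop structure of Algorithm 2, steps 3 through 11, can be run end to end; all names and thresholds are illustrative.

```python
# Sketch of the multi-scale sweep (Algorithm 2, steps 3-11).
# `co_cluster_missing` is a TOY STAND-IN for Algorithm 1, not the DC solver.
import numpy as np

def co_cluster_missing(X, mask, gamma_r, gamma_c):
    U = np.where(mask, X, X[mask].mean())      # placeholder "smooth" estimate
    n_r = 1 if gamma_r > 8 else X.shape[0]     # pretend rows merge once gamma_r is large
    n_c = 1 if gamma_c > 8 else X.shape[1]     # likewise for columns
    return U, U, n_r, n_c                      # smooth estimate and filled-in matrix coincide here

def multiscale_estimates(X, mask, l0=-2, k0=-2):
    """Return {(l, k): filled-in matrix} for gamma_r = 2**l, gamma_c = 2**k."""
    filled = {}
    l, n_r = l0, X.shape[0]
    while n_r > 1:
        k, n_c = k0, X.shape[1]                # reset the column scale for each row scale
        while n_c > 1:
            _, X_lk, n_r, n_c = co_cluster_missing(X, mask, 2.0 ** l, 2.0 ** k)
            filled[(l, k)] = X_lk
            k += 1
        l += 1
    return filled
```

In practice the placeholder would be replaced by the DC solver of Algorithm 1, and the returned dictionary of filled-in matrices feeds the multi-scale metric defined in Section 3.1.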
3. Co-manifold learning

Kernel-based manifold learning relies on constructing a good similarity measure between points and a dimension reduction method based on this similarity. The eigenvectors of these kernels are typically used as the new low-dimensional coordinates for the data. Here we leverage having calculated an estimate of the filled-in matrix at multiple scales, $\{\tilde{X}^{(l,k)}\}_{l,k}$, to define a new metric between rows and columns. This metric encompasses all bi-scales as defined by joint pairs of optimization cost parameters $\gamma_r, \gamma_c$. Given the new metric, we employ diffusion maps (Coifman & Lafon, 2006) to obtain a new embedding of the rows and columns. The full algorithm is given in Algorithm 2.

3.1. Multi-scale metric

We define a new metric to estimate the geometry of the complete data matrix both locally and globally. For a given pair $\gamma_r, \gamma_c$, we calculate the Euclidean distance between rows of the filled-in matrix at that joint scale, weighted by the cost parameters:

$$d\big(\tilde{X}^{(l,k)}_{i\cdot}, \tilde{X}^{(l,k)}_{j\cdot}\big) = (\gamma_r \gamma_c)^{\alpha} \, \big\| \tilde{X}^{(l,k)}_{i\cdot} - \tilde{X}^{(l,k)}_{j\cdot} \big\|_2,$$

where $\tilde{X}^{(l,k)} = P_\Theta(X) + P_{\Theta^c}(U^{(l,k)})$, and we set $\alpha = 1/2$ to favor local over global structure in our simulations. Our goal is to aggregate distances between a pair of rows (columns) across multiple scales of the solution, to calculate a metric that better recovers the local and global geometry of the data despite the missing values, thus repairing the metric corrupted by the missing data. Having solved for multiple pairs from the solution surface, we sum over all the distances to obtain a multi-scale distance on the data rows:

$$d(X_{i\cdot}, X_{j\cdot}) = \sum_{l,k} d\big(\tilde{X}^{(l,k)}_{i\cdot}, \tilde{X}^{(l,k)}_{j\cdot}\big).$$

An analogous multi-scale distance is computed for pairs of columns. Note that if there are no missing values, this metric is just the Euclidean pairwise distance scaled by a scalar, so that we recover the original embedding of the complete matrix. This metric takes advantage of solving the optimization for multiple pairs of cost parameters and filling in the missing values with increasingly smooth estimates (as $\gamma_r$ and $\gamma_c$ increase). It also alleviates the need to identify the ideal scale at which to fill in the points; it is not clear that a single optimal scale actually exists, and different points in the matrix may have different optimal scales. As opposed to the partition-tree based metric of Mishne et al. (2017), this metric accounts for all joint scales of the data as $U$ is smoothed across rows and columns simultaneously, thus fully taking advantage of the coupling between both modes.
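Given a collection of filled-in matrices, the multi-scale row and column distances can be sketched as below ($\alpha = 1/2$ as in the text). The dictionary format, keyed by $(l, k)$, matches the placeholder sweep sketched in Section 2.2 and is an assumption, not the authors' data structure.

```python
# Sketch of the multi-scale distances of Section 3.1, computed from a
# collection {(l, k): X^(l,k)} of filled-in matrices (alpha = 1/2 as in the text).
import numpy as np
from scipy.spatial.distance import pdist, squareform

def multiscale_metric(filled, alpha=0.5):
    """Return accumulated pairwise row and column distance matrices."""
    m, n = next(iter(filled.values())).shape
    D_rows, D_cols = np.zeros((m, m)), np.zeros((n, n))
    for (l, k), X_lk in filled.items():
        w = (2.0 ** l * 2.0 ** k) ** alpha         # (gamma_r * gamma_c)^alpha
        D_rows += w * squareform(pdist(X_lk))      # Euclidean distances between rows
        D_cols += w * squareform(pdist(X_lk.T))    # Euclidean distances between columns
    return D_rows, D_cols
```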
3.2. Diffusion maps

Having calculated a multi-scale metric on the rows and columns throughout the joint optimization procedure, we can now construct a pair of low-dimensional embeddings based on these distances. Specifically, we use diffusion maps (Coifman & Lafon, 2006), but any dimension reduction technique relying on the construction of a distance kernel could be used instead. We briefly review the construction of the diffusion maps for the rows (features) of a matrix; the same can be applied to the columns (observations). Given a distance $d(X_{i\cdot}, X_{j\cdot})$ between two rows of the matrix, we construct an affinity kernel on the rows. We choose an exponential function, but other kernels can be considered depending on the application:

$$A[i,j] = \exp\{- d^2(X_{i\cdot}, X_{j\cdot}) / \sigma^2\},$$

where $\sigma$ is a scale parameter. The exponential function enhances locality, as pairs of samples whose distance exceeds $\sigma$ have negligible affinity. One possible choice for $\sigma$ is the median of the distances within the data.

We derive a row-stochastic matrix $P$ by normalizing the rows of the matrix $A$: $P = D^{-1} A$, where $D$ is a diagonal matrix whose elements are given by $D[i,i] = \sum_j A[i,j]$. The eigendecomposition of $P$ yields a sequence of positive decreasing eigenvalues, $1 = \lambda_0 \geq \lambda_1 \geq \ldots$, and right eigenvectors $\{\psi_\ell\}_\ell$. Retaining only the first $d$ eigenvalues and eigenvectors, the mapping $\Psi$ embeds the rows into the Euclidean space $\mathbb{R}^d$:

$$\Psi : X_{i\cdot} \mapsto \big( \lambda_1 \psi_1(i), \lambda_2 \psi_2(i), \ldots, \lambda_d \psi_d(i) \big)^T.$$

The embedding integrates the local connections found in the data into a global representation, which enables visualization of the data, organizes the data into meaningful clusters, and identifies outliers and singular samples. This embedding is also equipped with a noise-robust distance, the diffusion distance (Coifman & Lafon, 2006).

4. Numerical Experiments

The model we consider in this paper is one in which the data is not represented by a bi-clustering model but rather at least one of the modes (rows/columns) lies on a low-dimensional manifold. In our experiments we consider three such examples. In the first, a manifold structure exists along both rows and columns; in the second and third, the columns belong to disjoint clusters while the rows lie on a manifold:

1. linkage A synthetic dataset with a 1D manifold along the rows and a 2D manifold along the columns. Let $\{z_i\}_{i=1}^{N_1} \subset \mathbb{R}^3$ be points along a helix and let $\{y_j\}_{j=1}^{N_2} \subset \mathbb{R}^3$ be a 2D surface, where we set $N_1 = 190$, $N_2 = 300$. We analyze the matrix of Euclidean distances between the two spatially distant sets of points to reveal the underlying geometry of both rows and columns,
$$X[i,j] = \| z_i - y_j \|_2. \qquad (6)$$
Other functions of the distance can also be used, such as the elastic or Coulomb potential operator (Coifman & Gavish, 2011). Missing values correspond to having access to only some of the distances between pairs of points across the two sets. Note that this is unlike multidimensional scaling, since these are not pairwise distances between all data points, but rather distances between two sets of points.

2. linkage2 A synthetic dataset with a clustered structure along the rows and a two-dimensional manifold along the columns. Let $\{x_i\}_{i=1}^{N_1} \subset \mathbb{R}^3$ be composed of points in 3 Gaussian clouds in 3D and let $\{y_j\}_{j=1}^{N_2} \subset \mathbb{R}^3$ be a two-dimensional surface as before, with $N_1 = 200$, $N_2 = 300$.

3. lung500 A real-world dataset composed of 56 lung cancer patients and their gene expression (Lee et al., 2010). We selected the 500 genes with the greatest variance from the original collection of 12,625 genes. Subjects belong to one of four subgroups; they are either normal subjects (Normal) or have been diagnosed with one of three types of cancer: pulmonary carcinoid tumors (Carcinoid), colon metastases (Colon), or small cell carcinoma (Small Cell).

For all datasets, the rows and columns of the data matrix are randomly permuted so that their natural order does not play a role in inferring the geometry. We evaluate results both qualitatively and quantitatively. We compare our embeddings to three approaches: NLPCA with missing data completion (Scholz et al., 2005), Fast Robust PCA on Graphs (FRPCAG) (Shahid et al., 2016), and Diffusion Maps (DM) (Coifman & Lafon, 2006) applied to the corrupted, i.e. incompletely observed, data. FRPCAG provides a linear embedding of a low-rank estimate of a data matrix using quadratic row and column penalties, while the other methods output nonlinear embeddings. NLPCA and DM are applied to each mode separately, while our method and FRPCAG take into account the coupled geometry.
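Before turning to the results, here is a minimal sketch of the embedding step of Section 3.2, i.e. a diffusion-map style embedding computed from a precomputed distance matrix; it is also essentially what the DM baseline does when given distances computed from the corrupted data. The median-based $\sigma$ and the dense eigendecomposition are simplifying choices, not the authors' implementation.

```python
# Minimal sketch of the embedding step of Section 3.2: affinity kernel,
# row-stochastic normalization, eigendecomposition.
import numpy as np

def diffusion_embedding(dist, n_components=2, sigma=None):
    """dist: symmetric pairwise distance matrix; returns an (n, n_components) embedding."""
    if sigma is None:
        sigma = np.median(dist[dist > 0])       # median-distance heuristic for the scale
    A = np.exp(-(dist ** 2) / sigma ** 2)       # affinity kernel A[i, j]
    P = A / A.sum(axis=1, keepdims=True)        # row-normalize: P = D^{-1} A
    evals, evecs = np.linalg.eig(P)             # right eigenvectors of P
    order = np.argsort(-evals.real)             # eigenvalues are real here since A is symmetric
    evals, evecs = evals.real[order], evecs.real[:, order]
    # Drop the trivial first pair (eigenvalue 1, constant eigenvector).
    return evecs[:, 1:n_components + 1] * evals[1:n_components + 1]
```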
Comparing to diffusion maps demonstrates how missing values corrupt the embedding. Note that this is also equivalent to applying our approach at only a single scale of the cost parameters $(\gamma_r, \gamma_c)$, for which the solution $U$ converges to the grand mean of the data.

In Figure 2, we display the embeddings produced by the different methods for each of the three datasets, for both their rows (top) and their columns (bottom), where 50% of the entries have been removed.

Figure 2: Comparing row and column embeddings of NLPCA, FRPCAG, DM, and our method for the three datasets with 50% missing entries. For each dataset, the top/bottom row shows the embedding of the rows/columns of X. For the lung500 dataset, the cluster colors are: yellow = normal subjects, dark blue = carcinoid, cyan = colon, red = small cell carcinoma.

Both NLPCA and DM reveal the underlying 2D surface structure on the rows in only one of the linkage datasets and err greatly on the other. DM correctly infers a 1D path for the linkage dataset, but it is increasingly noisy. For NLPCA, the 1D embedding is not as smooth and clean as the embedding inferred by the co-manifold approach. Our method reveals the 2D surface in both cases. FRPCAG is unsuccessful in uncovering the nonlinear manifolds underlying either the rows or the columns of the linkage dataset. For linkage2, the rows do not separate cleanly into disjoint clusters, and for the columns only one of the parameters of the 2D plane is uncovered.

For the lung500 data, NLPCA and DM embed the cancer samples such that the normal subjects (yellow) are close to the Colon type (cyan), whereas both our method and FRPCAG separate the normal subjects from the cancer types. This is due to taking into account the coupled structure of the genes and the samples. As opposed to the clustered structure along the samples, the three nonlinear methods reveal a smooth manifold structure for the genes, which differs from the clustered structure a bi-clustering method would assume. The representation yielded by FRPCAG is the least structured, but it too does not reveal disjoint gene clusters. For plots presenting the datasets, the filled-in values at multiple scales, and an evaluation of the metric distortion introduced by the embeddings, see the supplement.

Manifold learning is used not only for visualization but also for calculating new representations for signal processing and machine learning tasks. Here we evaluate clustering accuracy when clustering the low-dimensional representation of the data.

Figure 3: Evaluation of k-means clustering applied to the embedding of the data for increasing percentages of missing values, using ARI compared to the ground-truth labels of (top) the 4 cancer types for the lung500 dataset, and (bottom) the 3 Gaussian clusters of the linkage2 dataset. (Curves: Co-manifold, DM-missing, NLPCA, and FRPCAG with two parameter settings; x-axis: percentage of missing values.)

We apply k-means to the column embeddings of each method, with k set to the correct number of clusters in the data, as we want to evaluate the ability of the methods to properly represent the data without being sensitive to empirical estimation of the number of clusters. We use the Adjusted Rand Index (ARI) (Hubert & Arabie, 1985) to measure the similarity between the k-means clustering of the embedding and the ground-truth labels. An ARI of 0 indicates no agreement between two clusterings, and an ARI of 1 indicates perfect agreement.
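The clustering evaluation itself is straightforward; a minimal sketch using scikit-learn (an assumption about tooling, not stated in the paper) is given below.

```python
# Minimal sketch of the evaluation: k-means on a column embedding, scored
# with ARI against ground-truth labels.
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def clustering_ari(embedding, labels, n_clusters, seed=0):
    """embedding: (n_samples, d) array; labels: ground-truth class labels."""
    pred = KMeans(n_clusters=n_clusters, n_init=10,
                  random_state=seed).fit_predict(embedding)
    return adjusted_rand_score(labels, pred)
```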
We note that Shahid et al. (2016) do not provide guidelines by which to select the appropriate cost parameters $\gamma_r$ and $\gamma_c$ in the FRPCAG solution. Therefore, we calculated the solution over a wide range of possible values and plot the results for the worst and best performance to demonstrate the sensitivity to these parameters. In contrast, our approach eliminates this parameter selection step.

The top panel of Figure 3 compares clustering the embedding of the cancer patients in lung500 by each method for an increasing percentage of missing values in the data, averaged over 30 realizations of missing entries. For low percentages of missing data, FRPCAG performs best, but its performance degrades as the percentage of missing values increases. For higher percentages of missing data (50% and above), our embedding (blue plot) gives the best clustering result, and its performance is significantly degraded only at 90% missing values, as opposed to diffusion maps (red plot). This demonstrates that the metric we calculate is a good estimate of the metric of the complete data matrix. NLPCA (yellow plot) performs worst.

The bottom panel of Figure 3 compares clustering the embedding of the three Gaussian clusters in linkage2 for an increasing percentage of missing values in the data, averaged over 30 realizations of both the data and the missing entries. Our embedding (blue plot) gives the best clustering result (for 20% missing values and above), and its performance is unaffected by increasing the percentage of missing values up to 80%, as opposed to diffusion maps (red plot), which is greatly degraded by the missing values. NLPCA (yellow plot) does not perform as well as our approach, with performance decreasing as the percentage of missing values increases. For a good parameter selection, FRPCAG (purple plot) performs well for a low percentage of missing values, but its performance is drastically impaired above 50%. For a poor parameter selection, FRPCAG (green plot) has the worst performance for practically all percentages of missing values. Note that the overall poor performance of FRPCAG in this setting is due to the underlying manifold being nonlinear and the data being high-rank, demonstrating the need for nonlinear embedding approaches.

5. Conclusions

In this paper, we presented a new method for learning nonlinear manifold representations of both the rows and the columns of a matrix with missing data. We proposed a new optimization problem to obtain a smooth estimate of the missing data matrix, and solved this problem for different values of the cost parameters, which encode the smoothness scale of the estimate along the rows and columns. We leverage these multi-scale estimates to construct a new metric that aims to capture the geometry of the complete data matrix. This metric is then used in a kernel-based manifold learning technique to obtain new representations of both the rows and the columns. In future work, we will investigate additional metrics in a general co-manifold setting and relate them to optimal transport problems and the Earth Mover's Distance (Coifman & Leeb, 2013). We also intend to develop efficient solutions to accelerate the optimization in order to address large-scale datasets, in addition to the small-scale regime we demonstrate here.
We note that the datasets considered here, while small-scale in the observation domain, are high-dimensional in the feature domain, which is a non-trivial setting and indeed a challenge for supervised methods such as deep learning due to limited training data. Finally, we note that in this work we considered the case where the missingness pattern was completely at random. We leave to future work the more complicated problem setting where data may be missing in a more systematic fashion.

Acknowledgements

ECC thanks the National Science Foundation for partial support of this work under grant DMS-1752692. GM and RRC were supported by the NIH under grant R01NS100049, and by the National Institute of Neurological Disorders and Stroke (NINDS) and the National Institute of Biomedical Imaging and Bioengineering (NIBIB) under grant R01EB026936.

References

Ahn, Y.-Y., Bagrow, J. P., and Lehmann, S. Link communities reveal multiscale complexity in networks. Nature, 466(7307):761–764, 2010.

Ankenman, J. I. Geometry and Analysis of Dual Networks on Questionnaires. PhD thesis, Yale University, 2014.

Belkin, M. and Niyogi, P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.

Carreira-Perpiñán, M. Á. and Lu, Z. Manifold learning and missing data recovery through unsupervised regression. In 2011 IEEE 11th International Conference on Data Mining (ICDM), pp. 1014–1019, 2011.

Chi, E. C. and Lange, K. Splitting methods for convex clustering. Journal of Computational and Graphical Statistics, 24(4):994–1013, 2015.

Chi, E. C. and Steinerberger, S. Recovering trees with convex clustering. SIAM Journal on Mathematics of Data Science, 2018. In press.

Chi, E. C., Allen, G. I., and Baraniuk, R. G. Convex biclustering. Biometrics, 73(1):10–19, 2017.

Chi, E. C., Gaines, B. R., Sun, W. W., Zhou, H., and Yang, J. Provable convex co-clustering of tensors. arXiv:1803.06518 [stat.ME], 2018.

Coifman, R. R. and Gavish, M. Harmonic analysis of digital data bases. In Wavelets and Multiscale Analysis, pp. 161–197. Springer, 2011.

Coifman, R. R. and Lafon, S. Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):5–30, 2006.

Coifman, R. R. and Leeb, W. E. Earth mover's distance and equivalent metrics for spaces with hierarchical partition trees. Technical report YALEU/DCS/TR1482, Yale University, 2013.

Gavish, M. and Coifman, R. R. Sampling, denoising and compression of matrices by coherent matrix organization. Applied and Computational Harmonic Analysis, 33(3):354–369, 2012.

Gilbert, A. C. and Sonthalia, R. Unrolling swiss cheese: Metric repair on manifolds with holes. In 56th Annual Allerton Conference on Communication, Control, and Computing, 2018.

Hocking, T., Vert, J.-P., Bach, F., and Joulin, A. Clusterpath: An algorithm for clustering using convex fusion penalties. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 745–752, 2011.

Horst, R. and Thoai, N. V. DC programming: Overview. Journal of Optimization Theory and Applications, 103(1):1–43, 1999.

Hubert, L. and Arabie, P. Comparing partitions. Journal of Classification, 2(1):193–218, 1985.

Kalofolias, V., Bresson, X., Bronstein, M., and Vandergheynst, P. Matrix completion on graphs. arXiv:1408.1717 [cs.LG], 2014.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations (ICLR), 2014.
Lee, M., Shen, H., Huang, J. Z., and Marron, J. Biclustering via sparse singular value decomposition. Biometrics, 66(4):1087–1095, 2010.

Lindsten, F., Ohlsson, H., and Ljung, L. Just relax and come clustering! A convexification of k-means clustering. Technical report, Linköpings Universitet, 2011.

Marchetti, Y. and Zhou, Q. Solution path clustering with adaptive concave penalty. Electronic Journal of Statistics, 8(1):1569–1603, 2014.

Mishne, G., Talmon, R., Meir, R., Schiller, J., Lavzin, M., Dubin, U., and Coifman, R. R. Hierarchical coupled-geometry analysis for neuronal structure and activity pattern discovery. IEEE Journal of Selected Topics in Signal Processing, 10(7):1238–1253, 2016.

Mishne, G., Talmon, R., Cohen, I., Coifman, R. R., and Kluger, Y. Data-driven tree transforms and metrics. IEEE Transactions on Signal and Information Processing over Networks, 4(3):451–466, 2018.

Pan, W., Shen, X., and Liu, B. Cluster analysis: Unsupervised learning via supervised learning with a non-convex penalty. Journal of Machine Learning Research, 14:1865–1889, 2013.

Pelckmans, K., De Brabanter, J., Suykens, J., and De Moor, B. Convex clustering shrinkage. In PASCAL Workshop on Statistics and Optimization of Clustering, 2005.

Ram, I., Elad, M., and Cohen, I. Image processing using smooth ordering of its patches. IEEE Transactions on Image Processing, 22(7):2764–2774, 2013.

Rao, N., Yu, H.-F., Ravikumar, P. K., and Dhillon, I. S. Collaborative filtering with graph information: Consistency and scalable methods. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 28, pp. 2107–2115. Curran Associates, Inc., 2015.

Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. Contractive auto-encoders: Explicit invariance during feature extraction. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 833–840, 2011.

Roweis, S. T. and Saul, L. K. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.

Scholz, M., Kaplan, F., Guy, C. L., Kopka, J., and Selbig, J. Non-linear PCA: A missing data approach. Bioinformatics, 21(20):3887–3895, 2005.

Shahid, N., Perraudin, N., Kalofolias, V., Puy, G., and Vandergheynst, P. Fast robust PCA on graphs. IEEE Journal of Selected Topics in Signal Processing, 10(4):740–756, 2016.

Tenenbaum, J. B., de Silva, V., and Langford, J. C. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.

Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning (ICML-08), pp. 1096–1103, 2008.

Wu, C., Kwon, S., Shen, X., and Pan, W. A new algorithm and theory for penalized regression-based clustering. Journal of Machine Learning Research, 17(188):1–25, 2016.

Yair, O., Talmon, R., Coifman, R. R., and Kevrekidis, I. G. Reconstruction of normal forms by learning informed observation geometries from data. Proceedings of the National Academy of Sciences, 114(38):E7865–E7874, 2017.