# Co-clustering through Optimal Transport

Charlotte Laclau¹, Ievgen Redko², Basarab Matei¹, Younès Bennani¹, Vincent Brault³

¹CNRS, LIPN, Université Paris 13 - Sorbonne Paris Cité, France. ²CNRS UMR 5220 - INSERM U1206, Univ. Lyon 1, INSA Lyon, F-69621 Villeurbanne, France. ³CNRS, LJK, Univ. Grenoble-Alpes, France. Correspondence to: Charlotte Laclau. The first author of this paper is now a post-doc in CNRS, LIG, Univ. Grenoble-Alpes, France.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Abstract. In this paper, we present a novel method for co-clustering, an unsupervised learning approach that aims at discovering homogeneous groups of data instances and features by grouping them simultaneously. The proposed method uses the entropy regularized optimal transport between empirical measures defined on data instances and features in order to obtain an estimated joint probability density function represented by the optimal coupling matrix. This matrix is further factorized to obtain the induced row and column partitions using a multiscale representation approach. To justify our method theoretically, we show how the solution of the regularized optimal transport can be seen from the variational inference perspective, thus motivating its use for co-clustering. The algorithm derived for the proposed method and its kernelized version, based on the notion of the Gromov-Wasserstein distance, are fast, accurate and can automatically determine the number of both row and column clusters. These features are vividly demonstrated through extensive experimental evaluations.

## 1. Introduction

Cluster analysis aims to gather data instances into groups, called clusters, where instances within one group are similar among themselves while instances in different groups are as dissimilar as possible. Clustering methods have become more and more popular recently due to their ability to provide new insights into unlabeled data that may be difficult or even impossible to capture for a human being.

Clustering methods, however, do not take into account the possible relationships that exist between the features describing the data instances. For example, one may consider a data matrix extracted from a text corpus where each document is described by the terms appearing in it. In this case, clustering documents may benefit from knowledge about the correlation that exists between different terms, revealing their probability of appearing in the same documents. This idea is the cornerstone of co-clustering (Hartigan, 1972; Mirkin, 1996), where the goal is to perform clustering of both data points and features simultaneously. The obtained latent structure of the data is composed of blocks usually called co-clusters. Applications of co-clustering include, but are not limited to, recommendation systems (George & Merugu, 2005; Deodhar & Ghosh, 2010; Xu et al., 2012), gene expression analysis (Cheng et al., 2008; Hanisch et al., 2002) and text mining (Dhillon et al., 2003a; Wang et al., 2009). As a result, these methods are of increasing interest to the data mining community. Co-clustering methods are often divided into probabilistic methods (e.g., (Dhillon et al., 2003b; Banerjee et al., 2007; Nadif & Govaert, 2008; Wang et al., 2009; Shan & Banerjee, 2010)) and metric-based methods (e.g., (Rocci & Vichi, 2008; Ding et al., 2006)).
Probabilistic methods usually assume that the data was generated as a mixture of probability density functions, each of which corresponds to one co-cluster. The goal then is to estimate the parameters of the underlying distributions and the posterior probabilities of each co-cluster given the data. Metric-based approaches proceed in a different way and rely on introducing and optimizing a criterion that commonly takes into account intra- and inter-block variances. This criterion, in its turn, is defined using some proper metric function that describes the geometry of the data as precisely as possible. Both metric and probabilistic approaches are known to have their own advantages and limitations: despite being quite efficient at modeling the data distribution, probabilistic methods are computationally demanding and hardly scalable; metric methods are less computationally demanding but require choosing the right distance that uncovers the underlying latent co-cluster structure from the available data. Furthermore, the vast majority of co-clustering methods require the number of co-clusters to be set in advance. This is usually done using a computationally expensive exhaustive search over a large number of possible pairs of row and column clusters, as in (Keribin et al., 2015; Wyse & Friel, 2012; Wyse et al., 2014).

In this paper, we address the existing issues of the co-clustering methods described above by proposing a fundamentally new approach that efficiently solves the co-clustering problem from both qualitative and computational points of view and allows the automatic detection of the number of co-clusters. We pose the co-clustering problem as the task of transporting the empirical measure defined on the data instances to the empirical measure defined on the data features. The intuition behind this process is very natural to co-clustering and consists in capturing the associations between instances and features of the data matrix. The solution of the optimal transportation problem is given by a doubly stochastic coupling matrix which can be considered as the approximated joint probability distribution of the original data. Furthermore, the coupling matrix can be factorized into three terms, where one of them reflects the posterior distribution of the data given the co-clusters, while the two others represent the approximated distributions of data instances and features. We use these approximated distributions to obtain the final partitions. We also derive a kernelized version of our method that, contrary to the original case, is based on an optimal transportation metric defined on the space of dissimilarity functions.

The main novelty of our work is two-fold. To the best of our knowledge, the proposed approach is a first attempt to apply entropy regularized optimal transport to co-clustering and to give its solution a co-clustering interpretation. While the Wasserstein distance has already been adapted to design clustering algorithms (Cuturi & Doucet, 2014; Irpino et al., 2014), our idea is to concentrate on the solution of the optimal transport given by the coupling matrix rather than to minimize the quantization error with respect to (w.r.t.) the Wasserstein distance. We also note that using entropic regularization leads to a very efficient algorithm that can be easily parallelized (Cuturi, 2013).
Second, we show that, under some plausible assumptions, the density estimation procedure arising from the use of optimal transport results in a variational inference problem with the minimization of the reversed Kullback-Leibler divergence. The important implications of this difference w.r.t. other existing methods are explained in Section 3.

The rest of this paper is organized as follows. In Section 2, we briefly present the discrete version of the optimal transportation problem and its entropy regularized version. Section 3 proceeds with the description of the proposed approach, its theoretical analysis and algorithmic implementation. In Section 4, we evaluate our approach on synthetic and real-world data sets and show that it is accurate and substantially more efficient than other state-of-the-art methods. The last section concludes the paper and gives a couple of hints for possible future research.

## 2. Background and notations

In this section, we present the formalization of the Monge-Kantorovich (Kantorovich, 1942) optimization problem and its entropy regularized version.

### 2.1. Optimal transport

Optimal transportation theory was first introduced in (Monge, 1781) to study the problem of resource allocation. Assuming that we have a set of factories and a set of mines, the goal of optimal transportation is to move the ore from mines to factories in an optimal way, i.e., by minimizing the overall transport cost.

More formally, given two empirical probability measures¹ $\hat\mu_S = \frac{1}{N_S}\sum_{i=1}^{N_S}\delta_{x_i^S}$ and $\hat\mu_T = \frac{1}{N_T}\sum_{i=1}^{N_T}\delta_{x_i^T}$, defined as uniformly weighted sums of Diracs with mass at locations supported on two point sets $X_S = \{x_i^S \in \mathbb{R}^d\}_{i=1}^{N_S}$ and $X_T = \{x_i^T \in \mathbb{R}^d\}_{i=1}^{N_T}$, the Monge-Kantorovich problem consists in finding a probabilistic coupling $\gamma$, defined as a joint probability measure over $X_S \times X_T$ with marginals $\hat\mu_S$ and $\hat\mu_T$, that minimizes the cost of transport w.r.t. some metric $l : X_S \times X_T \rightarrow \mathbb{R}_+$:
$$\min_{\gamma \in \Pi(\hat\mu_S,\hat\mu_T)} \langle M, \gamma \rangle_F,$$
where $\langle\cdot,\cdot\rangle_F$ is the Frobenius dot product, $\Pi(\hat\mu_S, \hat\mu_T) = \{\gamma \in \mathbb{R}_+^{N_S \times N_T} \mid \gamma \mathbf{1} = \hat\mu_S,\ \gamma^T \mathbf{1} = \hat\mu_T\}$ is a set of doubly stochastic matrices and $M$ is a dissimilarity matrix, i.e., $M_{ij} = l(x_i^S, x_j^T)$, defining the energy needed to move a probability mass from $x_i^S$ to $x_j^T$. This problem admits a unique solution $\gamma^*$ and defines a metric on the space of probability measures (called the Wasserstein distance) as follows:
$$W(\hat\mu_S, \hat\mu_T) = \min_{\gamma \in \Pi(\hat\mu_S,\hat\mu_T)} \langle M, \gamma \rangle_F.$$

The Wasserstein distance has been successfully used in various applications, for instance: computer vision (Rubner et al., 2000), texture analysis (Rabin et al., 2011), tomographic reconstruction (I. Abraham & Carlier, 2016), domain adaptation (Courty et al., 2014), metric learning (Cuturi & Avis, 2014) and clustering (Cuturi & Doucet, 2014; Irpino et al., 2014). This latter application is of particular interest as the Wasserstein distance is known to be a very efficient metric due to its capability of taking into account the geometry of the data through the pairwise distances between samples. The success of algorithms based on this distance is also due to (Cuturi, 2013), who introduced an entropy regularized version of optimal transport that can be optimized efficiently using a matrix scaling algorithm. We present this regularization below.

¹Due to space limitations, we present only the discrete version of optimal transport. For more details on the general continuous case and the convergence of empirical measures, we refer the interested reader to the excellent monograph by (Villani, 2009).
### 2.2. Entropic regularization

The idea of using entropic regularization dates back to (Schrödinger, 1931). In (Cuturi, 2013), it found its application to the optimal transportation problem through the following objective function:
$$\min_{\gamma \in \Pi(\hat\mu_S,\hat\mu_T)} \langle M, \gamma \rangle_F - \frac{1}{\lambda}E(\gamma).$$
The second term in this equation, the entropy $E(\gamma) = -\sum_{i,j}^{N_S,N_T}\gamma_{i,j}\log(\gamma_{i,j})$, allows one to obtain smoother and more numerically stable solutions compared to the original case, and the regularized solution converges to the original one at an exponential rate (Benamou et al., 2015). Another advantage of entropic regularization is that it makes it possible to solve the optimal transportation problem efficiently using the Sinkhorn-Knopp matrix scaling algorithm (Sinkhorn & Knopp, 1967).

In the next section, we explain the main underlying idea of our approach, which consists in associating data instances with features through regularized optimal transport.

## 3. Co-clustering through optimal transport

In this section, we show how the co-clustering problem can be cast in a fundamentally new way and then solved using ideas from optimal transportation theory.

### 3.1. Problem setup

Let us denote by $X$ and $Y$ two random variables taking values in the sets $\{x_r\}_{r=1}^{n}$ and $\{y_c\}_{c=1}^{d}$, respectively, where the subscripts $r$ and $c$ correspond to rows (instances) and columns (features). Similar to (Dhillon et al., 2003b), we assume that the joint probability distribution between $X$ and $Y$, denoted by $p(X, Y)$, is estimated from the data matrix $A \in \mathbb{R}^{n \times d}$. We further assume that $X$ and $Y$ consist of instances that are distributed w.r.t. probability measures $\mu_r$, $\mu_c$ supported on $\Omega_r \subseteq \mathbb{R}^d$ and $\Omega_c \subseteq \mathbb{R}^n$, respectively. The problem of co-clustering consists in jointly grouping the set of features and the set of instances into homogeneous blocks by finding two assignment functions $C_r$ and $C_c$ that map as follows:
$$C_r : \{x_1, \dots, x_n\} \rightarrow \{\hat x_1, \dots, \hat x_g\}, \qquad C_c : \{y_1, \dots, y_d\} \rightarrow \{\hat y_1, \dots, \hat y_m\},$$
where $g$ and $m$ denote the number of row and column clusters, and the discrete random variables $\hat X$ and $\hat Y$ represent the partitions induced by $X$ and $Y$, i.e., $\hat X = C_r(X)$ and $\hat Y = C_c(Y)$. To use discrete optimal transport, we also define two empirical measures $\hat\mu_r$ and $\hat\mu_c$ based on $X$ and $Y$ as follows: $\hat\mu_r = \frac{1}{n}\sum_{i=1}^{n}\delta_{x_i}$ and $\hat\mu_c = \frac{1}{d}\sum_{i=1}^{d}\delta_{y_i}$. We are now ready to present our method.

### 3.2. Proposed approach

The main underlying idea of our approach is to use the optimal transportation presented above to find a probabilistic coupling of the empirical measures defined based on the rows and columns of a given data matrix. More formally, for some fixed $\lambda > 0$, we solve the co-clustering problem through the following optimization procedure:
$$\gamma^*_\lambda = \operatorname*{argmin}_{\gamma \in \Pi(\hat\mu_r,\hat\mu_c)} \langle M, \gamma \rangle_F - \frac{1}{\lambda}E(\gamma), \qquad (1)$$
where the matrix $M$ is computed using the Euclidean distance, i.e., $M_{ij} = \|x_i - y_j\|_2$. The elements of the resulting matrix $\gamma^*_\lambda$ provide us with the weights of associations between instances and features: similar instances and features correspond to higher values in $\gamma^*_\lambda$. Our intuition is to use these weights to identify the most similar sets of rows and columns that should be grouped together to form co-clusters.

Following (Benamou et al., 2015), this optimization problem can be equivalently rewritten in the following way:
$$\min_{\gamma \in \Pi(\hat\mu_r,\hat\mu_c)} \langle M, \gamma \rangle_F - \frac{1}{\lambda}E(\gamma) \ \Longleftrightarrow\ \min_{\gamma \in \Pi(\hat\mu_r,\hat\mu_c)} \operatorname{KL}(\gamma \,\|\, \xi_\lambda),$$
where $\xi_\lambda = e^{-\lambda M}$ is the Gibbs kernel. Finally, we can rewrite the last expression as follows:
$$\min_{\gamma \in \Pi(\hat\mu_r,\hat\mu_c)} \operatorname{KL}(\gamma \,\|\, \xi_\lambda) = \min_{\gamma \in \mathcal{C}} \operatorname{KL}(\gamma \,\|\, \xi_\lambda),$$
where $\mathcal{C} = \mathcal{C}_1 \cap \mathcal{C}_2$ is the intersection of the closed convex subsets given by $\mathcal{C}_1 = \{\gamma \in \mathbb{R}^{d\times d} \mid \gamma \mathbf{1} = \hat\mu_r\}$ and $\mathcal{C}_2 = \{\gamma \in \mathbb{R}^{d\times d} \mid \gamma^T \mathbf{1} = \hat\mu_c\}$.
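The alternating KL projections onto $\mathcal{C}_1$ and $\mathcal{C}_2$ correspond to the row and column scaling updates of the Sinkhorn-Knopp algorithm. The following minimal NumPy sketch is our own illustration of this computation between the row and column empirical measures of a square data matrix; the function name `sinkhorn`, the toy data, the cost rescaling and the stopping rule are assumptions of ours, not the authors' implementation.

```python
import numpy as np

def sinkhorn(M, mu_r, mu_c, lam, n_iter=1000, tol=1e-9):
    """Sinkhorn-Knopp scaling for entropy-regularized OT.  Returns the scaling
    vectors (alpha, beta) and the coupling gamma = diag(alpha) * xi * diag(beta),
    where xi = exp(-lam * M) is the Gibbs kernel."""
    xi = np.exp(-lam * M)
    alpha, beta = np.ones_like(mu_r), np.ones_like(mu_c)
    for _ in range(n_iter):
        alpha_new = mu_r / (xi @ beta)          # enforce the row marginal
        beta = mu_c / (xi.T @ alpha_new)        # enforce the column marginal
        if np.max(np.abs(alpha_new - alpha)) < tol:
            alpha = alpha_new
            break
        alpha = alpha_new
    return alpha, beta, alpha[:, None] * xi * beta[None, :]

# Toy usage: transport the rows of a square data matrix A onto its columns.
rng = np.random.default_rng(0)
A = rng.random((50, 50))
M = ((A[:, None, :] - A.T[None, :, :]) ** 2).sum(-1)   # squared Euclidean cost
M /= M.max()                                           # rescale the cost for numerical stability
mu_r = mu_c = np.full(50, 1.0 / 50)                    # uniform row/column measures
alpha, beta, gamma = sinkhorn(M, mu_r, mu_c, lam=10.0)
print(np.allclose(gamma.sum(axis=1), mu_r), np.allclose(gamma.sum(axis=0), mu_c))
```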
The solution of the entropy regularized optimal transport can be obtained using the Sinkhorn-Knopp algorithm and has the following form (Benamou et al., 2015):
$$\gamma^*_\lambda = \operatorname{diag}(\alpha)\,\xi_\lambda\,\operatorname{diag}(\beta), \qquad (2)$$
where $\alpha$ and $\beta$ are the scaling coefficients of the Gibbs kernel $\xi_\lambda$. In what follows, we show that under some plausible assumptions, we can interpret these two vectors as approximated row and column probability density functions.

### 3.3. Connection to variational inference

In order to justify our approach from the theoretical point of view, we first explain how the obtained solution $\gamma^*$ can be used for co-clustering. As mentioned in (Dhillon et al., 2003b) and later in (Banerjee et al., 2007), co-clustering can be seen as a density estimation problem where the goal is to approximate the real density $p(X, Y)$ by a simpler one, depending on the obtained co-clustering, in a way that minimizes the loss in mutual information given by $I(X, Y) - I(\hat X, \hat Y)$, where $I(X, Y) = \int_{X\times Y} p(x, y)\log\frac{p(x,y)}{p(x)p(y)}\,dx\,dy$ is the mutual information. This quantity is further shown to be equal to the Kullback-Leibler divergence between the original distribution $p(X, Y)$ and $q(X, Y)$, where the latter has the following form:
$$q(x, y) = p(y|\hat y)\,p(\hat x, \hat y)\,p(x|\hat x).$$
From this point, one may instantly see that the solution of the optimal transport problem $\gamma^*$ has a very similar form, as it also represents a joint probability distribution that approximates the original probability distribution $p(x, y)$, given by the Gibbs measure $\xi_\lambda$, and also factorizes into three terms. The most important difference, however, lies in the asymmetry of the KL divergence: while (Dhillon et al., 2003b) and (Banerjee et al., 2007) concentrate on minimizing $\operatorname{KL}(p(X, Y) \,\|\, q(X, Y))$, our idea is different and consists in minimizing $\operatorname{KL}(q(X, Y) \,\|\, p(X, Y))$. This approach is known in the literature as variational inference (Bishop, 2006) and exhibits a totally different behaviour compared to the minimization of $\operatorname{KL}(p(X, Y) \,\|\, q(X, Y))$. As shown by (Bishop, 2006), in variational inference the estimated distribution $q(X, Y)$ concentrates on the modes of the data and remains compact, while the minimizer of $\operatorname{KL}(p(X, Y) \,\|\, q(X, Y))$ tends to cover the whole surface of the original density and to overestimate its support.

As $X, Y$ and $\hat X, \hat Y$ represent the observed and unobserved variables, respectively, the natural goal is to try to estimate the distribution $p(X, Y \,|\, \hat X, \hat Y)$ of the data given the obtained co-clusters by the simpler variational distribution $q(X, Y)$. However, as the maximisation of $p(X, Y \,|\, \hat X, \hat Y)$ is computationally intractable, it is common to introduce a free distribution $q(\cdot, \cdot)$ on the parameters $\hat X$ and $\hat Y$ in order to obtain the following decomposition:
$$\log p(X, Y) = \mathcal{L}(q(\hat X, \hat Y)) + \operatorname{KL}(q(\hat X, \hat Y) \,\|\, p(\hat X, \hat Y \,|\, X, Y)),$$
where the lower bound
$$\mathcal{L}(q(\hat X, \hat Y)) = \int_{\hat x, \hat y} q(\hat x, \hat y)\log\frac{p(x, y, \hat x, \hat y)}{q(\hat x, \hat y)}\,d\hat x\,d\hat y$$
is maximized when the KL divergence is minimized. Now, if we assume that $p(\hat X, \hat Y \,|\, X, Y)$ follows the Gibbs distribution, i.e., $p(\hat X, \hat Y \,|\, X, Y) \propto e^{-\lambda M(x, y)}$, we can consider the original formulation of the regularized optimal transport as a variational inference problem:
$$\min_q \operatorname{KL}(q(\hat X, \hat Y) \,\|\, p(\hat X, \hat Y \,|\, X, Y)) = \min_\gamma \operatorname{KL}(\gamma \,\|\, \xi_\lambda),$$
where the optimal coupling $\gamma^*$ is equal to the estimated joint probability $q(\hat X, \hat Y)$.
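Tying the two views together: the coupling returned by the hypothetical `sinkhorn` sketch above factorizes exactly as in Eq. (2), and its scaling vectors are the objects that the following subsections interpret as approximate row and column densities and feed to the jump detection. The snippet below simply continues that toy example (reusing its variables `M`, `alpha`, `beta`, `gamma`) as a sanity check.

```python
import numpy as np

# Eq. (2): the Sinkhorn solution factorizes as diag(alpha) * xi_lambda * diag(beta).
xi = np.exp(-10.0 * M)                       # same rescaled cost and lambda as above
reassembled = np.diag(alpha) @ xi @ np.diag(beta)
print(np.allclose(reassembled, gamma))       # True

# alpha and beta act as approximate row/column densities; their sorted values are
# the piecewise-constant-looking profiles inspected by the jump detection (Sec. 3.5).
alpha_sorted, beta_sorted = np.sort(alpha), np.sort(beta)
```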
At this point, we know that the coupling matrix can be seen as an approximation to the original unknown posterior density function, but the question of how one can use it to obtain the clustering of rows and columns has not been answered yet. In order to solve the variational inference problem, it is usually assumed that the variables $\hat x$, $\hat y$ are independent and thus the variational distribution $q(\hat x, \hat y)$ factorizes as $q(\hat x, \hat y) = q(\hat x)q(\hat y)$. This assumption, however, goes against the whole idea of co-clustering, which relies on the existence of a deep connection between these two variables. To this end, we propose to consider a factorization of $q(\hat x, \hat y)$ of the following form:
$$q(\hat x, \hat y) = q(x)\,q(\hat x, \hat y \,|\, x, y)\,q(y).$$
This particular form follows the idea of structured stochastic variational inference proposed in (Hoffman & Blei, 2015), where a term depicting the conditional distribution between hidden and observed variables is added to the fully factorized traditional setting presented above. As stated in (Hoffman & Blei, 2015), this term allows arbitrary dependencies between observed and hidden variables, which can increase the fidelity of the approximation.

Following (Bishop, 2006), the optimal estimated densities $q(x)$ and $q(y)$ are controlled by the direction of the smallest variance of $p(x)$ and $p(y)$, respectively. Furthermore, $q(x)$ and $q(y)$ are proportional to the joint densities $p(\hat y, y)$ and $p(\hat x, x)$, i.e., $q(x) \propto p(\hat y, y)$ and $q(y) \propto p(\hat x, x)$. Bearing in mind the equivalence between $\gamma^*_\lambda$ and $q(\hat x, \hat y)$, this brings us to the following important conclusions: (1) the matrices $\operatorname{diag}(\alpha)$ and $\operatorname{diag}(\beta)$ can be seen as the approximated densities $p(\hat Y, Y)$ and $p(\hat X, X)$; (2) the vectors $\alpha$ and $\beta$ represent the approximated densities $p(X)$ and $p(Y)$ obtained by summing $X$ and $Y$ out of $p(\hat X, X)$ and $p(\hat Y, Y)$, respectively. According to (Laird, 1978), the non-parametric estimate of a mixing distribution is a piecewise step function where the number of steps depends on the number of components in the mixture. In cluster analysis, we can assume that $p(X)$ and $p(Y)$ consist of $g$ and $m$ components, respectively. Then, our goal is to detect these steps based on the estimates given by $\alpha$ and $\beta$ to obtain the desired partitions.

### 3.4. Kernelized version and Gromov-Wasserstein distance

In this part, we introduce the kernelized version of our method and compare it to the original formulation of our algorithm. In order to proceed, we first define two similarity matrices $K_r \in \mathbb{R}^{n\times n}$ and $K_c \in \mathbb{R}^{d\times d}$ associated to the empirical measures $\hat\mu_r$, $\hat\mu_c$, thus forming metric-measure spaces as in (Mémoli, 2011). The matrices $K_r$ and $K_c$ are defined by calculating the pairwise distances or similarities between rows and columns, respectively, without the restriction of them being positive or calculated based on a proper distance function satisfying the triangle inequality. The entropic Gromov-Wasserstein discrepancy in this case is defined as follows (Peyré et al., 2016):
$$GW(K_r, K_c, \hat\mu_r, \hat\mu_c) = \min_{\gamma \in \Pi(\hat\mu_r,\hat\mu_c)} \Gamma_{K_r,K_c}(\gamma) - \lambda E(\gamma) = \min_{\gamma \in \Pi(\hat\mu_r,\hat\mu_c)} \sum_{i,j,k,l} L\big(K_r(i,k), K_c(j,l)\big)\,\gamma_{i,j}\,\gamma_{k,l} - \lambda E(\gamma),$$
where $\gamma$ is a coupling matrix between the two similarity matrices and $L : \mathbb{R} \times \mathbb{R} \rightarrow \mathbb{R}_+$ is an arbitrary loss function, usually the quadratic loss or the Kullback-Leibler divergence.
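For intuition on how the $\Gamma$ term can be handled in practice, note that for the quadratic loss the four-index sum collapses into a few matrix products (Peyré et al., 2016). The sketch below is our own: the function name and the toy similarity matrices are illustrative assumptions, and the code only evaluates the entropic objective for a given coupling rather than implementing the full Gromov-Wasserstein solver.

```python
import numpy as np

def entropic_gw_objective(Kr, Kc, gamma, lam):
    """Entropic GW objective for the quadratic loss:
    sum_{i,j,k,l} (Kr[i,k] - Kc[j,l])**2 * gamma[i,j] * gamma[k,l] - lam * E(gamma),
    computed with the matrix-product decomposition of the quadratic loss."""
    p = gamma.sum(axis=1)                         # row marginal of the coupling
    q = gamma.sum(axis=0)                         # column marginal of the coupling
    const = p @ (Kr ** 2) @ p + q @ (Kc ** 2) @ q
    cross = np.sum(gamma * (Kr @ gamma @ Kc.T))
    entropy = -np.sum(gamma * np.log(gamma))      # assumes a strictly positive coupling
    return const - 2.0 * cross - lam * entropy

# Toy usage with similarity matrices built from a random data matrix A.
rng = np.random.default_rng(0)
A = rng.random((30, 20))
Kr = np.exp(-np.sum((A[:, None, :] - A[None, :, :]) ** 2, -1))      # row similarities
Kc = np.exp(-np.sum((A.T[:, None, :] - A.T[None, :, :]) ** 2, -1))  # column similarities
gamma0 = np.full((30, 20), 1.0 / (30 * 20))       # independent coupling as a starting point
print(entropic_gw_objective(Kr, Kc, gamma0, lam=1e-2))
```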
Based on this definition, one may define the problem of the entropic Gromov-Wasserstein barycenter for the similarity or distance matrices $K_r$ and $K_c$ as follows:
$$\min_{K, \gamma_r, \gamma_c} \sum_{i \in \{r,c\}} \varepsilon_i \big(\Gamma_{K, K_i}(\gamma_i) - \lambda E(\gamma_i)\big), \qquad (3)$$
where $K$ is the computed barycenter and $\gamma_r \in \Pi(\hat\mu, \hat\mu_r)$, $\gamma_c \in \Pi(\hat\mu, \hat\mu_c)$ are the coupling matrices that align it with $K_r$ and $K_c$, respectively. The $\varepsilon_i$ are weighting coefficients summing to one, i.e., $\sum_{i \in \{r,c\}} \varepsilon_i = 1$, that determine our interest in a more accurate alignment between $K_r$ and $K$ or between $K_c$ and $K$. The intuition behind this optimization procedure for co-clustering, with respect to the original formulation given in (1), is the following: while in (1) we align rows with columns directly, in (3) our goal is to do it via an intermediate representation given by the barycenter $K$ that is optimally aligned with both $K_r$ and $K_c$. In this case, we obtain the solutions $\gamma_r^*$ and $\gamma_c^*$ that, similar to (2), can be decomposed as follows:
$$\gamma_r^* = \operatorname{diag}(\alpha_r)\,\xi_r\,\operatorname{diag}(\beta_r), \qquad \gamma_c^* = \operatorname{diag}(\alpha_c)\,\xi_c\,\operatorname{diag}(\beta_c),$$
where $\xi_r = e^{-\lambda M_r}$ and $\xi_c = e^{-\lambda M_c}$ are the Gibbs kernels calculated between the barycenter and the row and column similarity matrices using any arbitrary loss function $L$, as explained before. Finally, based on the analysis presented above, we further use the vectors $\beta_r$ and $\beta_c$ to derive the row and column partitions.

### 3.5. Detecting the number of clusters

In order to detect the steps (or jumps) in the approximated marginals, we propose to adapt a procedure introduced in (Matei & Meignen, 2012) for multiscale denoising of piecewise smooth signals. This method is of particular interest for us as it determines the significant jumps in the vectors $\alpha$ and $\beta$ without knowing their number and location, and without a specific threshold to decide the significance of a jump. As the proposed procedure deals with non-decreasing functions, we first sort the values of $\alpha$ and $\beta$ in ascending order. Since the procedure is identical for both vectors, we only describe it for the vector $\alpha$.

We consider that the elements $\{\alpha_i\}_{i=1}^{n}$ of $\alpha$ are the local averages of a piecewise continuous function $v : [0, 1[\ \subset \mathbb{R} \rightarrow \mathbb{R}$ on the intervals $I_i^n = [i/n, (i+1)/n[$ defined by the uniform subdivision of step $1/n$ of the interval $[0, 1[$. More precisely, $\alpha_i^n = n\int_{I_i^n} v(t)\,dt$, $i = 0, \dots, n-1$. The detection strategy is based on the following cost function defined for each interval: $F(I_i^n) = \sum_{l=i-1}^{i} |\alpha_{l+1}^n - \alpha_l^n|$. Therefore, we get the list of intervals suspected to contain a jump for the subdivision of order $n$ as follows: $L^n = \{i \,;\, i = \operatorname{argmax}_i F(I_i^n)\}$. This detection should be refined in order to keep only the significant jumps in the vector $\alpha$. To this end, we use the multiscale representation of $\alpha$ as in (Harten, 1989) and we perform this detection on each scale. On the first scale, we get a coarse version of $\alpha$ by averaging: $\alpha_i^{n/2} = \frac{1}{2}(\alpha_{2i}^n + \alpha_{2i+1}^n)$, $i = 0, \dots, n/2 - 1$. Now, by considering the coarse version of $\alpha$, we obtain a second list $L^{n/2}$ of suspicious intervals as before. After that, these two lists are merged into the list $L_{\text{jumps}}$ as follows: a jump will be considered in the interval $I_{2i}^n$ or $I_{2i+1}^n$ if the interval $I_i^{n/2}$ is also detected as suspicious at the coarse scale. This procedure is iterated $\lfloor \log_2 n \rfloor$ times and a jump is observed if a chain of detections exists from fine to coarse scales. Finally, the number of clusters is obtained by $g = |L_{\text{jumps}}| + 1$.
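The sketch below illustrates the spirit of this multiscale detection in NumPy. It is a simplification rather than the authors' exact procedure: candidate intervals are selected with a simple prominence heuristic of ours (scores above mean plus k standard deviations of F) instead of the argmax rule, they are confirmed on coarse versions obtained by pairwise averaging, and consecutive surviving positions are merged into a single jump.

```python
import numpy as np

def detect_jumps(values, n_coarse_scales=1, k=2.0):
    """Simplified multiscale jump detector on a sorted vector (a sketch, not the
    authors' procedure).  A position is a candidate if its local total-variation
    score F(i) = |v[i] - v[i-1]| + |v[i+1] - v[i]| stands out; it is kept only if
    the corresponding interval is also suspicious at the coarser scales."""

    def suspicious(u):
        d = np.abs(np.diff(u))
        f = d[:-1] + d[1:]                      # score of positions 1 .. len(u)-2
        return set(np.where(f > f.mean() + k * f.std())[0] + 1)

    v = np.sort(np.asarray(values, dtype=float))
    jumps, u, factor = suspicious(v), v, 1
    for _ in range(n_coarse_scales):
        m = (len(u) // 2) * 2
        u = 0.5 * (u[:m:2] + u[1:m:2])          # coarse version by pairwise averaging
        factor *= 2
        coarse = suspicious(u)
        jumps = {j for j in jumps if j // factor in coarse}
    # merge runs of consecutive indices into single jump locations
    jumps = sorted(jumps)
    merged = [j for i, j in enumerate(jumps) if i == 0 or j != jumps[i - 1] + 1]
    return merged, len(merged) + 1              # jump locations, number of clusters

# Toy usage: three plateaus -> two jumps, i.e. g = 3 clusters.
rng = np.random.default_rng(0)
v = np.concatenate([np.full(40, 0.1), np.full(30, 0.5), np.full(30, 0.9)])
jumps, g = detect_jumps(v + 1e-3 * rng.standard_normal(100))
print(jumps, g)                                 # two jump locations near 40 and 70, g = 3
```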
### 3.6. Algorithmic implementation

We now briefly summarize the main steps of both the CCOT and CCOT-GW methods and discuss their peculiarities with respect to each other. The pseudocode of both approaches, in Matlab-like notation, is presented in Algorithm 1 and Algorithm 2, respectively.

**CCOT.** The first step of our algorithm consists in calculating the cost matrix $M$ and using it to obtain the optimal coupling matrix $\gamma^*_\lambda$ by applying the regularized optimal transport. In order to calculate $M$, row and column instances should both lie in a space of the same dimension. This condition, however, is verified only if the matrix $A$ is square, which rarely occurs in real-world applications. To overcome this issue, we first subsample the original data set $A$ in a way that allows us to equalize the number of rows and columns and to operate with two sets of the same dimension. If we assume that $n > d$, then this new reduced data set is denoted by $D \in \mathbb{R}^{d\times d}$. We repeat the sampling procedure until every individual is picked at least once.

The next step is to perform, for each $i = 1, \dots, n_s$, the jump detection on the sorted vectors $\alpha_i$ and $\beta_i$ to obtain two lists of jump locations, $L^{\alpha_i}_{\text{jumps}}$ and $L^{\beta_i}_{\text{jumps}}$, and to define the number of row and column clusters $g$ and $m$. By using them, we obtain the resulting row partition:
$$C_r^i(x_r) = \begin{cases} 1, & r \le L^{\alpha_i}_{\text{jumps}}(1), \\ k, & L^{\alpha_i}_{\text{jumps}}(k-1) < r \le L^{\alpha_i}_{\text{jumps}}(k), \\ |L^{\alpha_i}_{\text{jumps}}| + 1, & r > L^{\alpha_i}_{\text{jumps}}(|L^{\alpha_i}_{\text{jumps}}|). \end{cases}$$
The partition for columns $C_c^i(y_c)$ is obtained in the same way. Finally, we apply a majority vote over all sample partitions to obtain $C_r$ and $C_c$.

Regarding complexity, both the Sinkhorn-Knopp algorithm used to solve the regularized optimal transport (Knight, 2008) and the proposed jump detection technique are known to converge at a linear rate; multiplied by the number of samplings, this gives $O(n_s d)$. On the other hand, the calculation of the modes of the clusterings obtained on the generated samples, for both features and data instances, has complexity $O(n_s(n+d))$. In the end, the complexity of the whole algorithm is $O(n_s(n+d))$. We also note that in real-world applications we usually deal with scenarios where $n \gg d$ ("big data") or $d \gg n$ ("small data"), thus reducing the overall complexity to $O(n_s n)$ and $O(n_s d)$, respectively. This makes our approach even more computationally attractive.

Algorithm 1: Co-clustering through Optimal Transport (CCOT)
Input: A - data matrix, λ - regularization parameter, n_s - number of samplings
Output: C_r, C_c - partition matrices for rows and columns; g, m - number of row and column clusters
  [n, d] = size(A)
  for i = 1 to n_s do
    D_i = datasample(A, d)
    M_i ← pdist2(D_i, D_i')
    [α_i, β_i, γ*] ← optimal_transport(M_i, λ)
    [L^{α_i}_jumps, C_r^i, g] ← jump_detection(sort(α_i))
    [L^{β_i}_jumps, C_c^i, m] ← jump_detection(sort(β_i))
  end for
  C_r ← mode(C_r^i), C_c ← mode(C_c^i)
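For illustration, a compact Python rendering of the CCOT loop of Algorithm 1 could look as follows. It reuses the hypothetical `sinkhorn` and `detect_jumps` helpers sketched earlier and glosses over details of the authors' implementation; in particular, it merely assumes that the number of samplings is large enough for every row to be drawn at least once.

```python
import numpy as np

def majority(votes):
    """Majority vote over the non-missing (>= 0) labels of one object."""
    votes = votes[votes >= 0]
    return int(np.bincount(votes).argmax())

def labels_from_jumps(v):
    """Split the sorted order of v at the detected jump positions to get labels."""
    order = np.argsort(v)
    jumps, _ = detect_jumps(v)                    # helper sketched in Section 3.5
    labels = np.empty(len(v), dtype=int)
    start = 0
    for lab, end in enumerate(list(jumps) + [len(v)]):
        labels[order[start:end]] = lab            # cluster = block between two jumps
        start = end
    return labels

def ccot(A, lam=10.0, n_samplings=100, rng=None):
    """Sketch of the CCOT loop (Algorithm 1), assuming n >= d: subsample rows to a
    square matrix, run entropy-regularized OT between rows and columns, detect jumps
    in the sorted scaling vectors and aggregate the partitions by majority vote."""
    rng = rng or np.random.default_rng(0)
    n, d = A.shape
    mu = np.full(d, 1.0 / d)                      # uniform row/column measures
    row_votes = np.full((n_samplings, n), -1)     # -1 marks rows absent from a sampling
    col_votes = np.empty((n_samplings, d), dtype=int)
    for s in range(n_samplings):
        idx = rng.choice(n, size=d, replace=False)
        D = A[idx]
        M = ((D[:, None, :] - D.T[None, :, :]) ** 2).sum(-1)
        M /= M.max()                              # rescale the cost for stability
        alpha, beta, _ = sinkhorn(M, mu, mu, lam) # helper sketched in Section 2.2
        row_votes[s, idx] = labels_from_jumps(alpha)
        col_votes[s] = labels_from_jumps(beta)
    C_r = np.array([majority(row_votes[:, i]) for i in range(n)])
    C_c = np.array([majority(col_votes[:, j]) for j in range(d)])
    return C_r, C_c
```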
**CCOT-GW.** As can be seen from Algorithm 2, CCOT-GW overcomes an important disadvantage of CCOT, namely the need to perform sampling in order to cluster all data objects. On the other hand, the computational complexity of CCOT is only $O(n_s d)$, while for CCOT-GW it scales as $O(n^2 d + d^2 n)$. We also note that CCOT-GW offers great flexibility in terms of the possible data representations used as its input. One may easily consider using any arbitrary kernel function to calculate the similarity matrices, or even learn them beforehand using multiple-kernel learning approaches.

Algorithm 2: Co-clustering through Optimal Transport with Gromov-Wasserstein barycenters (CCOT-GW)
Input: A - data matrix, λ - regularization parameter, ε_r, ε_c - weights for barycenter calculation
Output: C_r, C_c - partition matrices for rows and columns; g, m - number of row and column clusters
  K_r ← pdist2(A, A)
  K_c ← pdist2(A', A')
  [β_r, β_c, γ_r*, γ_c*] ← gw_barycenter(K_r, K_c, λ, ε_r, ε_c)
  [L^{β_r}_jumps, C_r, g] ← jump_detection(sort(β_r))
  [L^{β_c}_jumps, C_c, m] ← jump_detection(sort(β_c))

## 4. Experimental evaluations

In this section, we provide an empirical evaluation of the proposed algorithms.

### 4.1. Synthetic data

**Simulation setting.** We simulate data following the generative process of the Gaussian Latent Block Models (for details see (Govaert & Nadif, 2013)) and we consider four scenarios with different numbers of co-clusters, degrees of separation and sizes. Table 1 and Figure 1 present the characteristics of the simulated data sets and their visualization showing the different co-clustering structures.

Table 1. Size (n × d), number of co-clusters (g × m), degree of overlapping ([+] for well-separated and [++] for ill-separated co-clusters) and the proportions of co-clusters for the simulated data sets.

| Data set | n × d | g × m | Overlapping | Proportions |
|----------|-----------|-------|-------------|-------------|
| D1 | 600 × 300 | 3 × 3 | [+] | Equal |
| D2 | 600 × 300 | 3 × 3 | [+] | Unequal |
| D3 | 300 × 200 | 2 × 4 | [++] | Equal |
| D4 | 300 × 300 | 5 × 4 | [++] | Unequal |

Figure 1. D2, D3 and D4 reorganized w.r.t. the true partitions.

We use several state-of-the-art co-clustering algorithms as baselines, including ITCC (Dhillon et al., 2003b), Double K-Means (DKM) (Rocci & Vichi, 2008), Orthogonal Nonnegative Matrix Tri-Factorizations (ONTMF) (Ding et al., 2006), the Gaussian Latent Block Models (GLBM) (Nadif & Govaert, 2008; Govaert & Nadif, 2013) and Residual Bayesian Co-Clustering (RBC) (Shan & Banerjee, 2010). We also report the results of K-means and NMF, run on both modes of the data matrix, as clustering baselines.

Table 2. Mean (± standard deviation) of the co-clustering error (CCE) obtained for all configurations. "–" indicates that the algorithm cannot find a partition with the requested number of co-clusters. P-values obtained using the non-parametric Wilcoxon test (Wilcoxon, 1945) that imply significant differences are printed in bold (significance level of 0.05). The algorithms are, in order: K-means, NMF, DKM, Tri-NMF, GLBM, ITCC, RBC, CCOT and CCOT-GW.
- D1: .018±.003, .042±.037, .025±.048, .082±.063, .021±.011, .021±.001, .017±.045, .018±.013, .004±.002
- D2: .072±.044, .083±.063, .038±.000, .052±.065, .032±.041, .047±.042, .039±.052, .023±.036, .011±.056
- D3 (four of the algorithms return "–"): .310±.000, .262±.022, .241±.031, .031±.027, .008±.001
- D4 (two of the algorithms return "–"): .126±.038, .145±.082, .115±.047, .121±.075, .102±.071, .093±.032, .079±.031

To assess the performance of all compared methods, we compute the co-clustering error (CCE) (Patrikainen & Meila, 2006) defined as follows:
$$CCE\big((z, w), (\hat z, \hat w)\big) = e(z, \hat z) + e(w, \hat w) - e(z, \hat z)\,e(w, \hat w),$$
where $\hat z$ and $\hat w$ are the partitions of instances and variables estimated by the algorithm, $z$ and $w$ are the true partitions, and $e(z, \hat z)$ (resp. $e(w, \hat w)$) denotes the error rate, i.e., the proportion of misclassified instances (resp. features). For all configurations, we generate 100 data sets and compute the mean and standard deviation of the CCE over all sets. As all the approaches we compare with are very sensitive to initialization, we run them 50 times with random initializations and retain the best result according to the corresponding criterion. RBC is initialized with K-means.
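As an aside, the co-clustering error defined above can be computed as in the sketch below. Here the error rate e(·,·) is taken to be the misclassification rate after the best one-to-one matching of predicted to true cluster labels, computed with the Hungarian algorithm; this is a standard convention but an assumption on our part, as the paper does not spell it out.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def error_rate(true, pred):
    """Misclassification rate after the best one-to-one matching of labels."""
    true, pred = np.asarray(true), np.asarray(pred)
    k = max(true.max(), pred.max()) + 1
    confusion = np.zeros((k, k), dtype=int)
    for t, p in zip(true, pred):
        confusion[t, p] += 1
    rows, cols = linear_sum_assignment(-confusion)      # maximize matched counts
    return 1.0 - confusion[rows, cols].sum() / len(true)

def cce(z_true, w_true, z_pred, w_pred):
    """Co-clustering error: e(z, z^) + e(w, w^) - e(z, z^) * e(w, w^)."""
    ez, ew = error_rate(z_true, z_pred), error_rate(w_true, w_pred)
    return ez + ew - ez * ew

# Toy usage: a perfect (relabelled) row partition and one mislabelled column.
print(cce([0, 0, 1, 1], [0, 1, 1], [1, 1, 0, 0], [0, 0, 1]))   # -> 1/3
```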
Regarding CCOT, we set $n_s$ to 1000 for all configurations except D4, which has the same number of rows and columns and therefore does not require any sampling. For CCOT-GW, we use Gaussian kernels for both rows and columns, with $\sigma$ computed as the mean of all pairwise Euclidean distances between vectors (Kar & Jain, 2011). Finally, we let both CCOT and CCOT-GW automatically detect the number of co-clusters, while for all other algorithms we set the number of clusters to its true value.

**Co-clustering performance.** We report the mean (and standard deviation) of the co-clustering errors obtained in Table 2. Based on these results, we observe that on D1, which has a clear block structure, all algorithms perform equally well; however, CCOT-GW gives the best results, closely followed by CCOT and K-means. Regarding D2, D3 and D4, which have a more complicated structure than D1, both CCOT and CCOT-GW significantly outperform all other algorithms, and this difference is all the more important on D3 and D4, where some of the compared algorithms are unable to find a partition with the desired number of clusters.

Furthermore, we argued that one of the strengths of our method is its ability to automatically detect the number of co-clusters by applying a jump detection algorithm on $\alpha$ and $\beta$. From Figure 2 one can observe that the plots of these vectors, obtained with CCOT, with their elements sorted in ascending order, reveal clear steps that correspond to the correct number of clusters and also illustrate their proportions and the degree of overlapping. The same observation is valid for CCOT-GW. Both approaches correctly identified the number of clusters in most cases, and CCOT is slightly more accurate than CCOT-GW when the proportions of co-clusters are unbalanced.

Figure 2. Vectors (a) α and (b) β obtained with CCOT on D3. (c) Number of times CCOT and CCOT-GW correctly detect the number of co-clusters (g and m) over 100 trials:

| Data set | CCOT g | CCOT m | CCOT-GW g | CCOT-GW m |
|----------|--------|--------|-----------|-----------|
| D1 | 100 | 100 | 98 | 100 |
| D2 | 83 | 97 | 81 | 72 |
| D3 | 99 | 98 | 93 | 97 |
| D4 | 73 | 86 | 70 | 87 |

To summarize, CCOT and CCOT-GW outperform all the other baselines for the considered data structures and present two important advantages: (1) they do not suffer from initialization issues, and (2) they are able to automatically detect the number of co-clusters.

### 4.2. MovieLens

**Data and setting.** MOVIELENS-100K² is a popular benchmark data set that consists of user-movie ratings, on a scale of one to five, collected from a movie recommendation service and gathering 100,000 ratings from 943 users on 1682 movies. In the context of co-clustering, our goal is to find homogeneous subgroups of users and films in order to further recommend previously unseen movies that were highly rated by the users from the same group. We set the regularization parameters for CCOT and CCOT-GW using cross-validation; the number of samplings for CCOT is set to 500 (as the dimensions of the data set are quite balanced); the weights for the barycenter in CCOT-GW are set to ε = (0.5, 0.5).

²https://grouplens.org/datasets/movielens/100k/

**Results.** In what follows, we only present figures and results obtained by CCOT-GW, as both algorithms return the same number of blocks and the partitions are almost identical (with a normalized mutual information between partitions above 0.8). CCOT-GW automatically detects a structure consisting of 9 × 15 blocks, which corresponds to 9 user clusters and 15 movie clusters.
From Figure 3, one can observe that the users and the movies are almost equally distributed across the clusters, except for two user clusters and three movie clusters which are larger than the others.

Figure 3. Distribution of the number of (a) users and (b) movies across the clusters obtained with CCOT-GW.

Figure 4 shows the original data set as well as a summarized version where each block is represented by its mean rating value (the lighter the block, the higher the ratings), revealing a structure of homogeneous groups. One can observe that the first movie cluster consists of films for which all users agree on giving high ratings (the most popular movies), while the last movie cluster consists of the movies with very low ratings. We also report the 5 best rated movies in those two clusters in Table 3. One can easily see that popular movies, such as both Star Wars episodes, are in M1, while M15 is composed of movies that were less critically acclaimed.

Figure 4. (a) MOVIELENS matrix; (b) the matrix summarized by the mean rating (0 ratings are excluded) for each block obtained with CCOT-GW. Darker shades indicate lower values.

Table 3. Top 5 movies in clusters M1 and M15.

| M1 | M15 |
|----|-----|
| Star Wars (1977) | Amityville: A New Generation (1993) |
| The Lion King (1994) | Amityville: It's About Time (1992) |
| Return of the Jedi (1983) | Ninjas: High Noon at Mega Mountain (1998) |
| Contact (1997) | Sudden Manhattan (1996) |
| Raiders of the Lost Ark (1981) | Dream Man (1995) |

We can make similar observations for the interpretation of the user clusters. For instance, the last two user clusters include users that tend to give lower ratings to movies than the average population. Also, we note that block (6, 10) corresponds to users who liked movies from M10 better than the rest of the users did. These observations are also very similar to the results reported by (Banerjee et al., 2007), where the authors proposed a detailed study of a 10 × 20 block structure for this data set. Additional results can be found in the Supplementary material.

## 5. Conclusions and future perspectives

In this paper, we presented a novel approach for co-clustering based on entropy regularized optimal transport. Our method is fundamentally different from other co-clustering methods and consists in finding a probabilistic coupling of the empirical measures defined based on the data instances and features. We showed how this procedure can be seen as a variational inference problem and that the inferred distribution can be used to obtain the row and column partitions. The resulting algorithm is not only more accurate than other state-of-the-art methods but is also fast and capable of automatically detecting the number of co-clusters. We also presented an extended version of our algorithm that makes use of the optimal transportation distance defined on the similarity matrices associated to the row and column empirical measures.

In the future, our work can be continued in multiple directions. First, we would like to extend our method in order to deal with the online setting, where the goal is to classify a new, previously unseen observation without redoing the co-clustering of the data set that includes it. This can be done using a recent approach proposed in (Perrot et al., 2016) that allows one to update the learned coupling matrix using out-of-sample observations without recomputing it on all the data.
We believe that this extension will make our algorithm attractive for use in real-time industrial recommendation systems due to its computational efficiency. We would also like to study the generalization properties of our algorithm in a spirit similar to the results obtained in (Maurer & Pontil, 2010). This latter work presents a rare case where generalization bounds are derived for some well-known unsupervised learning algorithms.

## Acknowledgements

This work has been supported by the ANR project COCLICO, ANR-12-MONU-0001.

## References

Banerjee, Arindam, Dhillon, Inderjit, Ghosh, Joydeep, Merugu, Srujana, and Modha, Dharmendra S. A generalized maximum entropy approach to Bregman co-clustering and matrix approximation. Journal of Machine Learning Research, 8:1919–1986, December 2007.

Benamou, Jean-David, Carlier, Guillaume, Cuturi, Marco, Nenna, Luca, and Peyré, Gabriel. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 2(37):A1111–A1138, 2015.

Bishop, Christopher M. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., 2006.

Cheng, Kin-On, Law, Ngai-Fong, Siu, Wan-Chi, and Liew, Alan W. Identification of coherent patterns in gene expression data using an efficient biclustering algorithm and parallel coordinate visualization. BMC Bioinformatics, 9:210, 2008.

Courty, Nicolas, Flamary, Rémi, and Tuia, Devis. Domain adaptation with regularized optimal transport. In Proceedings ECML/PKDD 2014, pp. 1–16, 2014.

Cuturi, Marco. Sinkhorn distances: Lightspeed computation of optimal transport. In Proceedings NIPS, pp. 2292–2300, 2013.

Cuturi, Marco and Avis, David. Ground metric learning. Journal of Machine Learning Research, 15(1):533–564, 2014.

Cuturi, Marco and Doucet, Arnaud. Fast computation of Wasserstein barycenters. In Proceedings ICML, pp. 685–693, 2014.

Deodhar, M. and Ghosh, J. SCOAL: A framework for simultaneous co-clustering and learning from complex data. ACM Transactions on Knowledge Discovery from Data, 4(3):1–31, 2010.

Dhillon, Inderjit S., Mallela, Subramanyam, and Kumar, Rahul. A divisive information theoretic feature clustering algorithm for text classification. Journal of Machine Learning Research, 3:1265–1287, 2003a.

Dhillon, Inderjit S., Mallela, Subramanyam, and Modha, Dharmendra S. Information-theoretic co-clustering. In Proceedings ACM SIGKDD, pp. 89–98, 2003b.

Ding, C., Li, T., Peng, W., and Park, H. Orthogonal nonnegative matrix tri-factorizations for clustering. In Proceedings ACM SIGKDD, pp. 126–135, 2006.

George, T. and Merugu, S. A scalable collaborative filtering framework based on co-clustering. In Proceedings ICDM, pp. 625–628, 2005.

Govaert, G. and Nadif, M. Co-clustering. John Wiley & Sons, 2013.

Hanisch, Daniel, Zien, Alexander, Zimmer, Ralf, and Lengauer, Thomas. Co-clustering of biological networks and gene expression data. BMC Bioinformatics, 18(suppl 1):145–154, 2002.

Harten, Amiram. ENO schemes with subcell resolution. Journal of Computational Physics, 83:148–184, 1989.

Hartigan, J. A. Direct clustering of a data matrix. Journal of the American Statistical Association, 67(337):123–129, 1972.

Hoffman, Matthew D. and Blei, David M. Stochastic structured variational inference. In Proceedings AISTATS, volume 38, pp. 361–369, 2015.
Abraham, I., Abraham, R., Bergounioux, M., and Carlier, G. Tomographic reconstruction from a few views: a multi-marginal optimal transport approach. Applied Mathematics and Optimization, pp. 1–19, 2016.

Irpino, Antonio, Verde, Rosanna, and De Carvalho, Francisco de A.T. Dynamic clustering of histogram data based on adaptive squared Wasserstein distances. Expert Systems with Applications, 41(7):3351–3366, 2014.

Kantorovich, Leonid. On the translocation of masses. In C.R. (Doklady) Acad. Sci. URSS (N.S.), volume 37(10), pp. 199–201, 1942.

Kar, Purushottam and Jain, Prateek. Similarity-based learning via data driven embeddings. In NIPS, pp. 1998–2006, 2011.

Keribin, Christine, Brault, Vincent, Celeux, Gilles, and Govaert, Gérard. Estimation and selection for the latent block model on categorical data. Statistics and Computing, 25(6):1201–1216, 2015.

Knight, Philip A. The Sinkhorn-Knopp algorithm: Convergence and applications. SIAM Journal on Matrix Analysis and Applications, 30(1):261–275, March 2008.

Laird, N. Nonparametric maximum likelihood estimation of a mixing distribution. Journal of the American Statistical Association, 73:805–811, 1978.

Matei, Basarab and Meignen, Sylvain. Nonlinear cell-average multiscale signal representations: Application to signal denoising. Signal Processing, 92:2738–2746, 2012.

Maurer, Augusto and Pontil, Massimiliano. K-dimensional coding schemes in Hilbert spaces. IEEE Transactions on Information Theory, 56(11):5839–5846, 2010.

Mémoli, Facundo. Gromov-Wasserstein distances and the metric approach to object matching. Foundations of Computational Mathematics, 11(4):417–487, 2011.

Mirkin, Boris Grigorievitch. Mathematical Classification and Clustering. Nonconvex Optimization and its Applications. Kluwer Academic Publishers, Dordrecht, Boston, London, 1996.

Monge, Gaspard. Mémoire sur la théorie des déblais et des remblais. Histoire de l'Académie Royale des Sciences, pp. 666–704, 1781.

Nadif, M. and Govaert, G. Algorithms for model-based block Gaussian clustering. In DMIN'08, the 2008 International Conference on Data Mining, 2008.

Patrikainen, A. and Meila, M. Comparing subspace clusterings. IEEE Transactions on Knowledge and Data Engineering, 18(7):902–916, July 2006.

Perrot, Michaël, Courty, Nicolas, Flamary, Rémi, and Habrard, Amaury. Mapping estimation for discrete optimal transport. In NIPS, pp. 4197–4205, 2016.

Peyré, Gabriel, Cuturi, Marco, and Solomon, Justin. Gromov-Wasserstein averaging of kernel and distance matrices. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pp. 2664–2672, 2016.

Rabin, Julien, Peyré, Gabriel, Delon, Julie, and Bernot, Marc. Wasserstein barycenter and its application to texture mixing. In Proceedings SSVM, volume 6667, pp. 435–446, 2011.

Rocci, R. and Vichi, M. Two-mode multi-partitioning. Computational Statistics and Data Analysis, 52(4):1984–2003, 2008.

Rubner, Yossi, Tomasi, Carlo, and Guibas, Leonidas J. The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision, 40(2):99–121, 2000.

Schrödinger, E. Über die Umkehrung der Naturgesetze. Sitzungsberichte Preuss. Akad. Wiss. Berlin. Phys. Math., 144:144–153, 1931.

Shan, Hanhuai and Banerjee, Arindam. Residual Bayesian co-clustering for matrix approximation. In Proceedings of the SIAM International Conference on Data Mining, SDM 2010, April 29 - May 1, 2010, Columbus, Ohio, USA, pp. 223–234, 2010.

Sinkhorn, Richard and Knopp, Paul. Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21:343–348, 1967.
Villani, Cédric. Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer, Berlin, 2009.

Wang, Pu, Domeniconi, Carlotta, and Laskey, Kathryn. Latent Dirichlet Bayesian co-clustering. In Machine Learning and Knowledge Discovery in Databases, pp. 522–537, 2009.

Wilcoxon, F. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83, December 1945.

Wyse, Jason and Friel, Nial. Block clustering with collapsed latent block models. Statistics and Computing, 22(2):415–428, 2012.

Wyse, Jason, Friel, Nial, and Latouche, Pierre. Inferring structure in bipartite networks using the latent block model and exact ICL. arXiv e-prints, 2014.

Xu, Bin, Bu, Jiajun, Chen, Chun, and Cai, Deng. An exploration of improving collaborative recommender systems via user-item subgroups. In Proceedings WWW, pp. 21–30, 2012.