# Efficient and Effective Optimal Transport-Based Biclustering

Chakib Fettal, Centre Borelli UMR 9010, Université Paris Cité and Informatique Caisse des Dépôts et Consignations (chakib.fettal@etu.u-paris.fr)
Lazhar Labiod, Centre Borelli UMR 9010, Université Paris Cité (lazhar.labiod@u-paris.fr)
Mohamed Nadif, Centre Borelli UMR 9010, Université Paris Cité (mohamed.nadif@u-paris.fr)

**Abstract.** Bipartite graphs can be used to model a wide variety of dyadic information such as user-rating, document-term, and gene-disorder pairs. Biclustering is an extension of clustering to the underlying bipartite graph induced from this kind of data. In this paper, we leverage optimal transport (OT), which has gained momentum in the machine learning community, to propose a novel and scalable biclustering model that generalizes several classical biclustering approaches. We perform extensive experimentation to show the validity of our approach compared to other OT biclustering algorithms along both dimensions of the dyadic datasets.

## 1 Introduction

Let $G = (U, V, E)$ be a bipartite graph, i.e., a graph whose vertices can be divided into two disjoint sets $U = \{1, 2, \ldots, |U|\}$ with $|U| = n$ and $V = \{1, 2, \ldots, |V|\}$ with $|V| = d$, together with a set of edges $E$, each of which connects a vertex of $U$ to a vertex of $V$. The adjacency matrix of such a graph has the structure

$$\mathbf{A} = \begin{pmatrix} \mathbf{0}_{n \times n} & \mathbf{B} \\ \mathbf{B}^\top & \mathbf{0}_{d \times d} \end{pmatrix} \qquad (1)$$

where $\mathbf{B}$, of size $n \times d$, is called the biadjacency matrix of $G$; its rows and columns correspond to the two sets of vertices, and each entry represents an edge between a row and a column. Biclustering (or co-clustering) is the extension of clustering to this type of graph. Following [21], several biclustering models have attempted to solve the problem by viewing $\mathbf{B}$ as a two-mode matrix and searching for a simultaneous partition of its rows and columns [9]. In this way, biclustering seeks to reveal subsets of $U$ that exhibit a similar behaviour across a subset of $V$ in matrix $\mathbf{B}$.
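As a quick illustration of this block structure, here is a minimal sketch (ours, not from the paper) that assembles $\mathbf{A}$ from a toy biadjacency matrix using SciPy sparse matrices:

```python
import numpy as np
from scipy import sparse

def bipartite_adjacency(B: sparse.spmatrix) -> sparse.spmatrix:
    """Assemble the (n+d) x (n+d) adjacency matrix A of the bipartite
    graph from its n x d biadjacency matrix B, as in equation (1)."""
    return sparse.bmat([[None, B], [B.T, None]], format="csr")

# Toy document-term matrix: 3 documents, 4 terms.
B = sparse.csr_matrix(np.array([[1, 0, 2, 0],
                                [0, 3, 0, 1],
                                [1, 1, 0, 0]]))
A = bipartite_adjacency(B)
assert A.shape == (7, 7)
```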
Biclustering has been used in a number of different contexts. [12] used microarray data to find relations between genes and conditions, finding that genes with similar functions often cluster together. [20] applied this paradigm to data from the US Food and Drug Administration reporting system in order to identify groups of drugs with adverse effects. [11] used it to find market segments among tourists so as to enable more effective targeted marketing. There have been various other applications [9, 33, 19].

Several solutions to the biclustering problem have been proposed in the literature (see [17]). [10] used an information-theoretic approach to solve the problem by minimizing the difference in mutual information between $\mathbf{B}$ and a summary matrix; they implicitly assume that the data points are generated from a Poisson latent block model [18]. [3] adapted classical modularity to bipartite networks and then used it to identify modules within them. [35] proposed a biclustering paradigm based on nonnegative matrix tri-factorization of the biadjacency matrix. Recently, optimal transport (OT) has taken the machine learning community by storm. OT has helped to solve a variety of data mining problems, and biclustering is no exception. [25] proposed two models for biclustering: a first model, CCOT, which performs co-clustering based on the scaling vectors obtained by applying the Sinkhorn-Knopp algorithm to a square subsampled version of matrix $\mathbf{B}$, and a second model, CCOT-GW, which uses scaling vectors obtained by computing entropic Gromov-Wasserstein barycenters and does not require subsampling. Then came [34], where the authors performed biclustering by minimizing a new metric, COOT, which generalizes the Gromov-Wasserstein distance between $\mathbf{B}$ and a summary matrix, similarly to what was done in [10]. More specifically, they proposed two new metrics: COOT, together with its entropically regularized counterpart COOTλ.

However, both [25] and [34] have certain drawbacks. First, neither algorithm tackles biclustering from the outset: the co-clusters are only deduced at convergence, so biclustering is a by-product rather than the main goal. Second, they suffer from high computational complexity, and CCOT and CCOT-GW also consume large amounts of memory. Finally, we will see that these algorithms are not suited to sparse dyadic data.

In this paper, while integrating the biclustering objective from the beginning, we propose a generic framework for biclustering through optimal transport, which generalizes some previous biclustering approaches. We propose two efficient methods for solving this problem: one that gives an almost-hard biclustering, and another that gives a fuzzy or soft biclustering through entropic regularization. These methods outperform other optimal transport biclustering models, in terms of both document and term clustering, on several regular and large-scale datasets, while being more computationally and memory efficient. We emphasize once again that the approach we propose is specifically tailored to datasets consisting of dyadic data.

## 2 Methodology

**Notations.** In what follows, $\Delta_n = \{\mathbf{p} \in \mathbb{R}^n_+ \mid \sum_{i=1}^n p_i = 1\}$ denotes the $n$-dimensional standard simplex. $\Pi(\mathbf{w}, \mathbf{v}) = \{\mathbf{Z} \in \mathbb{R}^{n \times k}_+ \mid \mathbf{Z}\mathbf{1} = \mathbf{w},\ \mathbf{Z}^\top\mathbf{1} = \mathbf{v}\}$ denotes the transportation polytope, where $\mathbf{w} \in \Delta_n$ and $\mathbf{v} \in \Delta_k$ are the marginals of the joint distribution $\mathbf{Z}$, and $\mathbf{1}_n$ is a vector of ones. Matrices are denoted with uppercase boldface letters and vectors with lowercase boldface letters. For a matrix $\mathbf{M}$, its $i$-th row is $\mathbf{m}_i$ and its $j$-th column is $\mathbf{m}_{\cdot j}$. $\|\cdot\|_0$ is the 0-norm, which returns the number of nonzero elements of its argument.

### 2.1 Preliminaries

We first introduce exact discrete OT and its entropically regularized counterpart, and show how biclustering can be posed as an integer program.

**Discrete OT as a linear program.** The goal of discrete optimal transport is to find a minimal-cost transport plan between a source probability distribution $\mathbf{w}$ and a target distribution $\mathbf{v}$. Here we are interested in the discrete case of the Kantorovich formulation of OT, that is,

$$\mathrm{OT}(\mathbf{M}, \mathbf{w}, \mathbf{v}) \triangleq \min_{\mathbf{Z} \in \Pi(\mathbf{w}, \mathbf{v})} \langle \mathbf{M}, \mathbf{Z} \rangle \qquad (2)$$

where $\mathbf{M} \in \mathbb{R}^{n \times k}$ is the cost matrix, and $m_{ij}$ quantifies the effort needed to transport a probability mass from $w_i$ to $v_j$.

**Discrete entropy-regularized OT.** It has been suggested in the literature [6, 5] that a regularization such as entropic regularization can lead to better computational and statistical efficiency:

$$\mathrm{OT}_\lambda(\mathbf{M}, \mathbf{w}, \mathbf{v}) \triangleq \min_{\mathbf{Z} \in \Pi(\mathbf{w}, \mathbf{v})} \langle \mathbf{M}, \mathbf{Z} \rangle - \lambda H(\mathbf{Z}) \qquad (3)$$

where $H$ is the entropy, defined as $H(\mathbf{Z}) = -\sum_{i,j} z_{ij} \log z_{ij}$, and $\lambda$ controls the strength of the regularization.
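Both problems can be solved with off-the-shelf solvers; the experiments in Section 5 use the POT package [15]. As a point of reference, a minimal sketch (variable names are ours) of solving (2) and (3) on random data:

```python
import numpy as np
import ot  # POT: Python Optimal Transport [15]

rng = np.random.default_rng(0)
n, k = 5, 3
M = rng.random((n, k))                 # cost matrix
w = np.full(n, 1 / n)                  # source marginal, on the simplex
v = np.full(k, 1 / k)                  # target marginal

Z_exact = ot.emd(w, v, M)              # exact plan, problem (2)
Z_reg = ot.sinkhorn(w, v, M, reg=0.1)  # entropic plan, problem (3)

assert np.allclose(Z_exact.sum(axis=1), w)  # rows sum to w
assert np.allclose(Z_reg.sum(axis=0), v)    # columns sum to v
```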
The computational efficiency comes from the fact that the unique solution of this problem has the structure $\mathbf{Z}^\star = \mathrm{diag}(\mathbf{a}) \exp(-\mathbf{M}/\lambda)\,\mathrm{diag}(\mathbf{b})$, a rescaled elementwise negative exponential of the cost $\mathbf{M}$, where $\mathbf{a}$ and $\mathbf{b}$ are scaling vectors. These vectors can be found efficiently using the Sinkhorn-Knopp algorithm.

**Biclustering as an integer program.** The block seriation problem [27] consists in finding two permutation matrices, one for the rows and one for the columns, such that dense blocks appear along the diagonal of the permuted matrix. A possible definition of the block seriation problem is as follows: given a matrix $\mathbf{B} \in \mathbb{R}^{n \times d}$ such that $b_{ij}$ gives the strength of the association between row $i$ and column $j$ (as in the case of a biadjacency matrix, for example), solve

$$\max_{\mathbf{C}} \sum_{i,j} b_{ij} c_{ij} \qquad (4)$$

subject to $c_{ij} \in \{0, 1\}$ for all $i, j$, and, for all $i, i', j, j'$,

$$c_{ij} + c_{i'j} + c_{i'j'} - c_{ij'} \le 2, \qquad c_{ij'} + c_{i'j'} + c_{i'j} - c_{ij} \le 2,$$
$$c_{i'j'} + c_{ij'} + c_{ij} - c_{i'j} \le 2, \qquad c_{i'j} + c_{ij} + c_{ij'} - c_{i'j'} \le 2.$$

A solution $\mathbf{C}$ is a block diagonal matrix up to a permutation of its rows and columns. The block seriation problem is an integer programming problem that is NP-hard. One approach for solving it uses a simplified version where a rank constraint $\mathrm{rank}(\mathbf{C}) \le k$ is added, with $k$ the number of desired biclusters. Integrating this constraint into (4), we can define a new problem through a low-rank factorization of $\mathbf{C}$, i.e., $\mathbf{C} = \mathbf{Z}\mathbf{W}^\top$, which we formulate as

$$\max_{\mathbf{Z} \in \Gamma(n,k),\ \mathbf{W} \in \Gamma(d,k)} \sum_{i,j,h} b_{ij} z_{ih} w_{jh} \qquad (5)$$

where $\Gamma(n, k) = \{\mathbf{Z} \in \{0,1\}^{n \times k} \mid \mathbf{Z}\mathbf{1} = \mathbf{1}\}$ is the set of hard partitions of dimension $n \times k$. A simple heuristic for solving this problem involves alternately solving for $\mathbf{Z}$ given $\mathbf{W}$, and vice versa, using classical clustering algorithms, before identifying biclusters through the rearranged matrix $\mathbf{C}$, which displays a block diagonal structure, as shown in Figure 1a. The biclusters are identified by grouping together the rows and columns that form a block along the diagonal.

### 2.2 Biclustering using Optimal Transport

Here we propose a new biclustering problem based on block seriation and optimal transport. For this purpose we first define what we term an anti-adjacency matrix. Note that a similar concept has been discussed in [36].

**Definition 1 (Anti-adjacency matrix).** Given a graph characterized by an adjacency matrix $\mathbf{A}$, the corresponding anti-adjacency matrix $\bar{\mathbf{A}}$ is such that $\bar{a}_{ij}$ quantifies the discrepancy between nodes $i$ and $j$.

We consider a bipartite graph characterized by its biadjacency matrix $\mathbf{B} = (b_{ij}) \in \mathbb{R}^{n \times d}$. The rows of $\mathbf{B}$ are endowed with weights $\mathbf{w} \in \Delta_n$ and its columns with weights $\mathbf{v} \in \Delta_d$. We also consider a row exemplar distribution $\mathbf{r} \in \Delta_k$ and a column exemplar distribution $\mathbf{c} \in \Delta_k$. Depending on the availability of a priori information about the data, these weight vectors can be set to uniform distributions. Now let the anti-biadjacency matrix be $\bar{\mathbf{B}} = L(\mathbf{B})$, where $L : \mathbb{R}^{n \times d} \to \mathbb{R}^{n \times d}$ transforms $b_{ij}$, the association between node $i$ and node $j$, into a discrepancy measure $L(\mathbf{B})_{ij}$. We then define the optimal transport block seriation problem as the following bilinear program:

$$\mathrm{BCOT}(\mathbf{w}, \mathbf{v}, \mathbf{r}, \mathbf{c}) \triangleq \min_{\mathbf{Z} \in \Pi(\mathbf{w},\mathbf{r}),\ \mathbf{W} \in \Pi(\mathbf{v},\mathbf{c})} \sum_{i,j,h} L(\mathbf{B})_{ij}\, z_{ih} w_{jh} = \min_{\mathbf{Z} \in \Pi(\mathbf{w},\mathbf{r}),\ \mathbf{W} \in \Pi(\mathbf{v},\mathbf{c})} \langle L(\mathbf{B}), \mathbf{Z}\mathbf{W}^\top \rangle \qquad (6)$$

where $\mathbf{Z}$ is a transport plan (or coupling) between the row distribution $\mathbf{w}$ and the row exemplar distribution $\mathbf{r}$, and similarly for $\mathbf{W}$ with respect to the column distribution $\mathbf{v}$ and the column exemplar distribution $\mathbf{c}$.

[Figure 1: Biclusters formed using three different methods on the Pubmed dataset: (a) block seriation, (b) BCOT, (c) BCOTλ. Classical block seriation results in a biclustering that is hard; BCOT results in a biclustering that is almost hard, with few nonzero entries outside the main block diagonal; BCOTλ results in a soft biclustering with many nonzero elements outside the block diagonal.]
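For intuition, the BCOT objective in (6) can be evaluated cheaply without materializing the $n \times d$ matrix $\mathbf{Z}\mathbf{W}^\top$, since $\langle L(\mathbf{B}), \mathbf{Z}\mathbf{W}^\top \rangle = \mathrm{tr}(\mathbf{W}^\top L(\mathbf{B})^\top \mathbf{Z})$; a minimal NumPy sketch (ours, not from the paper):

```python
import numpy as np

def bcot_objective(L_B: np.ndarray, Z: np.ndarray, W: np.ndarray) -> float:
    """<L(B), Z W^T> from problem (6), computed as trace(W^T L(B)^T Z)
    so that the n x d product Z W^T is never formed."""
    return float(np.sum((L_B.T @ Z) * W))
```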
**Inducing a biclustering via BCOT.** We now show how to obtain a partition of the rows and the columns given a solution pair $(\mathbf{Z}, \mathbf{W})$. In what follows our aim is to identify an almost-hard clustering couple for rows and columns from the couplings $\mathbf{Z}$ and $\mathbf{W}$.

**Definition 2 (h-almost hard clustering).** An $h$-almost hard clustering is a clustering whose assignment matrix $\mathbf{C} \in \mathbb{R}^{n \times k}$ satisfies $\|\mathbf{C}\|_0 = n + h$ and $\|\mathbf{c}_i\|_0 > 0$ for each row $\mathbf{c}_i$ of $\mathbf{C}$. When $h = 0$, we obtain a standard hard clustering with one nonzero element per row.

**Proposition 1.**¹ For $\mathbf{w}$, $\mathbf{v}$, $\mathbf{r}$ and $\mathbf{c}$ containing no zeros, there exists an optimal pair of coupling matrices $\mathbf{Z}$ and $\mathbf{W}$ that are $h$-almost hard clusterings with $h \in \{0, \ldots, k-1\}$. Furthermore, when $n = k$ (resp. $d = k$) and $\mathbf{w} = \mathbf{r}$ (resp. $\mathbf{v} = \mathbf{c}$), this $\mathbf{Z}$ (resp. $\mathbf{W}$) becomes a hard clustering, i.e., $\mathbf{Z} \in \Gamma(n, n)$ (resp. $\mathbf{W} \in \Gamma(d, d)$).

¹Proofs for the propositions are given in the appendix.

This means that the solutions are already almost a hard partition of the data, since $k \ll n, d$. To obtain a final hard clustering in the strict sense, we assign each row (resp. column) to the cluster whose entry in the corresponding row of $\mathbf{Z}$ (resp. $\mathbf{W}$) is largest. This should not significantly change the structure of the solution. Figure 1b provides an illustration: we see the block diagonal structure generated by the product of the two coupling matrices $\mathbf{C} = \mathbf{Z}\mathbf{W}^\top$, similar in appearance to the biclustering produced by hard block seriation in Figure 1a, apart from a few nonzero entries off the block diagonal that are hard to spot immediately.

**Intuition for BCOT.** To explain the intuition behind the proposed approach we need to look at how the problem is solved. The optimization procedure, described in Algorithm 1, alternates between computing an optimal transport plan $\mathbf{Z}$ given $\mathbf{W}$ and vice versa. When solving for $\mathbf{Z}$ given $\mathbf{W}$, the problem can be rewritten as

$$\mathrm{BCOT}(\mathbf{w}, \mathbf{v}, \mathbf{r}, \mathbf{c}) = \min_{\mathbf{Z} \in \Pi(\mathbf{w},\mathbf{r})} \langle L(\mathbf{B})\mathbf{W}, \mathbf{Z} \rangle. \qquad (7)$$

This is an optimal transport problem with $L(\mathbf{B})\mathbf{W}$ as the cost matrix. The resulting transport plan $\mathbf{Z}$ can be seen as a kind of row cluster assignment matrix: if $z_{ih} > 0$, then row $i$ is assigned to cluster $h$. The same holds for $\mathbf{W}$, which can be seen as a column cluster assignment matrix. This also means that, since $L(\mathbf{B})$ measures the dissimilarity between the rows and the columns, the cost matrix $L(\mathbf{B})\mathbf{W}$ represents the dissimilarity between rows and row exemplars (or representatives, or centroids). In particular, $L(\mathbf{B})_i \mathbf{w}_{\cdot h}$ is the dissimilarity, or cost of probability mass transportation, between row $i$ and row cluster exemplar $h$. The reasoning is the same for the columns and the optimal coupling $\mathbf{W}$.

**Low-rank optimal transport.** Biclustering is the main purpose of the approach we propose, but there is another interesting use case.

**Proposition 2.** For equal target row and column representative distributions, i.e., $\mathbf{r} = \mathbf{c}$, containing no zero entries, and given a solution pair $\mathbf{Z}$ and $\mathbf{W}$ to BCOT, the matrix $\mathbf{Q} = \mathbf{Z}\,\mathrm{diag}(1/\mathbf{r})\,\mathbf{W}^\top$ is an approximation of the optimal transport plan solving problem (2), and its rank is at most $\min(\mathrm{rank}(\mathbf{Z}), \mathrm{rank}(\mathbf{W}))$.
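A minimal sketch (ours) of the two constructions just described: the argmax rounding used to obtain strict partitions, and the low-rank plan $\mathbf{Q}$ of Proposition 2:

```python
import numpy as np

def hard_partitions(Z: np.ndarray, W: np.ndarray):
    """Round the almost-hard couplings to strict partitions by taking
    the largest entry in each row (the rule described above)."""
    return Z.argmax(axis=1), W.argmax(axis=1)

def low_rank_plan(Z: np.ndarray, W: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Q = Z diag(1/r) W^T from Proposition 2; assumes r = c with no
    zero entries."""
    return (Z / r) @ W.T  # dividing the columns of Z by r implements diag(1/r)
```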
Some recent works [16, 31] have suggested that this kind of low-rank regularization is preferable to entropic regularization in certain respects. For example, the rank parameter is easier to select, since it has simple bounds (an integer between 1 and $n$). This may be contrasted with the regularization strength $\lambda$ in the Sinkhorn algorithm, which is continuous.

### 2.3 Fuzzy Biclustering via Regularized Optimal Transport

As previously mentioned, entropic regularization is attractive because of its various useful features, including statistical and computational efficiency. Another consequence of entropic regularization is that the optimal couplings $\mathbf{Z}$ and $\mathbf{W}$ are dense matrices, owing to the structure of the optimal solution of entropically regularized OT problems. We formulate the problem as follows:

$$\mathrm{BCOT}_\lambda(\mathbf{w}, \mathbf{v}, \mathbf{r}, \mathbf{c}) \triangleq \min_{\mathbf{Z} \in \Pi(\mathbf{w},\mathbf{r}),\ \mathbf{W} \in \Pi(\mathbf{v},\mathbf{c})} \langle L(\mathbf{B}), \mathbf{Z}\mathbf{W}^\top \rangle - \lambda_Z H(\mathbf{Z}) - \lambda_W H(\mathbf{W}) \qquad (8)$$

where $\lambda_Z$ and $\lambda_W$ are the regularization parameters.

**Fuzzy block seriation.** We propose a fuzzy variant of the block seriation problem that allows us, by extension, to define a fuzzy variant of BCOT using entropic regularization. Let the fuzzy block seriation problem be defined as

$$\max_{\mathbf{Z} \in \Gamma_s(n,k),\ \mathbf{W} \in \Gamma_s(d,k)} \sum_{i,j,h} b_{ij} z_{ih} w_{jh} + \Omega(\mathbf{Z}, \mathbf{W}) \qquad (9)$$

where $\Omega(\mathbf{Z}, \mathbf{W})$ is a regularization term introduced to make the partition matrices $\mathbf{Z}$ and $\mathbf{W}$ dense (for example, entropic regularization or low-rank constraints), and $\Gamma_s(n, k) = \{\mathbf{Z} \in \mathbb{R}^{n \times k}_+ \mid \mathbf{Z}\mathbf{1} = \mathbf{1}\}$ is the set of fuzzy partitions. Intuitively, for a solution pair $(\mathbf{Z}, \mathbf{W})$ and up to a constant factor, each entry of the block seriation matrix $\mathbf{C} = \mathbf{Z}\mathbf{W}^\top$ can be seen as the probability that its corresponding row and column belong to the same bicluster, i.e., $c_{ij} = \mathbf{z}_i \mathbf{w}_j^\top = \sum_{h=1}^{k} z_{ih} w_{jh} = p(\mathbf{b}_i, \mathbf{b}_{\cdot j}) = \sum_{h=1}^{k} p(\mathbf{b}_i, \mathbf{b}_{\cdot j}, h)$. It is easy to see how problem (9) relates to problem (8), and that the couplings corresponding to solutions of the problem give the probability that the different rows and columns belong to the same biclusters. Figure 1c shows biclusters produced by the solutions of BCOTλ. As with BCOT, a block diagonal structure is formed; however, there are also several off-block-diagonal nonzero entries that represent the probabilities of the row-column pairs belonging to the same biclusters.
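A small sketch (ours) of this reading of $\mathbf{C} = \mathbf{Z}\mathbf{W}^\top$ as a table of co-membership probabilities; the normalization into a joint distribution is our addition:

```python
import numpy as np

def comembership_table(Z: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Entry (i, j) of C = Z W^T is, up to a constant factor, the
    probability that row i and column j share a bicluster; dividing
    by the total mass turns C into a joint probability table."""
    C = Z @ W.T
    return C / C.sum()
```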
## 3 Links to Existing Work

### 3.1 Modularity Maximization in Bipartite Graphs [3]

This model is able to co-cluster binary and contingency matrices by directly maximizing an adapted version of the modularity measure traditionally used for networks. The criterion that it optimizes is

$$\max_{\mathbf{Z} \in \Gamma(n,k),\ \mathbf{W} \in \Gamma(d,k)} \sum_{i,j,h} \left( b_{ij} - \frac{b_{i\cdot}\, b_{\cdot j}}{b_{\cdot\cdot}} \right) z_{ih} w_{jh}. \qquad (10)$$

By setting $L(\mathbf{B}) = \frac{1}{b_{\cdot\cdot}} \mathbf{B}\mathbf{1}\mathbf{1}^\top\mathbf{B} - \mathbf{B}$, this problem becomes equivalent to ours; the difference lies in the constraints on $\mathbf{Z}$ and $\mathbf{W}$.

### 3.2 Modularity-Based Sparse Soft Graph Clustering [23]

Here the authors proposed a fuzzy variant of the above problem (although in the context of traditional clustering rather than biclustering). Solving the problem gives, for each element of the dataset, a probability of that element belonging to a given cluster. Our proposed entropic regularization variant represents a kind of extension of this problem to bipartite graphs.

### 3.3 Directional Co-clustering with a Conscience [30, 1]

This model makes use of the block von Mises-Fisher mixture model for co-clustering directional data on the unit sphere. It optimizes the following criterion:

$$\max_{\mathbf{Z} \in \Gamma(n,k),\ \mathbf{W} \in \Gamma(d,k)} \sum_{i,j,h} \frac{z_{ih} w_{jh} b_{ij}}{\sqrt{z_{\cdot h}\, w_{\cdot h}}}. \qquad (11)$$

In our formulation, if we define $L(\mathbf{B}) = -\mathbf{B}$ and apply cluster-size normalization to the optimal transport plans, $\tilde{\mathbf{Z}} = \mathbf{Z}\,\mathrm{diag}(\mathbf{Z}^\top\mathbf{1})^{-1/2}$ and $\tilde{\mathbf{W}} = \mathbf{W}\,\mathrm{diag}(\mathbf{W}^\top\mathbf{1})^{-1/2}$, after computing $\mathbf{Z}$ and $\mathbf{W}$ respectively in Algorithm 1, we obtain a more general version of the algorithm proposed by the authors for solving problem (11).

### 3.4 Bipartite Correlation Clustering [2]

In the case where the cost function results in a complete bipartite graph with "+" and "−" edges, via

$$L(\mathbf{B})_{ij} = \begin{cases} -1 & \text{if } b_{ij} > 0 \\ +1 & \text{otherwise,} \end{cases} \qquad (12)$$

we get what is known as bipartite correlation clustering. The solution to this problem maximizes the number of agreements, i.e., the number of "+" edges within clusters plus the number of "−" edges across clusters.

## 4 Optimization and Complexity

[Figure 2: Loss of BCOT and BCOTλ on Pubmed over 100 iterations.]

**Optimization.** Since the block seriation problem is NP-hard, computing an exact solution is prohibitive. An efficient and widely used heuristic for solving these kinds of problems is block coordinate descent, where row assignments are computed for fixed column assignments, and then vice versa, in alternation. We express the proposed algorithm in pseudo-code as Algorithm 1. At each iteration we solve two intermediate optimal transport problems with cost matrices of dimensions $n \times k$ and $d \times k$. Since $\mathbf{B}$ is generally sparse, and $L$ can be defined such that $L(\mathbf{B})$ retains a similarly sparse structure, the computation of the intermediate cost matrices $L(\mathbf{B})\mathbf{W}$ and $L(\mathbf{B})^\top\mathbf{Z}$ is reasonably efficient. We also observed that the algorithm does not need many iterations to converge, as shown in Figure 2, whether for BCOT or BCOTλ.

**Algorithm 1: BCOT**
Input: biadjacency matrix $\mathbf{B}$; row and column weights $\mathbf{w}$, $\mathbf{v}$; row and column exemplar distributions $\mathbf{r}$, $\mathbf{c}$
Output: row and column partitions $\pi_r$, $\pi_c$
$\mathbf{W} \leftarrow \mathbf{W}_{\mathrm{init}}$
while not converged do
  $\mathbf{Z} \leftarrow \arg \mathrm{OT}(L(\mathbf{B})\mathbf{W}, \mathbf{w}, \mathbf{r})$
  $\mathbf{W} \leftarrow \arg \mathrm{OT}(L(\mathbf{B})^\top\mathbf{Z}, \mathbf{v}, \mathbf{c})$
end
Generate $\pi_r$, $\pi_c$ from $\mathbf{Z}$ and $\mathbf{W}$
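A compact Python sketch of Algorithm 1 built on POT [15] might look as follows. The uniform weights and exemplar distributions, the random feasible initialization, the fixed iteration count, and the choice $c = 1$ in the loss $L(\mathbf{B}) = c - \mathbf{B}$ (see Section 5.2) are our assumptions, not the paper's exact implementation (which is released at the repository cited below):

```python
import numpy as np
import ot  # POT: Python Optimal Transport [15]

def bcot(B, k, n_iter=30, reg=None, seed=0):
    """Sketch of Algorithm 1 (BCOT); passing `reg` switches to the
    entropic variant BCOT-lambda. Returns hard row/column partitions."""
    n, d = B.shape
    L_B = 1.0 - B                               # anti-biadjacency, c - B with c = 1
    w, v = np.full(n, 1 / n), np.full(d, 1 / d)
    r = c = np.full(k, 1 / k)
    solve = (lambda a, b, M: ot.sinkhorn(a, b, M, reg)) if reg else ot.emd
    rng = np.random.default_rng(seed)
    W = ot.emd(v, c, rng.random((d, k)))        # random feasible coupling W_init
    for _ in range(n_iter):
        Z = solve(w, r, L_B @ W)                # row coupling given W
        W = solve(v, c, L_B.T @ Z)              # column coupling given Z
    return Z.argmax(axis=1), W.argmax(axis=1)   # argmax rounding (Section 2.2)
```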
**Proposition 3.** The computational complexity of BCOT (Algorithm 1) when using an exact OT solver is $O\!\left(tk\|\mathbf{B}\|_0 + tnk(n+k)\log(n+k) + tdk(d+k)\log(d+k)\right)$; when using entropic regularization, it is $O(tk\|\mathbf{B}\|_0 + tkn + tkd)$, where $t$ is the number of iterations.

In Table 1 we report the computational and spatial complexities of the different biclustering approaches. Our model has the same spatial complexity as the COOT variants and a better spatial complexity than the CCOT variants. As regards computational complexity, our model should in most cases be faster on sparse data, and our experiments support this conjecture. For reproducibility, we publicly release our code at https://github.com/chakib401/BCOT.

Table 1: Computational and spatial complexity of the different OT biclustering approaches. For the COOT variants, we report complexities for a Euclidean cost matrix; for a generic cost, the time complexity is greater. For simplicity, we suppose that $d \in O(n)$ and that we want a biclustering with the same number of row and column clusters for COOT and CCOT. $t$ denotes the number of iterations and, for CCOT, $s$ denotes the number of necessary samplings.

| Method | Spatial complexity | Time complexity |
|---|---|---|
| CCOT | $O(n^2)$ | $O(sn^3)$ |
| CCOT-GW | $O(n^2)$ | $O(n^3)$ |
| COOT | $O(nk)$ | $O((n+k)nk + k^2n + t(n+k)nk\log(n+k))$ |
| COOTλ | $O(nk)$ | $O((n+k)nk + k^2n + tnk)$ |
| BCOT | $O(nk)$ | $O(k\|\mathbf{B}\|_0 + t(n+k)nk\log(n+k))$ |
| BCOTλ | $O(nk)$ | $O(k\|\mathbf{B}\|_0 + tnk)$ |

## 5 Experiments

We ran experiments using document-term matrices. The benefit of applying biclustering to this kind of data is that the resulting biclusters contain both documents and the words that characterize them, which helps in interpreting the clustering of the documents. Additional experiments on synthetic and gene expression data are available in the appendix.

### 5.1 Datasets

We evaluate BCOT on six benchmark document-term datasets: ACM, DBLP, PubMed, Wiki, Ohscal, and 20 Newsgroups. Their characteristics are shown in Table 2. ACM, DBLP, PubMed and Wiki are attributed networks from which we use only the node-level features, which correspond to document-term matrices. We also selected the Ohscal collection and 20 Newsgroups as large-scale document-term matrices to serve as computational efficiency benchmarks.

Table 2: Characteristics of the datasets.

| Dataset | #Documents | #Terms | #Document clusters | Sparsity (%) |
|---|---|---|---|---|
| ACM [13] | 3025 | 1870 | 3 | 95.52 |
| DBLP [13] | 4057 | 334 | 4 | 96.4 |
| PubMed [32] | 19717 | 500 | 3 | 89.98 |
| Wiki [37] | 2405 | 4973 | 17 | 86.99 |
| Ohscal [22] | 11162 | 11465 | 10 | 99.47 |
| 20 Newsgroups [26] | 18846 | 14390 | 20 | 99.41 |

### 5.2 Experimental Setup

In our experiments we define the loss function as $L(\mathbf{B}) = c - \mathbf{B}$, where $c$ is selected from $\{1, k, d, n\}$. For BCOTλ, the regularization parameter $\lambda$ is selected from $\{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1, 10\}$. The best hyper-parameters are those that minimize the number of empty clusters; in the case of ties, we select according to the Davies-Bouldin index of the partition [7]. Random restarts are not used for any of the algorithms, including k-means. We use the implementations provided by the authors for COOT, COOTλ and CCOT-GW; the code for CCOT was not available, so we implemented it based on the code for CCOT-GW. All reported figures are averages over 10 runs. All experiments were performed on the same machine, with an Intel(R) Xeon(R) CPU and 12 GB of RAM. For OT solvers we made use of the POT package [15].
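A hedged sketch (ours) of this selection rule; the `runner` callback, which fits one configuration and returns row labels, is a hypothetical hook, and the lexicographic tie-breaking is our reading of the rule:

```python
import numpy as np
from itertools import product
from sklearn.metrics import davies_bouldin_score

def select_configuration(B, k, cs=(1,), lambdas=(None,), runner=None):
    """Prefer the configuration with the fewest empty row clusters,
    breaking ties with the Davies-Bouldin index [7] of the partition."""
    best_labels, best_key = None, None
    for c, lam in product(cs, lambdas):
        labels = runner(B, k, c, lam)
        n_empty = k - len(np.unique(labels))
        # The DBI needs at least two clusters; fall back to +inf otherwise.
        dbi = (davies_bouldin_score(B, labels)
               if len(np.unique(labels)) > 1 else np.inf)
        key = (n_empty, dbi)  # lexicographic: empties first, then DBI
        if best_key is None or key < best_key:
            best_labels, best_key = labels, key
    return best_labels
```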
### 5.3 Document Clustering

**Metrics.** Here the evaluation is straightforward: we adopt three popular clustering metrics, namely clustering accuracy (CA), normalized mutual information (NMI) [4], and adjusted Rand index (ARI) [24].

Table 3: Document clustering performance on the four datasets (mean ± standard deviation). OOM denotes out of memory.

| Method | ACM CA | ACM NMI | ACM ARI | DBLP CA | DBLP NMI | DBLP ARI | PubMed CA | PubMed NMI | PubMed ARI | Wiki CA | Wiki NMI | Wiki ARI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| k-Means | 51.1±11.3 | 13.7±11.2 | 14.0±10.6 | 36.9±2.4 | 10.4±2.0 | 4.3±2.0 | 52.3±4.7 | 18.2±10.5 | 15.3±10.1 | 26.0±6.1 | 18.6±9.3 | 3.3±2.9 |
| CCOT | 12.4±2.0 | 1.0±0.2 | 0.4±0.2 | 28.6±0.5 | 0.6±0.0 | 0.4±0.0 | 32.7±0.2 | 3.0±0.0 | 3.1±0.1 | 10.6±0.5 | 4.9±0.1 | 0.6±0.15 |
| CCOT-GW | 8.1±0.0 | 1.5±0.0 | 0.3±0.0 | 9.4±0.0 | 1.7±0.0 | 0.3±0.0 | OOM | OOM | OOM | 10.9±0.0 | 4.3±0.0 | 0.48±0.0 |
| COOT | 39.0±0.0 | 1.9±0.0 | 2.0±0.0 | 30.5±1.4 | 1.4±0.3 | 1.2±0.3 | 43.2±1.5 | 1.7±0.6 | 1.3±1.5 | 25.9±1.8 | 28.7±2.2 | 12.3±1.7 |
| COOTλ | 41.5±0.2 | 1.9±0.1 | 2.2±0.0 | 30.6±0.0 | 0.7±0.0 | 0.6±0.0 | 42.4±1.5 | 1.7±0.5 | 1.0±1.3 | 17.2±0.0 | 1.7±0.0 | 0.31±0.0 |
| BCOT | 76.6±1.5 | 38.3±2.2 | 43.3±2.6 | 61.5±6.2 | 27.4±4.3 | 28.3±5.5 | 53.6±4.5 | 15.9±1.9 | 12.9±2.4 | 49.8±1.5 | 47.9±1.0 | 30.6±1.0 |
| BCOTλ | 76.2±0.6 | 37.6±0.8 | 42.4±1.0 | 59.4±9.9 | 26.6±7.6 | 27.2±9.5 | 56.5±3.1 | 18.4±1.3 | 15.4±1.8 | 50.8±1.5 | 49.4±0.9 | 31.9±0.8 |

**Performance.** Document clustering results on ACM, DBLP, PubMed and Wiki are given in Table 3 for the three metrics. In all cases the best result is obtained either by BCOT or by BCOTλ. Moreover, on Wiki, BCOTλ gives competitive results when compared with the state-of-the-art attributed graph clustering methods presented in [14], despite not having access to the graph structure information of the Wiki citation network.

**Efficiency.** Figure 3 plots the document clustering performance (accuracy against training time) of the different methods on the two large-scale document-term matrices, 20 Newsgroups and Ohscal. BCOT offers the best accuracy, while BCOTλ is the fastest method on both datasets. We see that for both BCOT and COOT, the entropic-regularized versions outspeed their exact counterparts, and that CCOT suffers from very high computation times, due mainly to the fact that this method requires pairwise distance matrices to be computed on the rows and columns.

[Figure 3: Accuracy against training time on (a) 20 Newsgroups (NG20) and (b) Ohscal. BCOTλ is the fastest and has a competitive level of accuracy; BCOT gives the best accuracy while remaining relatively efficient. The multiplication factors shown for the training times take BCOTλ as the reference (so, for example, 4.5 for BCOT means that it is approximately 4.5 times slower than BCOTλ). We were not able to benchmark CCOT-GW, since it failed to scale to these datasets.]

### 5.4 Term Clustering

**Metrics.** Unlike document clustering, there is no ground-truth partition for terms, so we need another way of evaluating term clustering results. One generally accepted technique is to analyse the semantic coherence of the clusters obtained. To this end we introduce a metric based on pointwise mutual information (PMI). PMI is a frequently used information-theoretic metric for quantifying the relationship between pairs of discrete random variable outcomes. We chose PMI because prior research [29] has shown that it is closely associated with human judgements in determining word relatedness. The PMI between terms $w_i$ and $w_j$ is

$$\mathrm{PMI}(w_i, w_j) = \log \frac{p(w_i, w_j)}{p(w_i)\, p(w_j)}. \qquad (13)$$

In the context of term clustering, given the word co-occurrence matrix $\mathbf{K} = \mathbf{B}^\top\mathbf{B}$, the PMI is estimated as

$$\mathrm{PMI}(w_i, w_j) = \log \frac{k_{\cdot\cdot}\, k_{ij}}{k_{i\cdot}\, k_{\cdot j}}. \qquad (14)$$

To evaluate a partition of terms $\mathcal{P}$, we propose a metric based on intra- and inter-cluster PMI:

$$\mathrm{PMI}_{\mathrm{intra}}(P) = \sum_{w_i \in P} \sum_{w_j \in P} \mathrm{PMI}(w_i, w_j) \qquad (15)$$

$$\mathrm{PMI}_{\mathrm{inter}}(P) = \sum_{w_i \in P} \sum_{w_j \notin P} \mathrm{PMI}(w_i, w_j) \qquad (16)$$

In this way, a good clustering should reveal high intra-cluster semantic relatedness, corresponding to higher PMI values. Using the intra and inter PMIs, we propose the following coherence index:

$$\mathrm{coherence}(\mathcal{P}) = \frac{1}{|\mathcal{P}|} \sum_{P \in \mathcal{P}} \left( \mathrm{PMI}_{\mathrm{intra}}(P) - \mathrm{PMI}_{\mathrm{inter}}(P) \right). \qquad (17)$$

Our reasoning is this: the greater the semantic proximity between terms in the same cluster, and the greater the semantic distance between terms in different clusters, the higher the value of coherence.
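A NumPy sketch (ours) of the coherence index under this reading of equations (14)-(17); assigning a PMI of 0 to zero co-occurrences is our addition:

```python
import numpy as np

def coherence(B: np.ndarray, labels: np.ndarray) -> float:
    """Coherence index (17) of a term partition given by `labels`."""
    K = B.T @ B                                      # co-occurrence counts, (14)
    with np.errstate(divide="ignore"):
        pmi = np.log(K.sum() * K) - np.log(np.outer(K.sum(1), K.sum(0)))
    pmi[~np.isfinite(pmi)] = 0.0                     # zero co-occurrences
    score = 0.0
    clusters = np.unique(labels)
    for p in clusters:
        inside = labels == p
        score += pmi[np.ix_(inside, inside)].sum()   # PMI_intra, (15)
        score -= pmi[np.ix_(inside, ~inside)].sum()  # PMI_inter, (16)
    return score / len(clusters)
```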
**Results.** Since there is no ground-truth number of term clusters, we use the cluster-number estimates produced by CCOT-GW for all the other models, so that coherence values are easy to compare between them. Comparisons based on different numbers of clusters would favor the model using the larger number of clusters. Table 4 shows the coherences obtained across the different datasets using our approach, along with those of the baselines. It is clear that BCOT succeeds in capturing more semantics than the other approaches since, whatever the dataset, one or other of the two BCOT variants gives the highest coherence.

Table 4: Term clustering performance on the six datasets (mean ± standard deviation). OOM denotes out of memory.

| Method | ACM | DBLP | PubMed | Wiki | NG20 | Ohscal |
|---|---|---|---|---|---|---|
| k-Means | 0.19±0.01 | 0.05±0.03 | 0.31±0.18 | 0.28±0.02 | 0.28±0.04 | 0.01±0.02 |
| CCOT | 0.03±0.00 | −0.07±0.06 | 0.02±0.01 | 0.02±0.00 | 0.05±0.00 | 0.06±0.00 |
| CCOT-GW | 0.08±0.00 | 0.03±0.00 | OOM | 0.01±0.00 | OOM | OOM |
| COOT | 0.12±0.01 | 0.07±0.00 | 0.14±0.01 | 0.40±0.00 | 0.43±0.02 | 0.23±0.01 |
| COOTλ | 0.21±0.00 | 0.04±0.00 | −0.00±0.00 | −0.08±0.00 | −0.02±0.00 | −0.13±0.00 |
| BCOT | 0.27±0.01 | 0.22±0.04 | 0.54±0.03 | 0.64±0.01 | 0.79±0.01 | 0.44±0.00 |
| BCOTλ | 0.24±0.00 | 0.16±0.02 | 0.57±0.01 | 0.62±0.01 | 0.27±0.01 | 0.35±0.00 |

### 5.5 Statistical Significance

We performed a Nemenyi post-hoc test [28, 8] with a confidence level of 90% on the document and term clustering results, to determine whether our model outperforms the other OT biclustering approaches in a statistically significant way. To conduct this test we generated 20 rankings of the OT biclustering models, one for each dataset and quality-metric pair, covering both document and term clustering. Figure 4 shows the results of the test. Two groups with different performance were identified: one comprising BCOT and BCOTλ, which gives better results than the other group comprising the remaining COOT and CCOT variants. This also indicates that, with this number of datasets and metrics, the test was unable to tell the COOT and CCOT variants apart in a statistically significant way.

[Figure 4: Result of the Nemenyi post-hoc test (critical difference diagram of average ranks over the COOT, CCOT, and BCOT variants).]

## 6 Conclusion

Clustering and biclustering through optimal transport are still at a nascent stage, with many challenges remaining unsolved. This paper introduces a novel problem for biclustering using optimal transport that takes into account the sparse nature of certain types of dyadic data, such as document-term matrices, to enable a more computationally efficient resolution. The problem is posed as a bilinear program that we solve using an efficient block coordinate descent algorithm to find a vertex solution. Experiments on a number of document-term datasets suggest that the proposed approach does a good job of finding clusters that correspond to ground-truth document classes, while generating semantically coherent partitions for the terms. In this setting, our model outperforms recent OT biclustering methods by a significant margin, while being more computationally efficient.

## Acknowledgments and Disclosure of Funding

This work has been funded by Informatique Caisse des Dépôts et Consignations (ICDC), Association Nationale de la Recherche et de la Technologie (ANRT), and Idex-Spectrans of Université Paris Cité.

## References

[1] Séverine Affeldt, Lazhar Labiod, and Mohamed Nadif. Regularized bi-directional co-clustering. Statistics and Computing, 31(3):1–17, 2021.
[2] Nir Ailon, Noa Avigdor-Elgrabli, Edo Liberty, and Anke Van Zuylen. Improved approximation algorithms for bipartite correlation clustering. SIAM Journal on Computing, 41(5):1110–1121, 2012.
[3] Michael J. Barber. Modularity and community detection in bipartite networks. Physical Review E, 76(6):066102, 2007.
[4] Deng Cai, Xiaofei He, Xiaoyun Wu, and Jiawei Han. Non-negative matrix factorization on manifold. In 2008 Eighth IEEE International Conference on Data Mining, pages 63–72. IEEE, 2008.
[5] Lenaic Chizat, Pierre Roussillon, Flavien Léger, François-Xavier Vialard, and Gabriel Peyré. Faster Wasserstein distance estimation with the Sinkhorn divergence. Advances in Neural Information Processing Systems, 33:2257–2269, 2020.
[6] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. Advances in Neural Information Processing Systems, 26, 2013.
[7] David L. Davies and Donald W. Bouldin. A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, (2):224–227, 1979.
[8] Janez Demšar. Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research, 7:1–30, 2006.
[9] Inderjit S. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 269–274, 2001.
[10] Inderjit S. Dhillon, Subramanyam Mallela, and Dharmendra S. Modha. Information-theoretic co-clustering. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 89–98, 2003.
[11] Sara Dolnicar, Sebastian Kaiser, Katie Lazarevski, and Friedrich Leisch. Biclustering: Overcoming data dimensionality problems in market segmentation. Journal of Travel Research, 51(1):41–49, 2012.
[12] Michael B. Eisen, Paul T. Spellman, Patrick O. Brown, and David Botstein. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences, 95(25):14863–14868, 1998.
[13] Shaohua Fan, Xiao Wang, Chuan Shi, Emiao Lu, Ken Lin, and Bai Wang. One2Multi graph autoencoder for multi-view graph clustering. In Proceedings of The Web Conference 2020, pages 3070–3076, 2020.
[14] Chakib Fettal, Lazhar Labiod, and Mohamed Nadif. Efficient graph convolution for joint node representation learning and clustering. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, pages 289–297, 2022.
[15] Rémi Flamary, Nicolas Courty, Alexandre Gramfort, Mokhtar Z. Alaya, Aurélie Boisbunon, Stanislas Chambon, Laetitia Chapel, Adrien Corenflos, Kilian Fatras, Nemo Fournier, et al. POT: Python Optimal Transport. Journal of Machine Learning Research, 22(78):1–8, 2021.
[16] Aden Forrow, Jan-Christian Hütter, Mor Nitzan, Philippe Rigollet, Geoffrey Schiebinger, and Jonathan Weed. Statistical optimal transport via factored couplings. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2454–2465. PMLR, 2019.
[17] Gérard Govaert and Mohamed Nadif. Co-clustering: Models, Algorithms and Applications. John Wiley & Sons, 2013.
[18] Gérard Govaert and Mohamed Nadif. Mutual information, phi-squared and model-based co-clustering for contingency tables. Advances in Data Analysis and Classification, 12(3):455–488, 2018.
[19] Jiajun Gu and Jun S. Liu. Bayesian biclustering of gene expression data. BMC Genomics, 9(1):1–10, 2008.
[20] Rave Harpaz, Hector Perez, Herbert S. Chase, Raul Rabadan, George Hripcsak, and Carol Friedman. Biclustering of adverse drug events in the FDA's spontaneous reporting system. Clinical Pharmacology & Therapeutics, 89(2):243–250, 2011.
[21] John A. Hartigan. Direct clustering of a data matrix. Journal of the American Statistical Association, 67(337):123–129, 1972.
[22] William Hersh, Chris Buckley, T.J. Leone, and David Hickam. OHSUMED: An interactive retrieval evaluation and new large test collection for research. In SIGIR '94, pages 192–201. Springer, 1994.
[23] Alexandre Hollocou, Thomas Bonald, and Marc Lelarge. Modularity-based sparse soft graph clustering. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 323–332. PMLR, 2019.
[24] Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classification, 2(1):193–218, 1985.
[25] Charlotte Laclau, Ievgen Redko, Basarab Matei, Younes Bennani, and Vincent Brault. Co-clustering through optimal transport. In International Conference on Machine Learning, pages 1955–1964. PMLR, 2017.
[26] Ken Lang. NewsWeeder: Learning to filter netnews. In Proceedings of the Twelfth International Conference on Machine Learning, pages 331–339, 1995.
[27] Jean-François Marcotorchino. Block seriation problems: A unified approach. Reply to the problem of H. Garcia and J.M. Proth (Applied Stochastic Models and Data Analysis, 1(1), 25–34 (1985)). Applied Stochastic Models and Data Analysis, 3(2):73–91, 1987.
[28] Peter Bjorn Nemenyi. Distribution-Free Multiple Comparisons. Princeton University, 1963.
[29] David Newman, Sarvnaz Karimi, and Lawrence Cavedon. External evaluation of topic models. In Australasian Document Computing Symposium, 2009.
[30] Aghiles Salah and Mohamed Nadif. Model-based von Mises-Fisher co-clustering with a conscience. In Proceedings of the 2017 SIAM International Conference on Data Mining, pages 246–254. SIAM, 2017.
[31] Meyer Scetbon, Marco Cuturi, and Gabriel Peyré. Low-rank Sinkhorn factorization. In International Conference on Machine Learning, pages 9344–9354. PMLR, 2021.
[32] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93–93, 2008.
[33] Jonathan Templin, Robert A. Henson, et al. Diagnostic Measurement: Theory, Methods, and Applications. Guilford Press, 2010.
[34] Vayer Titouan, Ievgen Redko, Rémi Flamary, and Nicolas Courty. Co-optimal transport. Advances in Neural Information Processing Systems, 33:17559–17570, 2020.
[35] Hua Wang, Feiping Nie, Heng Huang, and Fillia Makedon. Fast nonnegative matrix tri-factorization for large-scale data co-clustering. In Twenty-Second International Joint Conference on Artificial Intelligence, 2011.
[36] Jianfeng Wang, Mei Lu, Francesco Belardo, and Milan Randić. The anti-adjacency matrix of a graph: Eccentricity matrix. Discrete Applied Mathematics, 251:299–309, 2018.
[37] Cheng Yang, Zhiyuan Liu, Deli Zhao, Maosong Sun, and Edward Y. Chang. Network representation learning with rich text information. In IJCAI, 2015.