# Sublinear Time Approximation of Text Similarity Matrices

Archan Ray, Nicholas Monath (now at Google), Andrew McCallum, Cameron Musco
University of Massachusetts Amherst
{ray, nmonath, mccallum, cmusco}@cs.umass.edu

We study algorithms for approximating pairwise similarity matrices that arise in natural language processing. Generally, computing a similarity matrix for n data points requires Ω(n^2) similarity computations. This quadratic scaling is a significant bottleneck, especially when similarities are computed via expensive functions, e.g., via transformer models. Approximation methods reduce this quadratic complexity, often by using a small subset of exactly computed similarities to approximate the remainder of the complete pairwise similarity matrix. Significant work focuses on the efficient approximation of positive semidefinite (PSD) similarity matrices, which arise, e.g., in kernel methods. However, much less is understood about indefinite (non-PSD) similarity matrices, which often arise in NLP. Motivated by the observation that many of these matrices are still somewhat close to PSD, we introduce a generalization of the popular Nyström method to the indefinite setting. Our algorithm can be applied to any similarity matrix and runs in time sublinear in the size of the matrix, producing a rank-s approximation with just O(ns) similarity computations. We show that our method, along with a simple variant of CUR decomposition, performs very well in approximating a variety of similarity matrices arising in NLP tasks. We demonstrate high accuracy of the approximated similarity matrices in document classification, sentence similarity, and cross-document coreference tasks.

1 Introduction

Many machine learning tasks center around the computation of pairwise similarities between data points using an appropriately chosen similarity function. E.g., in kernel methods, a non-linear kernel inner product is used to measure similarity, and often to construct a pairwise kernel similarity matrix. In natural language processing, document or sentence similarity functions (e.g., cross-encoder models (Devlin et al. 2018) or word mover's distance (Piccoli and Rossi 2014; Kusner et al. 2015)) are key components of cross-document coreference (Cattan et al. 2020) and passage retrieval for question answering (Karpukhin et al. 2020). String-similarity functions are used to model name aliases (Tam et al. 2019) and for morphology (Rastogi, Cotterell, and Eisner 2016).

Computing all pairwise similarities for a data set with n points requires Ω(n^2) similarity computations. This can be a major runtime bottleneck, especially when each computation requires the evaluation of a neural network or other expensive operation. One approach to avoiding this bottleneck is to produce a compressed approximation to the n × n pairwise similarity matrix K for the data set, without ever fully forming this matrix, in sub-quadratic time (i.e., with running time less than O(n^2), or sublinear in the size of K). The compressed approximation, K̃, can be used in place of K to quickly access approximate pairwise similarities, and in methods for near neighbor search, clustering, and regression that would typically operate on K.
1.1 Existing Methods

Similarity matrix approximation is very well studied, especially in the context of accelerating kernel methods and Gaussian process regression. Here, K is typically positive semidefinite (PSD). This structure is leveraged by techniques like random Fourier features and the Nyström method (Rahimi and Recht 2007; Le, Sarlós, and Smola 2013; Williams and Seeger 2001; Yang et al. 2012), which approximate K via a rank-s approximation K̃ = ZZ^T, for s ≪ n and Z ∈ R^{n×s}. These methods have runtimes linear in n and sublinear in the matrix size. They have been very successful in practice (Huang et al. 2014; Meanti et al. 2020), and often come with strong theoretical bounds (Gittens and Mahoney 2016; Musco and Musco 2017; Musco and Woodruff 2017).

Unfortunately, most similarity matrices arising in natural language processing, such as those based on cross-encoder transformers (Devlin et al. 2018) or word mover's distance (Piccoli and Rossi 2014), are indefinite (i.e., non-PSD). For such matrices, much less is known. Sublinear time methods have been studied for certain classes of similarities (Bakshi and Woodruff 2018; Oglic and Gärtner 2019; Indyk et al. 2019), but do not apply generally. Classic techniques like low-rank approximation via the SVD or fast low-rank approximation via random sketching (Frieze, Kannan, and Vempala 2004; Sarlos 2006; Drineas, Mahoney, and Muthukrishnan 2008) generally must form all of K to approximate it, and so run in Ω(n^2) time. There are generic sublinear time sampling methods, like CUR decomposition (Drineas, Kannan, and Mahoney 2006; Wang, Zhang, and Zhang 2016), which are closely related to Nyström approximation. However, as we will see, the performance of these methods varies greatly depending on the application.

1.2 Our Contributions

Algorithmic. Our first contribution is a simple variant of the Nyström method that applies to symmetric indefinite similarity matrices. (While asymmetric similarity matrices do arise, we focus on the symmetric case; in our experiments, symmetrizing and then approximating these matrices yields good performance.) The Nyström method (Williams and Seeger 2001) approximates a PSD similarity matrix K by sampling a set of s ≪ n landmark points from the dataset, computing their similarities with all other points (requiring O(ns) similarity computations), and then using this sampled set of similarities to reconstruct all of K. See Sec. 2.

Our algorithm is motivated by the observation that many indefinite similarity matrices arising in NLP are somewhat close to PSD: they have relatively few negative eigenvalues. Thus, a natural approach would be simply to apply Nyström to them. However, even for matrices with just a few small negative eigenvalues, this fails completely. We instead show how to minimally correct our matrix to be closer to PSD before applying Nyström. Specifically, we apply an eigenvalue shift based on the minimum eigenvalue of a small random principal submatrix of K. We call our method Submatrix-Shifted Nyström, or SMS-Nyström. SMS-Nyström is extremely efficient, and, while we do not give rigorous approximation bounds, it recovers the strong performance of the Nyström method on many near-PSD matrices.

Empirical. Our second contribution is a systematic evaluation of a number of sublinear time matrix approximation methods in NLP applications. We consider three applications involving indefinite similarity matrices: 1) computing document embeddings using word mover's distance (Kusner et al. 2015), for four different text classification tasks;
2) approximating similarity matrices generated using cross-encoder BERT (Devlin et al. 2018) and then comparing performance on three GLUE tasks: STS-B (Cer et al. 2017), MRPC (Dolan and Brockett 2005), and RTE (Bentivogli et al. 2009), which require predicting similarity, semantic equivalence, and sentence entailment; 3) approximating the similarity function used to determine coreference relationships across documents in a corpus of news articles mentioning entities and events (Cybulska and Vossen 2014; Cattan et al. 2020). We show that both SMS-Nyström and a simple variant of CUR decomposition yield accurate approximations that maintain downstream task performance in all these tasks while greatly reducing the time and space required as compared to the exact similarity matrix. They typically significantly outperform the classic Nyström method and other CUR variants.

1.3 Other Related Work

Our work fits into a vast literature on randomized methods for matrix approximation (Mahoney 2011; Woodruff et al. 2014). There is significant work on different sampling distributions and theoretical bounds for both the Nyström and CUR methods (Goreinov, Tyrtyshnikov, and Zamarashkin 1997; Drineas, Mahoney, and Cristianini 2005; Drineas, Mahoney, and Muthukrishnan 2008; Zhang, Tsang, and Kwok 2008; Kumar, Mohri, and Talwalkar 2012; Wang and Zhang 2013; Talwalkar and Rostamizadeh 2014). However, more advanced methods generally require reading all of K and so require Ω(n^2) time. In fact, any method with non-trivial worst-case guarantees on general matrices cannot run in less than O(n^2) time: if the entire mass of the matrix is placed on a single entry, all entries must be accessed to find it.

A number of works apply Nyström variants to indefinite matrices. Belongie et al. (2002) show that the Nyström method can be effectively applied to eigenvector approximation for indefinite matrices, specifically in application to spectral partitioning. However, they do not investigate the behavior of the method in approximating the similarity matrix itself. Gisbrecht and Schleif (2015) show that, in principle, the classic Nyström approximation converges to the true matrix when the similarity function is continuous over R. However, we observe poor finite sample performance of this method on text similarity matrices. Other work exploits assumptions on the input points, e.g., that they lie in a small number of labeled classes, or in a low-dimensional space where distances correlate with the similarity (Schleif, Gisbrecht, and Tino 2018). This latter assumption is made implicitly in recent work on anchor-net-based Nyström (Cai, Nagy, and Xi 2021), and while it may hold in many settings, in NLP applications it is often not clear how to find such a low-dimensional representation. By removing the above assumptions, our work is well suited for applications in NLP, which often feed two inputs (e.g., sentences) into a neural network to compute similarities. There is also significant related work on modifying indefinite similarity matrices to be PSD, including via eigenvalue transformations and shifts (Chen, Gupta, and Recht 2009; Gisbrecht and Schleif 2015). These modifications would allow the matrix to be approximated with the classic Nyström method.
However, this work does not focus on sublinear runtime, typically using modifications that require Ω(n^2) time.

Finally, outside of similarity matrix approximation, there are many methods that seek to reduce the cost of similarity computation. One approach is to reduce the number of similarity computations. Examples include locality sensitive hashing (Gionis et al. 1999; Lv et al. 2007), distance preserving embeddings (Hwang, Han, and Ahn 2012), and graph based algorithms (Orchard 1991; Dong, Moses, and Li 2011) for near-neighbor search. Another approach is to reduce the cost of each similarity computation, e.g., via model distillation for cross-encoder-based similarity (Sanh et al. 2019; Jiao et al. 2019; Michel, Levy, and Neubig 2019; Lan et al. 2019; Zafrir et al. 2019; Humeau et al. 2019). However, model distillation requires significant additional training time to fit the reduced model, unlike our proposed approach, which requires only O(ns) similarity computations. There is also work on random features methods and other alternatives to expensive similarity functions, such as those based on word mover's distance (Cuturi 2013; Wu et al. 2018, 2019).

2 Submatrix-Shifted Nyström

In this section, we introduce the Nyström method for PSD matrices, and describe our modification of this method for application to indefinite similarity matrices.

2.1 The Nyström Method

Let X = {x_i}_{i=1}^n be a dataset with n datapoints, sim : X × X → R be a similarity function, and K ∈ R^{n×n} be the corresponding similarity matrix with K_ij = sim(x_i, x_j). The Nyström method samples s landmark points; let S ∈ R^{n×s} be the matrix for this sampling, with a single randomly positioned 1 in each column. KS ∈ R^{n×s} is a submatrix of K consisting of randomly sampled columns, corresponding to the similarities between all n datapoints and the s landmarks. The key idea is to approximate all pairwise similarities using this sampled set. The Nyström approximation of K is:

K̃ = KS (S^T K S)^{-1} S^T K.   (1)

Running Time. The Nyström approximation of (1) requires just O(ns) evaluations of the similarity function to compute KS ∈ R^{n×s}. We typically do not form K̃ directly, as it would take at least n^2 time to even write down. Instead, we store this matrix in factored form, computing Z = KS (S^T K S)^{-1/2}. In this way, we have ZZ^T = K̃. I.e., the approximate similarity between points x_i and x_j is the inner product between the ith and jth rows of Z, which can be thought of as embeddings of the points into R^s. Computing Z requires computing (S^T K S)^{-1/2}, the matrix square root of (S^T K S)^{-1}, which takes O(s^3) time using, e.g., Cholesky decomposition (if det(S^T K S) = 0, the pseudoinverse (S^T K S)^+ can be used). Multiplying by KS then takes O(ns^2) time, which is the dominant cost since n > s.
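To make the computation concrete, the following sketch implements the classic Nyström approximation of Eq. (1) in factored form. It is a minimal illustration, not the paper's code: the similarity callback `sim`, the NumPy-based pseudoinverse-style square root, and all names are our own assumptions.

```python
import numpy as np

def nystrom_embeddings(X, sim, s, rng=None):
    """Classic Nystrom: sample s landmarks, return Z with Z @ Z.T approximating K.

    X   : list of n data points
    sim : symmetric similarity function sim(x, y) -> float (K assumed PSD)
    s   : number of landmarks, s << n
    """
    rng = np.random.default_rng(rng)
    n = len(X)
    S = rng.choice(n, size=s, replace=False)                 # landmark indices

    # KS: similarities of all n points with the s landmarks (O(ns) sim calls).
    KS = np.array([[sim(X[i], X[j]) for j in S] for i in range(n)])
    W = KS[S, :]                                             # S^T K S, the s x s landmark block

    # (S^T K S)^{-1/2} via eigendecomposition, dropping near-zero eigenvalues
    # (this acts as the pseudoinverse when the block is singular).
    evals, evecs = np.linalg.eigh(W)
    inv_sqrt = np.where(evals > 1e-10, 1.0 / np.sqrt(np.clip(evals, 1e-10, None)), 0.0)
    Z = KS @ (evecs @ np.diag(inv_sqrt) @ evecs.T)           # Z Z^T = KS (S^T K S)^+ S^T K
    return Z, S
```

The O(ns) similarity evaluations needed to build KS dominate the cost when sim is an expensive model; the O(s^3) eigendecomposition of the small landmark block is comparatively cheap.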
Intuition. In (1), S^T K S ∈ R^{s×s} is the principal submatrix of K containing the similarities between the landmark points themselves. To gain some intuition for the approximation, consider removing the (S^T K S)^{-1} term and approximating K with KS S^T K. That is, we approximate the similarity between any two points x_i and x_j by the inner product between their corresponding rows of KS, i.e., the vectors in R^s containing their similarities with the landmarks. This is a reasonable approach: when x_i and x_j are more similar, we expect these rows to have a higher dot product. The (S^T K S)^{-1} term intuitively corrects for similarities between the landmark points. Formally, when K is PSD, it can be written as K = BB^T for some matrix B ∈ R^{n×n}, so that K_ij = ⟨b_i, b_j⟩. Eq. (1) is equivalent to projecting all rows of B onto the subspace spanned by the rows corresponding to the landmark points to produce B̃, and then letting K̃ = B̃B̃^T. If, e.g., rank(K) ≤ s, then rank(B) = rank(K) ≤ s, and so as long as the rows of B corresponding to the landmark points are linearly independent, we will have B̃ = B and thus K̃ = K. If K is close to low-rank, as is often the case in practice, K̃ will still generally yield a very good approximation.

2.2 Nyström for Indefinite Matrices

Our extension of the Nyström method to indefinite matrices is motivated by two observations.

Obs. 1: Text Similarity Matrices are Often Close to PSD. Without some form of structure, we cannot approximate a general n × n matrix in less than O(n^2) time. Fortunately, while many similarity functions used in natural language processing do not lead to matrices with PSD structure, they do lead to matrices that are close to PSD, in that they have relatively few negative eigenvalues, and very few negative eigenvalues of large magnitude. See Figure 1.

Figure 1: The eigenspectra of many text similarity matrices have relatively few negative eigenvalues, i.e., they are relatively close to PSD. The Twitter similarity matrix arises from the exponentiation of Word Mover's Distance (Kusner et al. 2015). STS-B and MRPC are symmetrized cross-encoder BERT sentence and document similarity matrices (Devlin et al. 2018). Eigenvalues are plotted in decreasing order of magnitude from rank 2 to 201 (x-axis: eigenvalue index; y-axis: eigenvalue). The magnitude of the top eigenvalue is typically very large, and so is excluded for better visualization.

Obs. 2: Classic Nyström Fails on Near-PSD Matrices. Given Observation 1, it is natural to hope that the Nyström method is directly useful in approximating many indefinite similarity matrices arising in NLP applications. Unfortunately, this is not the case: the classic Nyström method becomes very unstable and leads to large approximation errors when applied to indefinite matrices, unless they are very close to PSD. See Fig. 3. A major reason for this instability seems to be that S^T K S tends to be ill-conditioned, with several very small eigenvalues that are blown up in (S^T K S)^{-1} and lead to significant approximation error. See Fig. 2. Several error bounds for the classic Nyström method and the related pseudo-skeleton approximation method (where the sampled sets of rows and columns may be different) applied to indefinite matrices depend on λ_min(S^T K S)^{-1}, and thus grow large when S^T K S has eigenvalues near zero (Cai, Nagy, and Xi 2021; Goreinov, Tyrtyshnikov, and Zamarashkin 1997; Kishore Kumar and Schneider 2017). When K is PSD, by the Cauchy interlacing theorem, S^T K S is at least as well conditioned as K. When K is indefinite this guarantee fails, though there may still exist well-conditioned principal submatrices. Indeed, a number of methods attempt to select S such that S^T K S is well conditioned (Cai, Nagy, and Xi 2021). However, it is not clear how this can be done in sublinear time in general, without further assumptions.
Figure 2: To understand why Nyström fails on indefinite matrices, even when they are relatively near-PSD, we independently sample S^T K S with sample size 200, 50 times. For each sample we compute all eigenvalues, combine them across samples, and plot them in a histogram (x-axis: bin ranges; y-axis: number of eigenvalues). For the STS-B and MRPC matrices, S^T K S often has eigenvalues very close to zero. For Twitter, which is very near-PSD, there are many fewer eigenvalues very close to zero. As we can see in Figure 3, classic Nyström performs well on Twitter, but fails on the other two matrices.

2.3 Submatrix-Shifted Nyström

Algorithm 1: Submatrix-Shifted Nyström (SMS-Nyström)
1: Input: Data {x_i}_{i=1}^n ⊂ X, sample sizes s_1, s_2 with s_2 ≥ s_1, scaling parameter α, similarity function sim : X × X → R.
2: Draw a set S_2 of s_2 indices uniformly at random from {1, . . . , n}. Also draw a set S_1 of s_1 indices uniformly at random from S_2.
3: KS_1 = sim(X, X_{S_1}), S_1^T K S_1 = sim(X_{S_1}, X_{S_1}).
4: S_2^T K S_2 = sim(X_{S_2}, X_{S_2}).
5: e = α · λ_min(S_2^T K S_2).
6: KS_1 = KS_1 − e · I_{n×s_1}, where I_{n×s_1} ∈ R^{n×s_1} has I_ij = 1 if i = j and I_ij = 0 otherwise.
7: S_1^T K S_1 = S_1^T K S_1 − e · I_{s_1×s_1}.
8: Return Z = KS_1 (S_1^T K S_1)^{-1/2} with ZZ^T ≈ K.

Given the above observations, our goal is to extend the Nyström method so that it can be applied to near-PSD matrices. Our approach is based on a simple idea: if we let λ_min(K) denote the minimum eigenvalue of K, then K̂ = K − λ_min(K) · I_{n×n} is PSD. K̂ can thus be approximated with Nyström, and if |λ_min(K)| is not too large, this should yield a good approximation to K. There are two issues with the above approach: (1) λ_min(K) cannot be computed without fully forming K, and (2) when |λ_min(K)| is relatively large, the shift can have a significant negative impact on the approximation quality; this often occurs in practice, see Fig. 1.

We resolve these issues by instead sampling a small principal submatrix of K, computing its minimum eigenvalue, and using this value to shift K. Specifically, consider the Nyström approximation KS_1 (S_1^T K S_1)^{-1} S_1^T K generated by sampling a set of s_1 indices S_1 ⊂ [n]. We let S_2 be a superset of S_1 with size s_2; we typically simply set s_2 = 2 s_1. We then compute e = λ_min(S_2^T K S_2) and apply the Nyström method to K̂ = K − e · I_{n×n}. Since S_2^T K S_2 is a principal submatrix of K, e = λ_min(S_2^T K S_2) ≥ λ_min(K), and thus K̂ will generally not be PSD. However, we do have e ≤ λ_min(S_1^T K S_1), since S_1^T K S_1 is a principal submatrix of S_2^T K S_2. Thus, S_1^T K S_1 − e · I will always be PSD. We also do not expect this matrix to have any very small eigenvalues, since we expect a fairly large gap between λ_min(S_2^T K S_2) and λ_min(S_1^T K S_1) when s_2 is significantly larger than s_1, e.g., s_2 = 2 s_1. To further ensure this, we can multiply e by a small constant factor α > 1 (we typically use α = 1.5) before applying the shift. Since (S_1^T K S_1 − e · I)^{-1} is the joining matrix in the Nyström approximation of K̂, our method resolves the issue of small eigenvalues discussed in Sec. 2.2. As we observe in Sec. 3, this is enough to recover the strong performance of Nyström on many near-PSD matrices. Since the minimum eigenvalue of S_2^T K S_2 is typically much smaller in magnitude than λ_min(K), we often see improved accuracy over the exact correction baseline.

We call our method Submatrix-Shifted Nyström (SMS-Nyström) and give full pseudocode in Algorithm 1. SMS-Nyström requires roughly the same number of similarity computations and running time as classic Nyström. We need to perform (s_2 − s_1)^2 additional similarity computations to form S_2^T K S_2, and must also compute λ_min(S_2^T K S_2), which takes O(s_2^3) time using a full eigendecomposition. However, this value can also be very efficiently approximated using iterative methods, and typically this additional computation is negligible compared to the full Nyström running time.
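The following is a minimal NumPy sketch of Algorithm 1, reusing the hypothetical `sim` callback from the earlier Nyström snippet; the function name, the dense loops, and the numerical thresholding are our own illustrative choices rather than the paper's implementation.

```python
import numpy as np

def sms_nystrom_embeddings(X, sim, s1, s2=None, alpha=1.5, rng=None):
    """Submatrix-Shifted Nystrom (Algorithm 1 sketch): Z with Z @ Z.T approximating K.

    sim is assumed symmetric (symmetrize it first if it is not).
    """
    rng = np.random.default_rng(rng)
    n = len(X)
    s2 = 2 * s1 if s2 is None else s2
    S2 = rng.choice(n, size=s2, replace=False)        # larger sample S_2
    S1 = S2[:s1]                                      # S_1 is a random subset of S_2

    # O(n*s1) similarity calls for the sampled columns KS_1.
    KS1 = np.array([[sim(X[i], X[j]) for j in S1] for i in range(n)])
    # The s2 x s2 block S_2^T K S_2; only its (s2 - s1)^2 entries outside KS_1
    # are new similarity calls (recomputed here for simplicity).
    W2 = np.array([[sim(X[i], X[j]) for j in S2] for i in S2])

    e = alpha * np.linalg.eigvalsh(W2).min()          # e = alpha * lambda_min(S_2^T K S_2)

    # Shift: approximate K - e*I instead of K; only sampled diagonal entries change.
    KS1_shifted = KS1.copy()
    KS1_shifted[S1, np.arange(s1)] -= e               # diagonal of the landmark block
    W1_shifted = KS1_shifted[S1, :]                   # S_1^T K S_1 - e*I

    evals, evecs = np.linalg.eigh(W1_shifted)
    inv_sqrt = np.where(evals > 1e-10, 1.0 / np.sqrt(np.clip(evals, 1e-10, None)), 0.0)
    Z = KS1_shifted @ (evecs @ np.diag(inv_sqrt) @ evecs.T)
    return Z, S1
```

Because the shift e · I only perturbs the diagonal of the target matrix, ZZ^T approximates K − e · I and so matches the off-diagonal similarities of K directly.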
3 Matrix Approximation Results

We now evaluate SMS-Nyström and several baselines in approximating a representative subset of matrices.

CUR Decomposition. In addition to the classic Nyström method, we consider a closely related family of CUR decomposition methods (Mahoney and Drineas 2009; Wang, Zhang, and Zhang 2016; Pan et al. 2019). In CUR decomposition, K ∈ R^{n×n} is approximated as the product of a small subset of columns KS_1 ∈ R^{n×s_1}, a small subset of rows S_2^T K ∈ R^{s_2×n}, and a joining matrix U ∈ R^{s_1×s_2}, i.e., K ≈ KS_1 · U · S_2^T K. KS_1 and S_2^T K are generally sampled randomly; the strongest theoretical bounds require sampling according to row/column norms or matrix leverage scores (Drineas, Kannan, and Mahoney 2006; Drineas, Mahoney, and Muthukrishnan 2008). However, these sampling probabilities require Ω(n^2) time to compute, and so we focus on the setting where columns and rows are sampled uniformly at random. There are multiple possible options for the joining matrix U. Most simply, and analogously to the Nyström method, we can set U = (S_2^T K S_1)^+; this is also called skeleton approximation (Goreinov, Tyrtyshnikov, and Zamarashkin 1997). In fact, if S_1 = S_2 and K is symmetric, this method is identical to Nyström. Alternatively, as suggested, e.g., in (Drineas, Kannan, and Mahoney 2006), we can set s_1 = s_2 = s and U = (n/s) · ((KS_1)^T KS_1)^{-1} S_1^T K S_2. As we will see, these different choices yield very different performance.

Results. We report matrix approximation error vs. sample size for several CUR variants, along with Nyström and SMS-Nyström, on the text similarity matrices from Fig. 1, along with a random PSD matrix (see Fig. 3).

Nyström. As discussed in Sec. 2, while Nyström performs well on the PSD matrix and the Twitter matrix, which is near-PSD, it completely fails on the other matrices.

SMS-Nyström. Our simple Nyström variant with s_2 = 2 s_1 and α = 1.5 performs well on all test cases, matching the strong performance of Nyström on the PSD and Twitter matrices, but still performing well on the less near-PSD cases of STS-B and MRPC.

Skeleton Approximation. Similar results to Nyström are observed for the closely related skeleton approximation method when U = (S_2^T K S_1)^+, s_1 = s_2, and S_1, S_2 are sampled independently. This is unsurprising, as this method is quite similar to Nyström.

SiCUR. If we modify the skeleton approximation, using s_2 > s_1, we also obtain strong results. Many theoretical bounds for CUR with joining matrix U = (S_2^T K S_1)^+ require s_2 > s_1 (cf. (Drineas, Mahoney, and Muthukrishnan 2008)), and this choice has a significant effect. It is similar to how SMS-Nyström regularizes the inner matrix: S_2^T K S_1 is a rectangular matrix whose minimum singular value is unlikely to be too small. We find that setting s_2 = 2 s_1 yields good performance in all cases. To minimize similarity computations, we have S_1 sample a random subset of the indices in S_2. There is very little performance difference if S_1 and S_2 are chosen independently. We call this approach SiCUR, for "Simple CUR"; a sketch follows at the end of this section.

StaCUR. Using the U = (n/s) · ((KS_1)^T KS_1)^{-1} S_1^T K S_2 variant of CUR with s = s_1 = s_2 yields Stable CUR (StaCUR). StaCUR gives good results on all datasets, but is outperformed by Nyström on PSD matrices and by SMS-Nyström and SiCUR in most other cases. Unlike SMS-Nyström and SiCUR, StaCUR has no parameters to tune. Unlike skeleton approximation, setting s_2 > s_1 seems to have little effect, so we keep s_1 = s_2. In Fig. 3 we report results for two variants, StaCUR(s) and StaCUR(d), where S_1, S_2 are set equal or drawn as two independent samples, respectively. StaCUR(s) typically performs better and requires roughly half as many similarity computations, so we use this variant for the remainder of our evaluations.
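As an illustration of the SiCUR variant just described (columns indexed by S_1 ⊂ S_2, rows indexed by S_2, joining matrix U = (S_2^T K S_1)^+), here is a minimal sketch using the same hypothetical `sim` callback as the earlier snippets; names and implementation details are ours, not the paper's.

```python
import numpy as np

def sicur_factors(X, sim, s1, rng=None):
    """SiCUR sketch: returns C, U, R with C @ U @ R approximating K (s2 = 2*s1, S1 inside S2)."""
    rng = np.random.default_rng(rng)
    n = len(X)
    s2 = 2 * s1
    S2 = rng.choice(n, size=s2, replace=False)
    S1 = S2[:s1]                                      # S_1 sampled from within S_2

    # C = K S_1: all points vs. the s1 column landmarks (O(n*s1) sim calls).
    C = np.array([[sim(X[i], X[j]) for j in S1] for i in range(n)])
    # R = S_2^T K: the s2 row landmarks vs. all points (O(n*s2) sim calls; for
    # symmetric K these include the entries of C, which could be reused).
    R = np.array([[sim(X[i], X[j]) for j in range(n)] for i in S2])

    # Joining matrix U = (S_2^T K S_1)^+, pseudoinverse of the rectangular overlap block.
    W = R[:, S1]                                      # S_2^T K S_1, shape (s2, s1)
    U = np.linalg.pinv(W)                             # shape (s1, s2)
    return C, U, R
```

A single approximate similarity can then be read off as C[i] @ U @ R[:, j] without ever forming the full n × n product.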
4 Empirical Evaluation

We evaluate SMS-Nyström, along with SiCUR and StaCUR, on approximating similarity matrices used in document classification, sentence similarity, and cross-document coreference. In each application, we show that our approximation techniques can achieve downstream task performance that matches or is competitive with exact methods, using a fraction of the computation. See the full version (Ray et al. 2021) for more results and details (e.g., hyperparameter optimization).

4.1 Document Classification with WMD

Our first application is approximating word mover's distance (WMD) (Kusner et al. 2015) in document classification. WMD, related to the earth mover's distance, measures the alignment of words in two documents according to a word embedding space.

Word Mover's Embedding. Wu et al. (2018) suggest a PSD similarity function derived from WMD, for which the similarity matrix K can be approximated efficiently as K ≈ ZZ^T using a random features method. The resultant feature embeddings Z are called word mover's embeddings (WME). Experiments show that WME outperforms true WMD in several tasks (Wu et al. 2018).

Our Approach. Following (Wu et al. 2018), we define a similarity function between two documents x, ω by sim(x, ω) = exp(−γ · WMD(x, ω)) for a scalar parameter γ. While this function does not seem to be PSD, it tends to produce near-PSD matrices; see, e.g., the Twitter matrix in Fig. 1. We then approximate the similarity matrix K using our Nyström and CUR variants. For Nyström, we write K̃ = ZZ^T and use Z as document embeddings (see Alg. 1). For CUR, we factor U using its SVD U = WΣV^T as (WΣ^{1/2})(Σ^{1/2}V^T), and use C·W·Σ^{1/2} as document embeddings.
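For the CUR variants, the embedding construction just described amounts to a few lines; the following sketch assumes the C and U factors returned by the hypothetical `sicur_factors` above, with the singular values of U written as the vector `sigma`.

```python
import numpy as np

def cur_embeddings(C, U):
    """Document embeddings from a CUR factorization: rows of C @ W @ diag(sqrt(sigma))."""
    W, sigma, Vt = np.linalg.svd(U, full_matrices=False)   # U = W diag(sigma) V^T
    return C @ W @ np.diag(np.sqrt(sigma))                 # one embedding per document
```

For the Nyström variants, the factor Z returned by the earlier sketches is used directly as the embedding matrix.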
Evaluation. We evaluate the performance of our embeddings in multi-class classification for four different corpora drawn from (Huang et al. 2016; Kusner et al. 2015): Twitter (2176 train, 932 test), Recipe-L (27841 train, 11933 test), Ohsumed (3999 train, 5153 test), and 20News (11293 train, 7528 test). Following (Wu et al. 2018), we compare the performance of the embeddings produced by WME, SMS-Nyström, SiCUR, and StaCUR at several sample sizes s. Small Rank is the dimension at most 550 for which the method achieves highest performance. Large Rank is the dimension at most 4096 (1500 and 2500, resp., for Twitter and Ohsumed) for which the method achieves highest performance. For all except WME, the optimal ranks are typically around the dimension limits. This is expected, since larger samples result in better approximations. As baselines, we also compare against (1) WMD-kernel, which uses the true similarity matrix with entries given by sim(x, ω) = exp(−γ · WMD(x, ω)), and (2) Optimal, which uses the optimal rank-k approximation to K computed with the SVD. This method is inefficient, but can be thought of as giving a cap on the performance of our sublinear time methods.

Results. Our results are reported in Table 1. SMS-Nyström consistently outperforms all other methods, and even at relatively low rank nears the optimal accuracy. In general, the similarity matrix approximation methods tend to outperform the WME baseline. Interestingly, while StaCUR tends to have lower approximation quality on these similarity matrices (see Fig. 3), its performance in downstream classification is comparable to SMS-Nyström and SiCUR. Observe that the approximation methods achieve much higher accuracy than previous work, WME, including an 8 point improvement on 20News. Our approximation methods achieve results that are within 2-4 points of accuracy of the expensive WMD-kernel true similarity matrix, while maintaining sublinear time and a massive space reduction. We also observe that SMS-Nyström and SiCUR can achieve high accuracy for small ranks, compared to both WME and WMD-kernel. The amount of computation we save is considerable, e.g., we require just 14% of the computation for Recipe-L as compared to WMD-kernel.

Figure 3: Evaluation of sublinear time Nyström and CUR variants on the language similarity matrices described in Fig. 1, and a test PSD matrix ZZ^T with Z ∈ R^{1000×1000} having i.i.d. N(0, 1) entries. Error is reported as ||K − K̃||_F / ||K||_F and averaged over 10 trials. The x-axis is s/n (for SiCUR it is s_2/n); the y-axis is approximation error. If a method does not appear, its error may be too large to fall within the plotted range. The error may increase with the number of samples after a certain point; we believe this is because the correction term overwhelms the approximation error. For better viewing please see our full paper (Ray et al. 2021).

Table 1: Results on the document classification task (classification accuracy). The top block reports Small Rank results, the bottom block Large Rank results.

| Method | Twitter | Recipe-L | Ohsumed | 20News |
|---|---|---|---|---|
| WME | 72.5 ± 0.5 | 72.5 ± 0.4 | 55.8 ± 0.3 | 72.9 |
| SMS-N | 75.3 ± 1.3 | 77.7 ± 1.3 | 59.4 ± 1.5 | 79.3 ± 1.3 |
| StaCUR | 73.8 ± 1.5 | 74.9 ± 1.0 | 58.7 ± 2.6 | 76.8 ± 1.6 |
| SiCUR | 74.9 ± 1.5 | 75.9 ± 1.5 | 59.3 ± 1.9 | 73.0 ± 0.6 |
| Optimal | 75.8 | 78.8 | 60.3 | 82.2 |
| WME | 74.5 ± 0.5 | 79.2 ± 0.3 | 64.5 ± 0.2 | 78.3 |
| SMS-N | 76.1 ± 1.2 | 80.7 ± 1.1 | 65.3 ± 1.1 | 86.6 ± 1.5 |
| StaCUR | 71.9 ± 2.3 | 77.1 ± 1.0 | 55.7 ± 0.4 | 84.2 ± 2.1 |
| SiCUR | 75.3 ± 2.1 | 79.5 ± 1.7 | 63.3 ± 2.9 | 85.8 ± 1.0 |
| Optimal | 76.9 | 81.3 | 68.2 | 88.3 |
| WMD-kernel | 78.21 | 82.17 | 69.03 | 89.37 |

4.2 Approximation of Cross-Encoder BERT Similarity Matrices

Our second application is to approximate the similarity of a cross-encoder BERT model (Devlin et al. 2018).

Evaluation. We consider three GLUE benchmark datasets: STS-B, MRPC, and RTE. For each task, we first train the BERT model on the test set, using code from (Wolf et al. 2019). We compute the full BERT similarity matrix for all sentences in the validation set, which consists of a set of sentence pairs, each with a true score derived from human judgements. The similarity matrices for the datasets STS-B, MRPC, and RTE are 3000 × 3000, 816 × 816, and 554 × 554 respectively. We compute approximations to this full similarity matrix using SMS-Nyström, SiCUR, and StaCUR. In general, the BERT similarity matrices are non-PSD (see Fig. 1), and in fact non-symmetric. So that SMS-Nyström can be applied, we symmetrize them, replacing sim(x, ω) with (sim(x, ω) + sim(ω, x)) / 2 (see the sketch below). We use the approximate similarity matrix to make predictions on a dataset of labeled sentences for evaluation. Performance is measured via Pearson and Spearman correlation with the human scores for STS-B, F1 score of predicted labels for MRPC, and accuracy for RTE. We report the average scores obtained with different sample sizes, over 50 runs.

Results. Table 2 reports results for the approximations, the exact (BERT), and the symmetrized (SYM-BERT) approaches. SMS-Nyström performs particularly well on STS-B, while SiCUR performs best on MRPC. All methods are comparable on RTE. This performance is in line with the accuracy in approximating the underlying similarity matrices.
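Since only the sampled rows and columns are ever evaluated, the symmetrization above need only be applied to those entries; a small wrapper suffices (names are illustrative, and `bert_sim` stands in for whatever asymmetric scoring function is used).

```python
def symmetrize(sim):
    """Wrap an asymmetric similarity: sim_sym(x, y) = (sim(x, y) + sim(y, x)) / 2."""
    return lambda x, y: 0.5 * (sim(x, y) + sim(y, x))

# Usage with the earlier sketch: this doubles the model evaluations for the
# sampled entries only, so the O(ns) scaling is unchanged.
# Z, S1 = sms_nystrom_embeddings(sentences, symmetrize(bert_sim), s1=250)
```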
Figure 4: We report the downstream task F1 performance and approximation error on EventCorefBank (ECB+) (x-axis: proportion of the dataset used for landmarks).

Table 2: Comparison of original and approximated BERT similarities on GLUE benchmarks. {R1, R2, R3} for STS-B, MRPC, and RTE are {250, 350, 700}, {100, 250, 500}, and {100, 200, 450} respectively. Ri denotes the ith rank and S-BERT is SYM-BERT.

| Method | Rank | STS-B(P) | STS-B(S) | MRPC | RTE |
|---|---|---|---|---|---|
| SMS-Nyström | @R1 | 75.6 ± 1.3 | 75.3 ± 1.5 | 57.4 ± 2.2 | 60.0 ± 1.1 |
| SMS-Nyström | @R2 | 77.3 ± 1.8 | 76.9 ± 1.8 | 63.9 ± 2.7 | 61.8 ± 2.1 |
| SMS-Nyström | @R3 | 79.4 ± 1.5 | 78.6 ± 1.3 | 63.0 ± 1.1 | 60.2 ± 1.1 |
| StaCUR | @R1 | 28.2 ± 2.3 | 46.8 ± 2.1 | 53.8 ± 4.2 | 58.2 ± 2.2 |
| StaCUR | @R2 | 34.2 ± 1.6 | 49.9 ± 3.2 | 64.4 ± 0.5 | 57.3 ± 1.2 |
| StaCUR | @R3 | 45.9 ± 1.1 | 51.7 ± 1.4 | 67.0 ± 1.1 | 61.4 ± 0.1 |
| SiCUR | @R1 | 45.6 ± 3.1 | 44.9 ± 2.8 | 69.4 ± 3.7 | 61.1 ± 2.2 |
| SiCUR | @R2 | 57.7 ± 2.6 | 56.5 ± 2.4 | 72.4 ± 2.1 | 62.7 ± 1.5 |
| SiCUR | @R3 | 68.8 ± 0.2 | 67.0 ± 0.4 | 75.5 ± 0.9 | 63.3 ± 0.3 |
| BERT | | 85.1 | 84.7 | 83.3 | 66.0 |
| S-BERT | | 85.5 | 85.1 | 83.8 | 66.1 |

4.3 Approximate Similarity Matrices for Entity & Event Coreference

Cross-document entity and event coreference is the task of clustering ambiguous mentions such that each cluster refers to the same entity or event. Cattan et al. (2020) learn a pairwise function (an MLP applied to the concatenation of the RoBERTa (Liu et al. 2019) embeddings of two mentions and their elementwise product), inducing an asymmetric, non-PSD matrix. Average-linkage agglomerative clustering with a similarity threshold is used.

Evaluation. We evaluate the approximation error and the downstream task performance (CoNLL F1 (Pradhan et al. 2014)) of approximating the symmetrized similarity matrix of the model on the EventCorefBank+ (ECB+) corpus (Cybulska and Vossen 2014).

Results. Fig. 4 shows the downstream task performance measured in CoNLL F1 and the approximation error as a function of the number of landmarks used. We find a similar trend as in the previous two tasks. SiCUR performs very well in terms of both metrics, with performance improving as more landmarks are added, achieving nearly the same F1 performance (within 1 point) when 90% of the data is used for landmarks, and very competitive performance (within 1.5 points) with just 50%, a drastic reduction in time/space compared to the exact matrix. SMS-Nyström required additional rescaling for this task, likely due to the sensitivity of the agglomerative clustering threshold. The results indicate that the proposed approximation could help scale models for which the Ω(n^2) similarity computations would be intractable.

5 Conclusion

We have shown that indefinite similarity matrices arising in NLP applications can be effectively approximated in sublinear time. A simple variant of the Nyström method, and several simple CUR approximation methods, all display strong performance in a variety of tasks. We hope that in future work these methods can be used to scale text classification and clustering based on cross-encoder, word mover's distance, and other expensive similarity functions to much larger corpora.

Acknowledgements

This work was supported in part by the Center for Data Science, in part by the Center for Intelligent Information Retrieval, in part by the National Science Foundation under Grants No. 1763618 and 2046235, in part by the International Business Machines Corporation Cognitive Horizons Network agreement number W1668553, along with an Adobe Research Grant. Some of the work reported here was performed using high performance computing equipment obtained under a grant from the Collaborative R&D Fund managed by the Massachusetts Technology Collaborative.
Any opinions, findings and conclusions or recommendations expressed here are those of the authors and do not necessarily reflect those of the sponsor.

References

Bakshi, A.; and Woodruff, D. P. 2018. Sublinear Time Low-Rank Approximation of Distance Matrices. Advances in Neural Information Processing Systems 31.
Belongie, S.; Fowlkes, C.; Chung, F.; and Malik, J. 2002. Spectral partitioning with indefinite kernels using the Nyström extension. In European Conference on Computer Vision.
Bentivogli, L.; Clark, P.; Dagan, I.; and Giampiccolo, D. 2009. The Fifth PASCAL Recognizing Textual Entailment Challenge. In TAC.
Cai, D.; Nagy, J.; and Xi, Y. 2021. Fast and stable deterministic approximation of general symmetric kernel matrices in high dimensions. http://arxiv.org/abs/2102.05215.
Cattan, A.; Eirew, A.; Stanovsky, G.; Joshi, M.; and Dagan, I. 2020. Streamlining Cross-Document Coreference Resolution: Evaluation and Modeling. http://arxiv.org/abs/2009.11032.
Cer, D.; Diab, M.; Agirre, E.; Lopez-Gazpio, I.; and Specia, L. 2017. SemEval-2017 Task 1: Semantic textual similarity multilingual and cross-lingual focused evaluation. http://arxiv.org/abs/1708.00055.
Chen, Y.; Gupta, M. R.; and Recht, B. 2009. Learning kernels from indefinite similarities. In Proceedings of the 26th International Conference on Machine Learning.
Cuturi, M. 2013. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems 26.
Cybulska, A.; and Vossen, P. 2014. Using a sledgehammer to crack a nut? Lexical diversity and event coreference resolution. In LREC.
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. http://arxiv.org/abs/1810.04805.
Dolan, W. B.; and Brockett, C. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing.
Dong, W.; Moses, C.; and Li, K. 2011. Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th International World Wide Web Conference.
Drineas, P.; Kannan, R.; and Mahoney, M. W. 2006. Fast Monte Carlo algorithms for matrices III: Computing a compressed approximate matrix decomposition. SIAM Journal on Computing.
Drineas, P.; Mahoney, M. W.; and Cristianini, N. 2005. On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning. Journal of Machine Learning Research.
Drineas, P.; Mahoney, M. W.; and Muthukrishnan, S. 2008. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications.
Frieze, A.; Kannan, R.; and Vempala, S. 2004. Fast Monte-Carlo algorithms for finding low-rank approximations. Journal of the ACM.
Gionis, A.; Indyk, P.; Motwani, R.; et al. 1999. Similarity search in high dimensions via hashing. In VLDB.
Gisbrecht, A.; and Schleif, F.-M. 2015. Metric and non-metric proximity transformations at linear costs. Neurocomputing.
Gittens, A.; and Mahoney, M. W. 2016. Revisiting the Nyström method for improved large-scale machine learning. The Journal of Machine Learning Research.
Goreinov, S. A.; Tyrtyshnikov, E. E.; and Zamarashkin, N. L. 1997. A theory of pseudoskeleton approximations. Linear Algebra and its Applications.
Huang, G.; Guo, C.; Kusner, M. J.; Sun, Y.; Sha, F.; and Weinberger, K. Q. 2016. Supervised Word Mover's Distance. Advances in Neural Information Processing Systems 29.
Huang, P.-S.; Avron, H.; Sainath, T. N.; Sindhwani, V.; and Ramabhadran, B. 2014. Kernel methods match deep neural networks on TIMIT. In Proceedings of the 2014 International Conference on Acoustics, Speech, and Signal Processing.
Humeau, S.; Shuster, K.; Lachaux, M.-A.; and Weston, J. 2019. Poly-encoders: Transformer architectures and pre-training strategies for fast and accurate multi-sentence scoring. http://arxiv.org/abs/1905.01969.
Hwang, Y.; Han, B.; and Ahn, H.-K. 2012. A fast nearest neighbor search algorithm by nonlinear embedding. In 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE.
Indyk, P.; Vakilian, A.; Wagner, T.; and Woodruff, D. P. 2019. Sample-optimal low-rank approximation of distance matrices. In Proceedings of the 32nd Annual Conference on Computational Learning Theory.
Jiao, X.; Yin, Y.; Shang, L.; Jiang, X.; Chen, X.; Li, L.; Wang, F.; and Liu, Q. 2019. TinyBERT: Distilling BERT for natural language understanding. http://arxiv.org/abs/1909.10351.
Karpukhin, V.; Oguz, B.; Min, S.; Lewis, P.; Wu, L.; Edunov, S.; Chen, D.; and Yih, W.-t. 2020. Dense Passage Retrieval for Open Domain Question Answering. In Conference on Empirical Methods in Natural Language Processing.
Kishore Kumar, N.; and Schneider, J. 2017. Literature survey on low rank approximation of matrices. Linear and Multilinear Algebra.
Kumar, S.; Mohri, M.; and Talwalkar, A. 2012. Sampling methods for the Nyström method. The Journal of Machine Learning Research.
Kusner, M.; Sun, Y.; Kolkin, N.; and Weinberger, K. 2015. From word embeddings to document distances. In International Conference on Machine Learning.
Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; and Soricut, R. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. http://arxiv.org/abs/1909.11942.
Le, Q.; Sarlós, T.; and Smola, A. 2013. Fastfood: Approximating kernel expansions in loglinear time. In Proceedings of the 30th International Conference on Machine Learning.
Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa: A robustly optimized BERT pretraining approach. http://arxiv.org/abs/1907.11692.
Lv, Q.; Josephson, W.; Wang, Z.; Charikar, M.; and Li, K. 2007. Multi-probe LSH: Efficient indexing for high-dimensional similarity search. In VLDB.
Mahoney, M. W. 2011. Randomized algorithms for matrices and data. http://arxiv.org/abs/1104.5557.
Mahoney, M. W.; and Drineas, P. 2009. CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences.
Meanti, G.; Carratino, L.; Rosasco, L.; and Rudi, A. 2020. Kernel methods through the roof: handling billions of points efficiently. Advances in Neural Information Processing Systems 33.
Michel, P.; Levy, O.; and Neubig, G. 2019. Are sixteen heads really better than one? Advances in Neural Information Processing Systems 32.
Musco, C.; and Musco, C. 2017. Recursive sampling for the Nyström method. In Advances in Neural Information Processing Systems 30.
Musco, C.; and Woodruff, D. P. 2017. Sublinear time low-rank approximation of positive semidefinite matrices. In Proceedings of the 58th Annual IEEE Symposium on Foundations of Computer Science.
Oglic, D.; and Gärtner, T. 2019. Scalable learning in reproducing kernel Krein spaces. In Proceedings of the 36th International Conference on Machine Learning.
Orchard, M. T. 1991. A fast nearest-neighbor search algorithm. In Proceedings of the 1991 International Conference on Acoustics, Speech, and Signal Processing.
Pan, V. Y.; Luan, Q.; Svadlenka, J.; and Zhao, L. 2019. CUR Low Rank Approximation of a Matrix at Sublinear Cost. http://arxiv.org/abs/1906.04112.
Piccoli, B.; and Rossi, F. 2014. Generalized Wasserstein distance and its application to transport equations with source. Archive for Rational Mechanics and Analysis.
Pradhan, S.; Luo, X.; Recasens, M.; Hovy, E.; Ng, V.; and Strube, M. 2014. Scoring Coreference Partitions of Predicted Mentions: A Reference Implementation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics.
Rahimi, A.; and Recht, B. 2007. Random Features for Large-Scale Kernel Machines. In Advances in Neural Information Processing Systems 20.
Rastogi, P.; Cotterell, R.; and Eisner, J. 2016. Weighting Finite-State Transductions With Neural Context. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics.
Ray, A.; Monath, N.; McCallum, A.; and Musco, C. 2021. Sublinear Time Approximation of Text Similarity Matrices. http://arxiv.org/abs/2112.09631.
Sanh, V.; Debut, L.; Chaumond, J.; and Wolf, T. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. http://arxiv.org/abs/1910.01108.
Sarlos, T. 2006. Improved approximation algorithms for large matrices via random projections. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS '06).
Schleif, F.-M.; Gisbrecht, A.; and Tino, P. 2018. Supervised low rank indefinite kernel approximation using minimum enclosing balls. Neurocomputing.
Talwalkar, A.; and Rostamizadeh, A. 2014. Matrix coherence and the Nyström method. http://arxiv.org/abs/1408.2044.
Tam, D.; Monath, N.; Kobren, A.; Traylor, A.; Das, R.; and McCallum, A. 2019. Optimal Transport-based Alignment of Learned Character Representations for String Similarity. In Association for Computational Linguistics.
Wang, S.; and Zhang, Z. 2013. Improving CUR matrix decomposition and the Nyström approximation via adaptive sampling. The Journal of Machine Learning Research.
Wang, S.; Zhang, Z.; and Zhang, T. 2016. Towards more efficient SPSD matrix approximation and CUR matrix decomposition. The Journal of Machine Learning Research.
Williams, C.; and Seeger, M. 2001. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 14.
Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. 2019. HuggingFace's Transformers: State-of-the-art Natural Language Processing. http://arxiv.org/abs/1910.03771.
Woodruff, D. P.; et al. 2014. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science.
Wu, L.; Yen, I. E.; Xu, K.; Xu, F.; Balakrishnan, A.; Chen, P.-Y.; Ravikumar, P.; and Witbrock, M. J. 2018. Word mover's embedding: From word2vec to document embedding. http://arxiv.org/abs/1811.01713.
Wu, L.; Yen, I. E.-H.; Zhang, Z.; Xu, K.; Zhao, L.; Peng, X.; Xia, Y.; and Aggarwal, C. 2019. Scalable global alignment graph kernel using random features: From node embedding to graph embedding. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Yang, T.; Li, Y.-F.; Mahdavi, M.; Jin, R.; and Zhou, Z.-H. 2012. Nyström method vs random Fourier features: A theoretical and empirical comparison. Advances in Neural Information Processing Systems 25.
Zafrir, O.; Boudoukh, G.; Izsak, P.; and Wasserblat, M. 2019. Q8BERT: Quantized 8bit BERT. http://arxiv.org/abs/1910.06188.
Zhang, K.; Tsang, I. W.; and Kwok, J. T. 2008. Improved Nyström low-rank approximation and error analysis. In Proceedings of the 25th International Conference on Machine Learning.