# C-Min Hash: Improving Minwise Hashing with Circulant Permutation

Xiaoyun Li, Ping Li
Cognitive Computing Lab, Baidu Research
10900 NE 8th St., Bellevue, WA 98004, USA
{lixiaoyun996, pingli98}@gmail.com

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Minwise hashing (Min Hash) is an important and practical algorithm for generating random hashes to approximate the Jaccard (resemblance) similarity in massive binary (0/1) data. The basic theory of Min Hash requires applying hundreds or even thousands of independent random permutations to each data vector in the dataset, in order to obtain reliable results for (e.g.,) building large-scale learning models or approximate near neighbor search. In this paper, we propose Circulant Min Hash (C-Min Hash) and provide the surprising theoretical results that using only two independent random permutations in a circulant manner leads to uniformly smaller Jaccard estimation variance than that of the classical Min Hash with K independent permutations. Experiments are conducted to show the effectiveness of the proposed method. We also propose a more convenient C-Min Hash variant which reduces the two permutations to just one, with extensive numerical results to validate that it achieves essentially the same estimation accuracy as using two permutations.

1. Introduction

Given two D-dimensional binary vectors v, w ∈ {0, 1}^D, the Jaccard similarity is defined as

    J(v, w) = Σ_{i=1}^{D} 1{v_i = w_i = 1} / Σ_{i=1}^{D} 1{v_i + w_i ≥ 1},    (1)

which is a commonly used similarity metric in machine learning and web search applications. The vectors v and w can also be viewed as two sets of items (which represent the locations of non-zero entries), where the Jaccard similarity can be equivalently viewed as the size of the set intersection over the size of the set union. In large-scale search and learning, directly calculating the pairwise Jaccard similarity among the sample points becomes too costly as the sample size grows. The well-known method of minwise hashing (Min Hash) (Broder, 1997; Broder et al., 1997; 1998; Li and Church, 2005; Li and König, 2011) is a standard technique for computing/estimating the Jaccard similarity in massive binary datasets, with numerous applications in near neighbor search, duplicate detection, malware detection, clustering, large-scale learning, social networks, computer vision, etc. (Charikar, 2002; Fetterly et al., 2003; Henzinger, 2006; Das et al., 2007; Buehrer and Chellapilla, 2008; Gamon et al., 2008; Bendersky and Croft, 2009; Chierichetti et al., 2009; Najork et al., 2009; Pandey et al., 2009; Lee et al., 2010; Li et al., 2011; Deng et al., 2012; Chum and Matas, 2012; Li et al., 2012; Shrivastava and Li, 2012; He et al., 2013; Tamersoy et al., 2014; Shrivastava and Li, 2014; Zamora et al., 2016; Ondov et al., 2016; Zhu et al., 2017; Nargesian et al., 2018; Wang et al., 2019; Lemiesz, 2021; Tseng et al., 2021; Feng and Deng, 2021; Jia et al., 2021).

1.1. A Review of Minwise Hashing (Min Hash)

Algorithm 1 Minwise hashing (Min Hash)
Input: Binary data vector v ∈ {0, 1}^D; K independent permutations π_1, ..., π_K: [D] → [D]
Output: K hash values h_1(v), ..., h_K(v)
For k = 1 to K
    h_k(v) ← min_{i: v_i ≠ 0} π_k(i)

We first recap the method of minwise hashing. For simplicity, Algorithm 1 considers just one vector v ∈ {0, 1}^D.
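As a quick illustration of Algorithm 1, the following minimal NumPy sketch (our own, not the authors' code; the function name is hypothetical) computes the K hash values of a single binary vector:

```python
import numpy as np

def minhash(v, perms):
    """Sketch of Algorithm 1: v is a binary 0/1 vector, perms is a (K, D) array
    whose k-th row is an independent permutation of {0, ..., D-1}.
    Returns h_k(v) = min over the non-zero i of pi_k(i), for k = 1, ..., K."""
    nonzero = np.flatnonzero(v)               # locations of the non-zero entries of v
    return perms[:, nonzero].min(axis=1)      # smallest permuted index per permutation

# tiny usage example
rng = np.random.default_rng(0)
D, K = 20, 8
perms = np.array([rng.permutation(D) for _ in range(K)])
v = (rng.random(D) < 0.3).astype(int)
print(minhash(v, perms))
```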
In order to generate K hash values for v, we assume K independent permutations: π1, ..., πK : [D] 7 [D]. For each permutation, the hash value is the first non-zero location in the permuted vector, i.e., hk(v) = mini:vi =0 πk(i), k = 1, ..., K. Similarly, for another binary vector w {0, 1}D, C-Min Hash: Improving Minwise Hashing with Circulant Permutation using the same K permutations, we can also obtain K hash values, hk(w). The estimator of J(v, w) is simply ˆJMH(v, w) = 1 k=1 1{hk(v) = hk(w)}, (2) where 1{ } is the indicator function. By fundamental probability and the independence among the permutations, it is easy to show that E[ ˆJMH] = J, V ar[ ˆJMH] = J(1 J) How large is K? The answer depends on the application domains. For example, for training large-scale machine learning models, it appears that K = 512 or K = 1024 might be sufficient (Li et al., 2011). However, for approximate near neighbor search using many hash tables (Indyk and Motwani, 1998), it is likely that K might have to be much larger than 1024 (Shrivastava and Li, 2012; 2014). In the early work of Min Hash (Broder, 1997; Broder et al., 1997), actually only one permutation was used by storing the first K non-zero locations after the permutation. Later, Li and Church (2005) proposed better estimators to improve the estimation accuracy. The major drawback of the original scheme was that the hashed values did not form a metric space (e.g., satisfying the triangle inequality) and hence could not be used in many algorithms/applications. We believe this was the main reason why the original authors moved to using K permutations (Broder et al., 1998). 1.2. Hashing for Non-binary Data We believe the idea of using randomness circulantly, as studied in our paper, might be helpful in broader applications. For example, minwise hashing can also be extended to the non-binary data. For two non-negative data vectors v, w RD +, the weighted Jaccard similarity is defined as PD i=1 min(vi, wi) PD i=1 max(vi, wi) , (4) which obviously becomes Eq. (1) in binary data. Consistent weighted sampling (CWS) (Manasse et al., 2010; Ioffe, 2010) is the standard hashing method for the weighted Jaccard in massive data. In general, CWS can be applied to the scenarios where Min Hash is found useful, and in many cases CWS might be more feasible as real-valued data typically contains more information than binary. As an algorithm, CWS is considerably much more complex than Min Hash and essentially reduces to Min Hash in binary data. Recently, Li et al. (2021) developed a family of new algorithms for hashing weighted Jaccard based on extremal processes. Li and Zhang (2017) generalized (4) to datasets with negative entries and Li and Zhao (2022) reported their efforts on using CWS and variants for training deep neural networks. 1.3. Outline of Main Results From K Permutations to two. Using K independent permutations in Min Hash has been widely used as the standard approach in textbooks and industry for over two decades. The main idea of this work, is to replace the independent permutations in Min Hash with circulant permutations. Thus, we name the proposed framework C-Min Hash (circulant Min Hash). The circulant trick was used in the literature of random projections. For example, Yu et al. (2017) showed that using circulant projections hurts the estimation accuracy, but not by too much when the data are sparse. 
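For context only (this is the circulant trick for random projections, not the method proposed in this paper), a minimal sketch of reusing a single random direction r via circular shifts, in place of K independent directions, might look as follows; it omits the sign binarization and random sign-flipping used by Yu et al. (2017), and the function name is our own:

```python
import numpy as np

def circulant_projections(x, r, K):
    """Project x onto K circulant shifts of ONE random vector r, i.e., onto K rows
    of the circulant matrix generated by r (cf. Yu et al., 2017)."""
    return np.array([np.dot(np.roll(r, k), x) for k in range(K)])

rng = np.random.default_rng(0)
D, K = 1024, 16
x = rng.random(D)
r = rng.standard_normal(D)        # one random direction, reused circulantly
z = circulant_projections(x, r, K)
```

C-Min Hash applies the analogous idea to permutations rather than to projection directions.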
In Section 3, we present some (perhaps surprising) theoretical findings that we just need 2 permutations in Min Hash and the results (estimation variances) are even more accurate. Basically, with the initial permutation (denoted by σ), we randomly shuffle the data to break whatever structure which might exist in the original data, and then the second permutation (denoted by π) is applied and re-used K times to generate K hash values, via circulation. This method is called C-Min Hash-(σ, π). Before that, in Section 2, we analyze a simpler variant C-Min Hash-(0, π) without initial permutation σ. Although it is not our recommended method, our analysis for C-Min Hash-(0, π) provides the necessary preparation for later methods and the intuition to understand the need for the initial permutation. From two permutations to one. Section 5 provides a convenient variant C-Min Hash-(π, π) that only needs one permutation π for both pre-processing and hashing. The resultant estimator is no longer unbiased but the bias is extremely small and has essentially no impact on the estimation accuracy, as verified by extensive numerical experiments. 2. C-Min Hash-(0, π) Without Initial Permutation Algorithm 2 C-Min Hash-(0, π) Input: Binary data vector v {0, 1}D; Permutation vector π: [D] [D] Output: Hash values h1(v), ..., h K(v) For k = 1 to K Shift π circulantly rightwards by k units: πk = π k hk(v) mini:vi =0 π k(i) As shown in Algorithm 2, the C-Min Hash-(0, π) algorithm has similar operations as Min Hash. The difference lies in the permutations used in the hashing process. To generate each hash hk(v), we permute the data vector using π k, which is the permutation shifted k units circulantly towards right based on π. For example, π = [3, 1, 2, 4], π 1 = C-Min Hash: Improving Minwise Hashing with Circulant Permutation 𝝅 𝒌= {𝟓, 𝟖, 𝟏, 𝟕, 𝟑, 𝟒, 𝟔, 𝟐} 𝝅 𝒌+𝟏= {𝟐, 𝟓, 𝟖, 𝟏, 𝟕, 𝟑, 𝟒, 𝟔} Figure 1. An illustration of the idea of C-Min Hash. The data vector has three non-zeros, v2 = v4 = v5 = 1. In this example, we get hash values hk(v) = 3, hk+1(v) = 1. [4, 3, 1, 2], π 2 = [2, 4, 3, 1], etc. Conceptually, we may think of circulation as concatenating the first and last elements of a vector to form a circle; see Figure 1 for an illustration. We set the hash value hk(v) as the position of the first non-zero after being permuted by π k. Analogously, we define the C-Min Hash-(0, π) estimator of the Jaccard similarity J(v, w) as k=1 1{hk(v) = hk(w)}, (5) where h is the hash value output by Algorithm 2. In this paper, for simplicity, we assume K D. Next, we present the theoretical analysis for Algorithm 2, in terms of the expectation (mean) and the variance of the estimator ˆJ0,π. Our results reveal that the estimation accuracy depends on the initial data distribution, which may lead to undesirable performance behaviors when real-world datasets exhibit various structures. On the other hand, while it is not our recommended method, the analysis serves a preparation (and insight) for the C-Min Hash-(σ, π) which will soon be described. First, we introduce some notations and definitions. Given v, w {0, 1}D, we define a and f as i=1 1{vi = wi = 1}, f = i=1 1{vi + wi 1}. (6) We say that (v, w) is a (D, f, a)-data pair, whose Jaccard similarity can be written as J = a/f. Definition 2.1. Consider v, w {0, 1}D. Define the location vector for v, w as x {O, , }D, with xi being O , , , when vi = wi = 1, vi + wi = 1 and vi = wi = 0, respectively. The location vector x can fully characterize a hash collision. 
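Before turning to the collision analysis, a minimal NumPy sketch of Algorithm 2 and the estimator in (5) may help fix ideas (our own illustration; function names are hypothetical):

```python
import numpy as np

def c_minhash_0_pi(v, pi, K):
    """Sketch of Algorithm 2 (C-Min Hash-(0, pi)): one permutation pi of
    {0, ..., D-1}, shifted circulantly rightwards by k units for the k-th hash,
    h_k(v) = min over the non-zero i of pi_k(i)."""
    nonzero = np.flatnonzero(v)                        # non-zero locations of v
    return np.array([np.roll(pi, k)[nonzero].min() for k in range(1, K + 1)])

def estimate_jaccard(hv, hw):
    """Estimator (5): the fraction of hash collisions between two vectors."""
    return np.mean(hv == hw)

# e.g., rolling pi = [3, 1, 2, 4] rightwards by one unit gives [4, 3, 1, 2],
# matching the example in the text (values shown 1-indexed as in the paper).
```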
When a permutation π k is applied, the hashes hk(v) and hk(w) would collide if after permutation, the first O is placed before the first (counting from small to large); the location of entries would not affect the collision. This observation will be the key in our theoretical analysis. Definition 2.2. For A, B {O, , }, let {(A, B)| } denote the set {(i, j) : (xi, xj) = (A, B), j i = }. For each 1 K 1, define L0( ) = {(O, O)| }, L1( ) = {(O, )}, L2( ) = {(O, )}, G0( ) = {( , O)| }, G1( ) = {( , )} , G2( ) = {( , )}, H0( ) = {( , O)| }, H1( ) = {( , )}, H2( ) = {( , )}. Remark 2.3. For the ease of notation, by circulation we write xj = xj D when D < j < 2D. Definition 2.2 measures the relative location of different types of points in the location vector, for a specific pair of data vectors. Moreover, one can easily verify that for 1 K 1, |L0| + |L1| + |L2| = |L0| + |G0| + |H0| = a, |G0| + |G1| + |G2| = |L2| + |G2| + |H2| = D f, |H0| + |H1| + |H2| = |L1| + |G1| + |H1| = f a, (7) which is the intrinsic constraints on the size of above sets. We are now ready to analyze the expectation and variance of ˆJ0,π. It is easy to see that ˆJ0,π is still unbiased, i.e., E[ ˆJ0,π] = J, by linearity of expectation. Lemma 2.4 provides an important quantity that leads to V ar[ ˆJ0,π] which is given in Theorem 2.5. All the missing proofs in the paper are placed in Appendix A. Lemma 2.4. For any 1 s < t K with t s = , we have that Eπ 1{hs(v) = hs(w)}1{ht(v) = ht(w)} = |L0( )| + (|G0( )| + |L2( )|)J f + |G0( )| + |G1( )| , where the sets are defined in Definition 2.2 and hs, ht are the hash values output by Algorithm 2. Theorem 2.5. For C-Min Hash-(0, π), the variance of ˆJ0,π is given by V ar[ ˆJ0,π] = J K + 2 PK s=2(s 1)ΘK s+1 where Θ Eπ 1{hs(v) = hs(w)}1{ht(v) = ht(w)} as in Lemma 2.4 with any t s = . Proof. We use 1s to denote 1{hs(v) = hs(w)}, 1 s K. By the expansion of variance formula, since E[12 s] = E[1s] = J, we have V ar[ ˆJ0,π] = J PK s=1 PK t =s E[1s1t] C-Min Hash: Improving Minwise Hashing with Circulant Permutation Note here that for t > s, the t-th hash sample uses πt as the permutation, which is shifted rightwards by = t s from πs. Thus, we have E[1s1t] = E[1s i1t i] for 0 < i < s t, which implies E[1s1t] = E[111t s+1], s < t. Since by assumption K D, we have t =s E[1s1t] = 2E (1112 + 1113 + ... + 111K) + (1213 + ... + 121K) + ... + 1K 11K s=2 (s 1)E[111K s+2] 2 s=2 (s 1)ΘK s+1. The result then follows. From Theorem 2.5, we see that the variance of ˆJ0,π depends on a, f, and the sizes of sets L s and G s as in Definition 2.1, which are determined by the location vector x. Since we use the original data vectors without randomly permuting the entries beforehand, V ar[ ˆJ0,π] is called location-dependent as it is dependent on the location of non-zero entries of the original data. Consequently, as will also be shown in our numerical study, V ar[ ˆJ0,π] may be either smaller or larger than that of Min Hash estimate ˆJMH up to different structure of the data vectors. 3. C-Min Hash-(σ, π) with Independent Initial Permutation Algorithm 3 C-Min Hash-(σ, π) Input: Binary data vector v {0, 1}D; Permutation vectors π and σ: [D] [D] Output: Hash values h1(v), ..., h K(v) Initial permutation: v = σ(v) For k = 1 to K Shift π circulantly rightwards by k units: πk = π k hk(v) mini:v i =0 π k(i) Next, we present an improved algorithm by eliminating the location-dependennce of C-Min Hash-(0, π) as analyzed above. The method C-Min Hash-(σ, π) is summarized in Algorithm 3, which is very similar to Algorithm 2. 
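In code, the only change from the previous sketch is the initial shuffle of the data by σ before the circulant hashing (again our own illustration; the indexing below is one convention for applying σ to the data vector):

```python
import numpy as np

def c_minhash_sigma_pi(v, sigma, pi, K):
    """Sketch of Algorithm 3 (C-Min Hash-(sigma, pi)).
    sigma: initial permutation applied to the data, v' = sigma(v);
    pi:    permutation reused circulantly for the K hashes (as in Algorithm 2)."""
    v_prime = np.empty_like(v)
    v_prime[sigma] = v                   # one convention: entry i of v moves to position sigma(i)
    nonzero = np.flatnonzero(v_prime)
    return np.array([np.roll(pi, k)[nonzero].min() for k in range(1, K + 1)])
```

Both vectors of a pair must be hashed with the same (σ, π); as before, the estimator averages the K collision indicators.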
This time, as pre-processing, we apply an initial permutation σ π on the data to break whatever structures which might exist. Analogously, we define the C-Min Hash-(σ, π) estimator as k=1 1{hk(v) = hk(w)}, (8) where hk s are the hash values output by Algorithm 3. In the remaining part of this section, we will present our detailed theoretical analysis and the main result (Theorem 3.4). First, by linearity of expectation and the fact that σ and π are independent, it is easy to verify that ˆJσ,π is still an unbiased estimator of J. Based on Theorem 2.5, in the following we provide the exact variance formula of ˆJσ,π. Theorem 3.1. Let a, f be defined as in (6). When 0 < a < f D (J / {0, 1}), we have V ar[ ˆJσ,π] = J K + (K 1) E where l = max(0, D 2f + a), and l0 f + g0 + g1 + a(g0 + l2) (f + g0 + g1)f D a 1 D f 1 f a 1 D f s 1 s n1 D f s n2 D f s n3 f a (D f s) n4 a 1 a l1 l2 The feasible set Ξ = {l0, l2, g0, g1} satisfies the intrinsic constraints (7), and n1 = g0 (D f s g1), n2 = D f s g1, n3 = l2 g0 + (D f s g1), n4 = l1 (D f s g1). When a = 0 or f = a (J = 0 or 1), V ar[ ˆJσ,π] = 0. As expected, since the original locational structure of the data is broken by the initial permutation σ, V ar[ ˆJσ,π] only depends on the values of (D, f, a) but not the specific set sizes as in Theorem 2.5, i.e., it is location-independent . This would make the performance of C-Min Hash-(σ, π) consistent in different tasks. In the sequel, we investigate the statistical properties of V ar[ ˆJσ,π] in more details. Firstly, same as Min Hash, Proposition 3.2 states that given D and f, the variance of ˆJσ,π is symmetric about J = 0.5, as illustrated in Figure 2, which also shows that the variance of ˆJσ,π is smaller than the variance of the original Min Hash. 0 0.2 0.4 0.6 0.8 1 J Min Hash 0.2 D = 1000 K = 500 0 0.2 0.4 0.6 0.8 1 J 0.2 0.4 0.6 D = 1000 K = 800 Figure 2. V ar[ ˆJσ,π] versus J, with D = 1000 and varying f. Left: K = 500. Right: K = 800. C-Min Hash: Improving Minwise Hashing with Circulant Permutation Proposition 3.2 (Symmetry). V ar[ ˆJσ,π] is the same for the (D, f, a)-data pair and the (D, f, f a)-data pair, 0 a f D. A rigorous comparison of V ar[ ˆJσ,π] and V ar[ ˆJMH] appears to be a challenging task given the complicated combinatorial form of V ar[ ˆJσ,π]. The following lemma characterizes an important property of E in (10), that it is monotone in D when a and f are fixed, as illustrated in Figure 3 (left). Lemma 3.3 (Strict Increment). Let f > a > 0 and K be arbitrary and fixed. Denote ED as in (10) in Theorem 3.1, with D is a parameter. Then, ED+1 > ED for D f. Equipped with Lemma 3.3, we arrive at the following main theoretical result of this work, on the uniform variance reduction of C-Min Hash-(σ, π). Theorem 3.4 (Uniform Superiority). For any two binary vectors v, w {0, 1}D with J = 0 or 1, it holds that V ar[ ˆJσ,π(v, w)] < V ar[ ˆJMH(v, w)]. Remark 3.5. In fact, from the proof of Lemma 3.3 and Theorem 3.4, we can show that the collision indicator variables 1{hk(v) = hk(w)}, k = 1, ..., K, in (8) are pairwise negatively correlated. This provides intuition on the source of variance reduction. Proof. By assumption we have 0 < a < f. To compare V ar[ ˆJσ,π] with V ar[ ˆJMH] = J(1 J) K + (K 1)J2 K J2, it suffices to compare E with J2. When D = f, we know that the location vector x of (v, w) contains no elements. It is easy to verify that in this case, |G0| = |G1| = |L2| = 0, and |L0| follows hyper(f 1, a, a 1). By Theorem 3.1, it follows that when D = f, f E[|L0|] = a(a 1) f(f 1) = J J < J2. 
Recall the definition J̃ = (a − 1)/(f − 1), which is always smaller than J. On the other hand, as D → ∞, we have |L0| → 0, |L2| → a, |G0| → a and |G1| → f − a. We can show that E_D → J² as D → ∞. By Lemma 3.3, the sequence (E_f, E_{f+1}, E_{f+2}, ...) is strictly increasing. Since it is convergent with limit J², by the Monotone Convergence Theorem we know that E_D < J² for all D ≥ f. This completes the proof.

Theorem 3.4 says that using merely two permutations as in C-Min Hash-(σ, π) improves the Jaccard estimation variance of standard Min Hash, in all cases. In other words, using two permutations can be strictly better than using K independent permutations in minwise hashing. How does the variance of Ĵ_σ,π depend on a, f and K? First, interestingly, in Proposition 3.6 we show that the relative variance reduction of C-Min Hash-(σ, π) over Min Hash is the same for any value of a at given f and K, i.e., the relative improvement is independent of the Jaccard value J at a given sparsity level.

[Figure 3. Left: theoretical E with f = 10 fixed; each dashed line represents the corresponding J². Right: variance ratio Var[Ĵ_MH]/Var[Ĵ_σ,π] as a function of K/D, with D = 1000 and varying f/D. The plot holds for all values of a (by Proposition 3.6).]

Proposition 3.6 (Consistent Improvement). Suppose f is fixed. In terms of a, the variance ratio Var[Ĵ_MH(v, w)]/Var[Ĵ_σ,π(v, w)] is a constant for any 0 < a < f.

To investigate the influence of the sparsity f and the number of hashes K on the variance, in Figure 3 (right) we plot the variance ratio Var[Ĵ_MH]/Var[Ĵ_σ,π] for different f and K. The results in Figure 3 again verify Theorem 3.4, as the variance ratio is always greater than 1. We see that the improvement in variance increases both with K (i.e., more hashes) and with f (i.e., more non-zero entries). Note that, by Proposition 3.6, we do not need to consider a here since it does not affect the variance ratio.

4. Numerical Experiments

In this section, we provide numerical experiments to validate our theoretical findings and demonstrate that C-Min Hash can indeed lead to smaller Jaccard estimation errors.

4.1. Sanity Check: a Simulation Study

A simulation study is conducted on synthetic data to verify the theoretical variances given by Theorem 2.5 and Theorem 3.1. We simulate D = 128 dimensional binary vector pairs (v, w) with different f and a, which have a special locational structure: in the location vector x, the a positions with v_i = w_i = 1 come first, followed by the (f − a) positions with v_i + w_i = 1, and then the (D − f) positions with v_i = w_i = 0. We plot the empirical and theoretical mean square errors (MSE = variance + bias²) in Figure 4, and we observe:

- The theoretical variance matches the empirical results, confirming Theorem 2.5 and Theorem 3.1.
- The variance reduction effect becomes more significant as the number of hashes K grows.
- Var[Ĵ_σ,π] is always smaller than Var[Ĵ_MH], as stated by Theorem 3.4. In contrast, Var[Ĵ_0,π] (C-Min Hash-(0, π)) varies significantly across different data structures, as discussed in Section 2.

[Figure 4. Empirical vs. theoretical variance of Ĵ_0,π (C-Min Hash-(0, π)) and Ĵ_σ,π (C-Min Hash-(σ, π)) on synthetic binary data vector pairs with different data statistics; panels correspond to (D, f, a) = (128, 8, 1), (128, 8, 2), (128, 8, 4), (128, 64, 2), (128, 64, 8), and (128, 128, 16), each showing the simulation, the (0, π) theory, the (σ, π) theory, and Min Hash, for K = 1 to 128.]

4.2. Jaccard Estimation on Text and Image Datasets

We test C-Min Hash on four public datasets, including two text datasets, the NIPS full paper dataset from the UCI repository (Dua and Graff, 2017) and the BBC News dataset (Greene and Cunningham, 2006), and two popular image datasets, the MNIST dataset (Le Cun et al., 1998) with hand-written digits and the CIFAR dataset (Krizhevsky, 2009) containing natural images. All the datasets are processed to be binary. For image data, we first transform the images to gray-scale, then binarize the samples by thresholding at 0.5. For each dataset with n data vectors, there are in total n(n − 1)/2 data vector pairs. We estimate the Jaccard similarities for all the pairs and report the mean absolute errors (MAE). All the results are averaged over 10 independent repetitions. We report the MAE in Figure 5, from which we see that:

- The MAE of C-Min Hash-(σ, π) is consistently smaller than that of Min Hash, demonstrating the practical merit of the variance reduction (Theorem 3.4) for improving the Jaccard estimation accuracy. The improvements become more substantial with larger K, which is consistent with Figure 3 and Figure 4.
- Without the initial permutation σ, the accuracy of C-Min Hash-(0, π) depends on the distribution/structure of the original data, and it is worse than C-Min Hash-(σ, π) on all four datasets. In addition, the performance of C-Min Hash-(0, π) on image data is much worse than on text data, which we believe is because the image datasets contain more structural patterns. This again suggests that the initial permutation σ might be needed in practice.

[Figure 5. Mean Absolute Error (MAE) of pairwise Jaccard estimation: Min Hash vs. C-Min Hash-(0, π) vs. C-Min Hash-(σ, π) on the four real-world datasets, with K ranging over powers of 2 (from roughly 2^5 up to 2^12, depending on the dataset).]

In summary, the simulation study has verified the correctness of our theoretical findings in Theorem 2.5 and Theorem 3.1. The experiments with Jaccard estimation on four real-world datasets confirm that C-Min Hash is more accurate than the original Min Hash, and the initial permutation σ is recommended.

5. C-Min Hash-(π, π): Practically Reducing to One Permutation

In this section, we propose a more convenient variant, C-Min Hash-(π, π), which only requires one permutation. That is, π is used for both pre-processing and circulant hashing. The procedure is the same as Algorithm 3, except that the initial permutation σ is replaced by π. Denote the corresponding Jaccard estimator by Ĵ_π,π. The complicated dependency between π (for the initial permutation) and π_k (for hashing) makes the estimator no longer unbiased. Nevertheless, we found through extensive numerical experiments that the MSE of Ĵ_π,π is essentially the same as that of Ĵ_σ,π.
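In code, this variant is simply the previous sketch with σ set to π itself (our own illustration, reusing the hypothetical function from the earlier sketch):

```python
def c_minhash_pi_pi(v, pi, K):
    """Sketch of C-Min Hash-(pi, pi): the same permutation pi both shuffles the
    data and, circulantly shifted, generates the K hashes."""
    return c_minhash_sigma_pi(v, sigma=pi, pi=pi, K=K)
```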
[Figure 6. Estimator MSE on simulated data pairs (one panel per (D, f, a) setting, e.g., (128, 128, 16)). "1 Perm" is C-Min Hash-(π, π), and "2 Perm Theo." is the theoretical variance of C-Min Hash-(σ, π) (Theorem 3.1).]

Figure 6 compares the empirical MSE of C-Min Hash-(π, π) with the theoretical variance of C-Min Hash-(σ, π) on simulated data vector pairs. In Figure 7, we present the MAE comparison on the real datasets, where we see that the curves for the two estimators (Ĵ_σ,π and Ĵ_π,π) match well.

To illustrate the bias and variance on specific data pairs in more detail, we test C-Min Hash-(π, π) on the Words dataset (Li and Church, 2005). For each data point, the i-th 0/1 entry indicates whether a word appears in the i-th document, for a total of D = 2^16 documents. See the key statistics of the 120 selected word pairs in Table 1. Those pairs of words are more or less randomly selected, except that we make sure they cover a wide spectrum of data distributions. Denote by d the density of a word vector, i.e., the number of non-zero entries divided by D. Table 1 reports the density d for each word vector, ranging from 0.0006 to 0.6; the Jaccard similarity J ranges from 0.002 to 0.95. In Figures 8-15 (also see Appendix B), we plot the empirical MSE along with the empirical bias² for Ĵ_π,π, as well as the empirical MSE for Ĵ_σ,π. From the results in the figures, we observe:

- For all the data pairs, the MSE of the C-Min Hash-(π, π) estimator overlaps with the empirical MSE of the C-Min Hash-(σ, π) estimator for all K from 1 up to 4096.
- The bias² of C-Min Hash-(π, π) is several orders of magnitude smaller than the MSE, for all data pairs. This demonstrates that the bias of Ĵ_π,π is extremely small and can be safely neglected in practice.

In all figures, the overlapping curves validate our claim that in practice we just need one permutation π in C-Min Hash.

[Figure 7. MAE of Jaccard estimation on the four datasets. "1 Perm" is C-Min Hash-(π, π), and "2 Perm" is C-Min Hash-(σ, π).]

6. Discussion and Conclusion

The method of minwise hashing (Min Hash), from the seminal works of Broder and his colleagues, has become standard in industrial practice. One fundamental reason for its wide applicability is that the binary (0/1) high-dimensional representation is convenient and suitable for a wide range of practical scenarios. To estimate the Jaccard similarity on binary data, the standard Min Hash requires K independent permutations, where K, the number of hashes, can be several hundreds or even thousands in practice. We have proposed Circulant Min Hash (C-Min Hash) and presented the surprising theoretical result that, with merely 2 permutations, we still obtain an unbiased estimate of the Jaccard similarity whose variance is strictly smaller than that of the original Min Hash, as confirmed by numerical experiments on simulated and real datasets. The initial permutation is applied to break whatever structure the original data may exhibit; the second permutation is re-used K times in a circulant shifting fashion. Moreover, we propose a more convenient C-Min Hash variant which uses only 1 permutation for both pre-processing and circulant hashing. We validate through extensive experiments that it does not result in loss of accuracy in practice. Practically speaking, our theoretical results may reveal a useful direction for designing hashing methods.
For example, in many applications, using permutation vectors of length (e.g.,) 230 might be sufficient. While it is perhaps unrealistic to store (e.g.,) K = 1024 such permutation vectors in the memory, one can afford to store two such permutations (even in GPU memory). Using perfectly random permutations in lieu of approximate permutations would simplify the design and analysis of randomized algorithms and ensure that the practical performance strictly matches the theory. C-Min Hash: Improving Minwise Hashing with Circulant Permutation Table 1. 120 selected word pairs from the Words dataset (Li and Church, 2005). For each pair, we report the density d (number of non-zero entries divided by D = 216) for each word as well as the Jaccard similarity J. Both d and J cover a wide range of values. d1 d2 J d1 d2 J ABOUT - INTO 0.302 0.125 0.258 NEW - WEB 0.291 0.194 0.224 ABOUT - LIKE 0.302 0.140 0.281 NEWS - LIKE 0.168 0.140 0.172 ACTUAL - DEVELOPED 0.017 0.030 0.071 NO - WELL 0.220 0.120 0.244 ACTUAL - GRABBED 0.017 0.002 0.016 NOT - IT 0.281 0.295 0.437 AFTER - OR 0.103 0.356 0.220 NOTORIOUSLY - LOCK 0.0006 0.006 0.004 AND - PROBLEM 0.554 0.044 0.070 OF - THEN 0.570 0.104 0.168 AS - NAME 0.280 0.144 0.204 OF - WE 0.570 0.226 0.361 AT - CUT 0.374 0.242 0.052 OPPORTUNITY - COUNTRIES 0.029 0.024 0.066 BE - ONE 0.323 0.221 0.403 OUR - THAN 0.244 0.125 0.245 BEST - AND 0.136 0.554 0.228 OVER - BACK 0.148 0.160 0.233 BRAZIL - OH 0.010 0.031 0.019 OVER - TWO 0.148 0.121 0.289 BUT - MANY 0.167 0.116 0.340 PEAK - SHOWS 0.006 0.033 0.026 CALLED - BUSINESSES 0.016 0.018 0.043 PEOPLE - BY 0.121 0.425 0.228 CALORIES - MICROSOFT 0.002 0.045 0.0003 PEOPLE - INFO 0.121 0.138 0.117 CAN - FROM 0.243 0.326 0.444 PICKS - BOOST 0.007 0.005 0.007 CAN - SEARCH 0.243 0.214 0.237 PLANET - REWARD 0.013 0.003 0.018 COMMITTED - PRODUCTIVE 0.013 0.004 0.029 PLEASE - MAKE 0.168 0.141 0.195 CONTEMPORARY - FLASH 0.011 0.021 0.013 PREFER - PUEDE 0.010 0.003 0.0001 CONVENIENTLY - INDUSTRIES 0.003 0.011 0.009 PRIVACY - FOUND 0.126 0.136 0.053 COPYRIGHT - AN 0.218 0.290 0.209 PROSECUTION - MAXIMIZE 0.002 0.003 0.006 CREDIT - CARD 0.046 0.041 0.285 RECENTLY - INT 0.028 0.007 0.014 DE - WEB 0.117 0.194 0.091 REPLY - ACHIEVE 0.013 0.012 0.023 DO - GOOD 0.174 0.102 0.276 RESERVED - BEEN 0.172 0.141 0.108 EARTH - GROUPS 0.021 0.035 0.056 RIGHTS - FIND 0.187 0.144 0.166 EXPRESSED - FRUSTRATED 0.010 0.002 0.024 RIGHTS - RESERVED 0.187 0.172 0.877 FIND - HAS 0.144 0.228 0.214 SCENE - ABOUT 0.012 0.301 0.029 FIND - SITE 0.144 0.275 0.212 SEE - ALSO 0.138 0.166 0.291 FIXED - SPECIFIC 0.011 0.039 0.054 SEIZE - ANYTHING 0.0007 0.037 0.012 FLIGHT - TRANSPORTATION 0.011 0.018 0.040 SHOULDERS - GORGEOUS 0.003 0.004 0.028 FOUND - DE 0.136 0.117 0.039 SICK - FELL 0.008 0.008 0.085 FRANCISCO - SAN 0.025 0.049 0.476 SITE - CELLULAR 0.275 0.006 0.010 GOOD - BACK 0.102 0.160 0.220 SOLD - LIVE 0.018 0.064 0.055 GROUPS - ORDERED 0.035 0.011 0.034 SOLO - CLAIMS 0.010 0.012 0.007 HAPPY - CONCEPT 0.029 0.013 0.054 SOON - ADVANCE 0.040 0.017 0.057 HAVE - FIRST 0.267 0.151 0.320 SPECIALIZES - ACTUAL 0.003 0.017 0.008 HAVE - US 0.267 0.284 0.349 STATE - OF 0.101 0.570 0.165 HILL - ASSURED 0.020 0.004 0.011 STATES - UNITED 0.061 0.062 0.591 HOME - SYNTHESIS 0.365 0.002 0.003 TATTOO - JEANS 0.002 0.004 0.035 HONG - KONG 0.014 0.014 0.925 THAT - ALSO 0.301 0.166 0.376 HOSTED - DRUGS 0.016 0.013 0.013 THIS - CITY 0.423 0.123 0.132 INTERVIEWS - FOURTH 0.012 0.011 0.031 THEIR - SUPPORT 0.165 0.117 0.189 KANSAS - PROPERTY 0.017 0.045 0.052 THEIR - VIEW 0.165 0.103 
0.151 KIRIBATI - GAMBIA 0.003 0.003 0.712 THEM - OF 0.112 0.570 0.187 LAST - THIS 0.135 0.423 0.221 THEN - NEW 0.104 0.291 0.192 LEAST - ROMANCE 0.046 0.007 0.019 THINKS - LOT 0.007 0.040 0.079 LIME - REGISTERED 0.002 0.030 0.004 TIME - OUT 0.189 0.191 0.366 LINKS - TAKE 0.191 0.105 0.134 TIME - WELL 0.189 0.120 0.299 LINKS - THAN 0.191 0.125 0.141 TOP - AS 0.140 0.280 0.217 MAIL - AND 0.160 0.554 0.192 TOP - COPYRIGHT 0.140 0.218 0.149 MAIL - BACK 0.160 0.160 0.132 TOP - NEWS 0.140 0.168 0.192 MAKE - LIKE 0.141 0.140 0.297 UP - AND 0.200 0.554 0.334 MANAGING - LOCK 0.010 0.006 0.010 UP - HAS 0.200 0.228 0.312 MANY - US 0.116 0.284 0.210 US - BE 0.284 0.323 0.335 MASS - DREAM 0.016 0.017 0.048 VIEW - IN 0.103 0.540 0.153 MAY - HELP 0.184 0.156 0.206 VIEW - PEOPLE 0.103 0.121 0.138 MOST - HOME 0.141 0.365 0.207 WALKED - ANTIVIRUS 0.006 0.002 0.002 NAME - IN 0.144 0.540 0.207 WEB - GO 0.194 0.111 0.138 NEITHER - FIGURE 0.011 0.016 0.085 WELL - INFO 0.120 0.138 0.110 NET - SO 0.101 0.154 0.112 WELL - NEWS 0.120 0.168 0.161 NEW - PLEASE 0.291 0.168 0.205 WEEKS - LONDON 0.028 0.032 0.050 C-Min Hash: Improving Minwise Hashing with Circulant Permutation 100 101 102 103 104 WELL - INFO 2 Perm 1 Perm 100 101 102 103 104 ABOUT - INTO 2 Perm 1 Perm 100 101 102 103 104 ABOUT - LIKE 2 Perm 1 Perm 100 101 102 103 104 WELL - NEWS 2 Perm 1 Perm 100 101 102 103 104 ACTUAL - DEVELOPED 2 Perm 1 Perm 100 101 102 103 104 ACTUAL - GRABBED 2 Perm 1 Perm 100 101 102 103 104 2 Perm 1 Perm 100 101 102 103 104 AND - PROBLEM 2 Perm 1 Perm 100 101 102 103 104 2 Perm 1 Perm 100 101 102 103 104 2 Perm 1 Perm 100 101 102 103 104 2 Perm 1 Perm 100 101 102 103 104 2 Perm 1 Perm 100 101 102 103 104 BRAZIL - OH 2 Perm 1 Perm 100 101 102 103 104 2 Perm 1 Perm 100 101 102 103 104 CALLED - BUSINESSES 2 Perm 1 Perm Figure 8. Empirical MSEs of C-Min Hash-(π, π) ( 1 Perm , red, solid) vs. C-Min Hash-(σ, π) ( 2 Perm , blue, dashed) on various data pairs from the Words dataset. We also report the empirical bias2 for C-Min Hash-(π, π) to show that the bias is so small that it can be safely neglected. The empirical MSE curves for both estimators essentially overlap for all data pairs. C-Min Hash: Improving Minwise Hashing with Circulant Permutation Michael Bendersky and W. Bruce Croft. Finding text reuse on the web. In Proceedings of the Second International Conference on Web Search and Web Data Mining (WSDM), pages 262 271, Barcelona, Spain, 2009. Andrei Z. Broder. On the resemblance and containment of documents. In Proceedings of the Conference on Compression and Complexity of SEQUENCES, pages 21 29, Positano, Amalfitan Coast, Salerno, Italy, 1997. Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. Syntactic clustering of the web. Comput. Networks, 29(8-13):1157 1166, 1997. Andrei Z. Broder, Moses Charikar, Alan M. Frieze, and Michael Mitzenmacher. Min-wise independent permutations. In Proceedings of the Thirtieth Annual ACM Symposium on the Theory of Computing (STOC), pages 327 336, Dallas, TX, 1998. Gregory Buehrer and Kumar Chellapilla. A scalable pattern mining approach to web graph compression with communities. In Proceedings of the International Conference on Web Search and Web Data Mining (WSDM), pages 95 106, Stanford, CA, 2008. Moses S. Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings on 34th Annual ACM Symposium on Theory of Computing (STOC), pages 380 388, Montreal, Canada, 2002. 
Flavio Chierichetti, Ravi Kumar, Silvio Lattanzi, Michael Mitzenmacher, Alessandro Panconesi, and Prabhakar Raghavan. On compressing social networks. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 219 228, Paris, France, 2009. Ondrej Chum and Jiri Matas. Fast computation of min-hash signatures for image collections. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3077 3084, 2012. Abhinandan Das, Mayur Datar, Ashutosh Garg, and Shyamsundar Rajaram. Google news personalization: scalable online collaborative filtering. In Proceedings of the 16th International Conference on World Wide Web (WWW), pages 271 280, Banff, Alberta, Canada, 2007. Fan Deng, Stefan Siersdorfer, and Sergej Zerr. Efficient jaccard-based diversity analysis of large document collections. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management (CIKM), pages 1402 1411, Maui, HI, 2012. Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci. edu/ml. Weiqi Feng and Dong Deng. Allign: Aligning all-pair near-duplicate passages in long texts. In Proceedings of the International Conference on Management of Data (SIGMOD), pages 541 553, Virtual Event, China, 2021. Dennis Fetterly, Mark Manasse, Marc Najork, and Janet L. Wiener. A large-scale study of the evolution of web pages. In Proceedings of the Twelfth International World Wide Web Conference (WWW), pages 669 678, Budapest, Hungary, 2003. Michael Gamon, Sumit Basu, Dmitriy Belenko, Danyel Fisher, Matthew Hurst, and Christian König. BLEWS: using blogs to provide context for news articles. In Proceedings of the Second International Conference on Weblogs and Social Media (ICWSM), Seattle, WA, 2008. Derek Greene and Padraig Cunningham. Practical solutions to the problem of diagonal dominance in kernel document clustering. In Proceedings of the Twenty-Third International Conference on Machine Learning (ICML), pages 377 384, Pittsburgh, PA, 2006. Kaiming He, Fang Wen, and Jian Sun. K-means hashing: An affinity-preserving quantization method for learning binary compact codes. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2938 2945, Portland, OR, 2013. Monika Rauch Henzinger. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 284 291, Seattle, WA, 2006. Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on the Theory of Computing (STOC), pages 604 613, Dallas, TX, 1998. Sergey Ioffe. Improved consistent sampling, weighted minhash and L1 sketching. In Proceedings of the 10th IEEE International Conference on Data Mining (ICDM), pages 246 255, Sydney, Australia, 2010. Peng Jia, Pinghui Wang, Junzhou Zhao, Shuo Zhang, Yiyan Qi, Min Hu, Chao Deng, and Xiaohong Guan. Bidirectionally densifying LSH sketches with empty bins. In Proceedings of the International Conference on Management of Data (SIGMOD), pages 830 842, Virtual, 2021. Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009. C-Min Hash: Improving Minwise Hashing with Circulant Permutation Yann Le Cun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 
Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998. David C. Lee, Qifa Ke, and Michael Isard. Partition minhash for partial duplicate image discovery. In Proceedings of the 11th European Conference on Computer Vision (ECCV), Part I, pages 648 662, Heraklion, Greece, 2010. Jakub Lemiesz. On the algebra of data sketches. Proc. VLDB Endow., 14(9):1655 1667, 2021. Ping Li and Kenneth Ward Church. Using sketches to estimate associations. In Proceedings of the Conference on Human Language Technology and the Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 708 715, Vancouver, Canada, 2005. Ping Li and Arnd Christian König. Theory and applications of b-bit minwise hashing. Commun. ACM, 54(8):101 109, 2011. Ping Li and Cun-Hui Zhang. Theory of the GMM kernel. In Proceedings of the 26th International Conference on World Wide Web (WWW), pages 1053 1062, Perth, Australia, 2017. Ping Li and Weijie Zhao. GCWSNet: Generalized consistent weighted sampling for scalable and accurate training of neural networks. ar Xiv preprint ar Xiv:2201.02283, 2022. Ping Li, Anshumali Shrivastava, Joshua Moore, and Arnd Christian König. Hashing algorithms for large-scale learning. In Advances in Neural Information Processing Systems (NIPS), pages 2672 2680, Granada, Spain, 2011. Ping Li, Anshumali Shrivastava, and Arnd Christian König. GPU-based minwise hashing. In Proceedings of the 21st World Wide Web Conference (WWW), pages 565 566, Lyon, France, 2012. Ping Li, Xiaoyun Li, Gennady Samorodnitsky, and Weijie Zhao. Consistent sampling through extremal process. In Proceeding of the Web Conference (WWW), pages 1317 1327, Virtual Event / Ljubljana, Slovenia, 2021. Mark Manasse, Frank Mc Sherry, and Kunal Talwar. Consistent weighted sampling. Technical Report MSR-TR2010-73, Microsoft Research, 2010. Marc Najork, Sreenivas Gollapudi, and Rina Panigrahy. Less is more: sampling the neighborhood graph makes salsa better and faster. In Proceedings of the Second International Conference on Web Search and Web Data Mining (WSDM), pages 242 251, Barcelona, Spain, 2009. Fatemeh Nargesian, Erkang Zhu, Ken Q. Pu, and Renée J. Miller. Table union search on open data. Proc. VLDB Endow., 11(7):813 825, 2018. Brian D Ondov, Todd J Treangen, Páll Melsted, Adam B Mallonee, Nicholas H Bergman, Sergey Koren, and Adam M Phillippy. Mash: fast genome and metagenome distance estimation using minhash. Genome biology, 17 (1):1 14, 2016. Sandeep Pandey, Andrei Broder, Flavio Chierichetti, Vanja Josifovski, Ravi Kumar, and Sergei Vassilvitskii. Nearestneighbor caching for content-match applications. In Proceedings of the 18th International Conference on World Wide Web (WWW), pages 441 450, Madrid, Spain, 2009. Anshumali Shrivastava and Ping Li. Fast near neighbor search in high-dimensional binary data. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML-PKDD), pages 474 489, Bristol, UK, 2012. Anshumali Shrivastava and Ping Li. In defense of minhash over simhash. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics (AISTATS), pages 886 894, Reykjavik, Iceland, 2014. Acar Tamersoy, Kevin A. Roundy, and Duen Horng Chau. Guilt by association: large scale malware detection by mining file-relation graphs. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 1524 1533, New York, NY, 2014. 
Tom Tseng, Laxman Dhulipala, and Julian Shun. Parallel index-based structural graph clustering and its approximation. In Proceedings of the International Conference on Management of Data (SIGMOD), pages 1851 1864, Virtual Event, China, 2021. Pinghui Wang, Yiyan Qi, Yuanming Zhang, Qiaozhu Zhai, Chenxu Wang, John C. S. Lui, and Xiaohong Guan. A memory-efficient sketch method for estimating high similarities in streaming sets. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), pages 25 33, 2019. Felix X. Yu, Aditya Bhaskara, Sanjiv Kumar, Yunchao Gong, and Shih-Fu Chang. On binary embedding using circulant matrices. J. Mach. Learn. Res., 18:150:1 150:30, 2017. Juan Zamora, Marcelo Mendoza, and Héctor Allende. Hashing-based clustering in high dimensional data. Expert Syst. Appl., 62:202 211, 2016. Erkang Zhu, Ken Q. Pu, Fatemeh Nargesian, and Renée J. Miller. Interactive navigation of open data linkages. Proc. VLDB Endow., 10(12):1837 1840, 2017. C-Min Hash: Improving Minwise Hashing with Circulant Permutation A. Proofs of Technical Results We first recall the notations and definitions that will be used in our proof. Notations. In our analysis, for simplicity we will use 1s to denote 1{hs(v) = hs(w)} for 1 s K, where h is the hash value. Given two data vectors v, w {0, 1}D. Recall in (6) that a = PD i=1 1{vi = 1 and wi = 1}, f = PD i=1 1{vi = 1 or wi = 1}. Thus, the Jaccard similarity is J = a/f. We also define J = (a 1)/(f 1). Definition A.1. Let v, w {0, 1}D. Define the location vector as x {O, , }D, with xi being O , , when vi = wi = 1, vi + wi = 1 and vi = wi = 0, respectively. Definition A.2. For A, B {O, , }, let {(i, j) : (A, B)| } denote a pair of indices {(i, j) : (xi, xj) = (A, B), j i = }. Define L0( ) = {(i, j) : (O, O)| }, L1( ) = {(i, j) : (O, )| }, L2( ) = {(i, j) : (O, )| }, G0( ) = {(i, j) : ( , O)| }, G1( ) = {(i, j) : ( , )| } , G2( ) = {(i, j) : ( , )| }, H0( ) = {(i, j) : ( , O)| }, H1( ) = {(i, j) : ( , )| }, H2( ) = {(i, j) : ( , )| }. Remark A.3. For the ease of notation, by circulation we write xj = xj D when D < j < 2D. One can easily verify that given fixed a, f, D, it holds that for 1 K 1, |L0( )| + |L1( )| + |L2( )| = |L0( )| + |G0( )| + |H0( )| = a, |G0( )| + |G1( )| + |G2( )| = |L2( )| + |G2( )| + |H2( )| = D f, |H0( )| + |H1( )| + |H2( )| = |L1( )| + |G1( )| + |H1( )| = f a. (11) We will refer this as the intrinsic constraints on the sizes of above sets. A.1. Proof of Lemma 2.4 Lemma 2.4. For any 1 s < t K with t s = , we have Eπ 1{hs(v) = hs(w)}1{ht(v) = ht(w)} = |L0( )| + (|G0( )| + |L2( )|)J f + |G0( )| + |G1( )| , where the sets are defined in Definition 2.2 and hs, ht are the hash values output by Algorithm 2. Proof. To check whether a hash sample generated by C-Min Hash collides (under some permutation π), it suffices to look at the permuted location vector x. If a collision happens if only if the first type O point appears before the first point after being permuted by π. That said, the minimal permutation index of O elements must be smaller than that of elements. If the hash sample does not collide, then the first must appear before the first O . Note that points does not affect the collision. This observation will be the key to our analysis. To compute the variance of the C-Min Hash-(0, π) estimator, it suffices to compute E[1s1t]. Let L, G and H denote the union of L s, G s and H s, respectively. 
In the following, we say that an index i belongs to a set if i is the first term of an element in that set, e.g., {1} belongs to L if x1 = O. We have |L| = a, |H| = f a, |G| = D f. One key observation is that, for a pair (i, j) with |j i| = = t s in above sets, the hash index πs(i) will be the hash index of πt(j). We begin by decomposing the expectation of interest into E[1s1t] = P[collision s, collision t] i s L P[collision s at i s, collision t] i s Lp P[collision s at i s, collision t]. (12) C-Min Hash: Improving Minwise Hashing with Circulant Permutation where i s is the location of the original O in vector x that collides for the s-th hash sample. Note that it is different from the exact location of collision in x(πs). Also, the permutation is totally random, so the location of collision is independent of 1s, and uniformly distributed among all type O pairs. We consider the following cases. 1) When i s L0. In this case, the minimum index of the type O pair in x(πs), πs(i s), is shifted to another type O pair in x(πt). Therefore, the indices of pairs with the first element being O or originally in x(πs) will still be greater than πt(i s). If sample s collides at i s, hash sample t will collide when 1. All the points in G1, after permutation πs, is greater than πs(i s). In this case, regardless of the permuted G0, hash t will always collide. 2. There exist points in G1 after permutation πs that are smaller than πs(i s), and also points in G0 that is smaller than the minimum of permuted G1. Consequently, we have for i s L0, P[collision s at i s, collision t] = P[πs(i s) < πs(i), i H L/i s, and min j G1 πs(j) > πs(i s)] + P[πs(i s) < πs(i), i H L/i s, and min j G0 πs(j) < min j G1 πs(j) < πs(i s)] a a f + |G1| + |G0| f + |G0| + |G1| |G1| f + |G1| a = 1 f + |G1| + |G0| |G1| (f + |G0| + |G1|)(f + |G1|)f . (13) This probability holds for i s L0. 2) When i s L1. Similarly, we consider the condition where i s L1, and both hash samples collide. In this case, πs(i s) would be shifted to a pair in x(πt). That is, the indices of pairs with the first element being O or originally in x(πs) will all be greater than πs(i s), which now is the location of a pair in x(πt). Thus, to make hash t collide, we need: At least one point from G0 is smaller than any other points in H L G1 after permutation πs. Therefore, for any i s L1, P[collision s at i s, collision t] = P[πs(i s) < πs(i), i H L/i s, and min j G0 πs(j) < min{πs(i s), min j G1 πs(j)}] = |G0| f + |G0| + |G1| a = |G0| (f + |G0| + |G1|)f , (14) which is true for i s L1. 3) When i s L2. In this scenario, πs(i s) would be shifted to a pair in x(πt). Therefore, if hash s collides, hash t will also collide when: After applying πs, the minimum of L0 H0 G0 is smaller than the minimum of L1 H1 G1. Thus, we obtain that for any i s L2, P[collision s at i s, collision t] = P[πs(i s) < πs(i), i H L/i s, and min j L0 G0 H0 πs(j) < min j L1 G1 H1 πs(j)] C-Min Hash: Improving Minwise Hashing with Circulant Permutation Let 1s,i s denote the event {πs(i s) < πs(i), i H L/i s}. Then Ωcan be separated into the following cases: 1. Ω1: 1s,i s, minj L0 H0 πs(j) < minj L1 H1 πs(j), and minj L0 H0 πs(j) < minj G1 πs(j). 2. Ω2: 1s,i s, minj L0 H0 πs(j) < minj L1 H1 πs(j), and minj L0 H0 πs(j) > minj G1 πs(j) > minj G0 πs(j) > πs(i s). 3. Ω3: 1s,i s, minj L0 H0 πs(j) < minj L1 H1 πs(j), and minj L0 H0 πs(j) > minj G1 πs(j) > πs(i s) > minj G0 πs(j). 4. Ω4: 1s,i s, minj L0 H0 πs(j) < minj L1 H1 πs(j), and minj L0 H0 πs(j) > πs(i s) > minj G1 πs(j) > minj G0 πs(j). 5. 
Ω5: 1s,i s, minj L0 H0 πs(j) > minj L1 H1 πs(j), and πs(i s) < minj G0 πs(j) < minj L1 H1 G1 πs(j). 6. Ω6: 1s,i s, minj L0 H0 πs(j) > minj L1 H1 πs(j), and minj G0 πs(j) < πs(i s) < minj L1 H1 G1 πs(j). We can compute the probability of each event as a a f + |G1| |L0| + |H0| |L0| + |H0| + |L1| + |H1| + |G1|, = a |G0| (f |G0|)(f + |G1|), a a f + |G0| + |G1| |G0| |G0| + |G1| + |L0| + |H0| + |L1| + |H1| |G1| |L0| + |H0| + |L1| + |H1| + |G1| |L0| + |H0| |L0| + |H0| + |L1| + |H1| = 1 f + |G0| + |G1| |G0| f |G1| f |G0| a |G0| f |G0| |G1| = |G0| |G1| (a |G0|) (f + |G0| + |G1|)(f |G0|)(f |G0| |G1|)f , P[Ω3] = |G0| f + |G0| + |G1| 1 f + |G1| |G1| |L0| + |H0| + |L1| + |H1| + |G1| |L0| + |H0| |L0| + |H0| + |L1| + |H1| = |G0| |G1| (a |G0|) (f + |G0| + |G1|)(f + |G1|)(f |G0|)(f |G0| |G1|), P[Ω4] = |G0| f + |G0| + |G1| |G1| f + |G1| 1 f |L0| + |H0| |L0| + |H0| + |L1| + |H1| = |G0| |G1| (a |G0|) (f + |G0| + |G1|)(f + |G1|)(f |G0| |G1|)f , P[Ω5] = 1 f + |G0| + |G1| |G0| |G0| + |G1| + |L0| + |H0| + |L1| + |H1| |L1| + |H1| |L0| + |H0| + |L1| + |H1| = |G0| (f a |G1|) (f + |G0| + |G1|)(f |G0| |G1|)f , P[Ω6] = |G0| f + |G0| + |G1| 1 f |L1| + |H1| |L0| + |H0| + |L1| + |H1| = |G0| (f a |G1|) (f + |G0| + |G1|)(f |G0| |G1|)f . P[Ω2] + P[Ω3] + P[Ω4] = |G0| |G1| (a |G0|) (f + |G0| + |G1|)(f |G0| |G1|) 1 (f |G0|)f + 1 (f |G0|)(f + |G1|) + 1 f(f + |G1|) = |G0| |G1| (a |G0|)(3f |G0| + |G1|) (f + |G0| + |G1|)(f |G0| |G1|)f(f |G0|)(f + |G1|). C-Min Hash: Improving Minwise Hashing with Circulant Permutation Summing up all the terms together, we obtain P[Ω] as n=1 P[Ωn] = f(f + |G0| + |G1|)(f |G0| |G1|)(a |G0|) + |G0||G1|(a |G0|)(3f |G0| + |G1|) (f + |G0| + |G1|)(f |G0| |G1|)(f |G0|)(f + |G1|)f + 2|G0|(f a |G1|)(f |G0|)(f + |G1|) (f + |G0| + |G1|)(f |G0| |G1|)(f |G0|)(f + |G1|)f = (a |G0|)(f + |G0| |G1|)(f |G0|)(f + |G1|) + 2|G0|(f a |G1|)(f |G0|)(f + |G1|) (f + |G0| + |G1|)(f |G0| |G1|)(f |G0|)(f + |G1|)f = (a + |G0|)(f |G0| |G1|)(f |G0|)(f + |G1|) (f + |G0| + |G1|)(f |G0| |G1|)(f |G0|)(f + |G1|)f = a + |G0| (f + |G0| + |G1|)f , (15) which holds for i s L2. Now combining (13), (14), (15) with (12), we obtain E[1s1t] = |L0| f + |G1| + |G0||G1||L0| (f + |G0| + |G1|)(f + |G1|)f + |G0||L1| (f + |G0| + |G1|)f + (a + |G0|)|L2| (f + |G0| + |G1|)f . (16) Here, recall that the sets are associated with all 1 s < t K such that = t s. Using the intrinsic constraints (11), after some calculation we can simplify (16) as Eπ[1s1t] = |L0| f + |G0| + |G1| + a(|G0| + |L2|) (f + |G0| + |G1|)f = |L0( )| + (|G0( )| + |L2( )|)J f + |G0( )| + |G1( )| , which completes the proof. A.2. Proof of Theorem 2.5 Theorem 2.5. Under the same setting as in Lemma 2.4, the variance of ˆJ0,π is V ar[ ˆJ0,π] = J K + 2 PK s=2(s 1)ΘK s+1 where Θ Eπ 1{hs(v) = hs(w)}1{ht(v) = ht(w)} as in Lemma 2.4 with any t s = . Proof. By the expansion of variance formula, since E[12 s] = E[1s] = J, we have V ar[ ˆJ0,π] = J PK s=1 PK t =s E[1s1t] K2 J2. (17) Note here that for t > s, the t-th hash sample uses πt as the permutation, which is shifted rightwards by = t s from πs. Thus, we have E[1s1t] = E[1s i1t i] for 0 < i < s t, which implies E[1s1t] = E[111t s+1], s < t. Since by assumption K D, we have t =s E[1s1t] = 2E (1112 + 1113 + ... + 111K) + (1213 + ... + 121K) + ... + 1K 11K = 2E (1112 + 1113 + ... + 111K) + (1112 + ... + 111K 1) + ... + 1112 s=2 (s 1)E[111K s+2] s=2 (s 1)ΘK s+1. (18) Finally, integrating (17), (18) and Lemma 2.4 completes the proof. C-Min Hash: Improving Minwise Hashing with Circulant Permutation A.3. Proof of Theorem 3.1 Theorem 3.1. 
Let a, f be defined as in (6). When 0 < a < f D (J / {0, 1}), we have V ar[ ˆJσ,π] = J K + (K 1) E where l = max(0, D 2f + a), and {l0,l2,g0,g1} ( l0 f + g0 + g1 + a(g0 + l2) (f + g0 + g1)f D a 1 D f 1 f a 1 D f s 1 s n1 D f s n2 D f s n3 f a (D f s) n4 a 1 a l1 l2 The feasible set {l0, l2, g0, g1} satisfies the intrinsic constraints (7), and n1 = g0 (D f s g1), n2 = D f s g1, n3 = l2 g0 + (D f s g1), n4 = l1 (D f s g1). Also, when a = 0 or f = a (J = 0 or J = 1), we have V ar[ ˆJσ,π] = 0. Proof. Similar to the proof of Theorem 2.5, we denote Θ = Eσ,π[1s1t] with |t s| = . Note that now the expectation is taken w.r.t. both two independent permutations σ and π. Since σ is random, we know that Θ1 = Θ2 = = ΘK 1. Then by the variance formula, we have V ar[ ˆJσ,π] = J2 Hence, it suffices to consider Θ1. In this proof, we will set = 1 and drop the notation for conciseness, and denote E = Θ1 from now on. First, we note that Lemma 2.4 gives the desired quantity conditional on σ. By the law of total probability, we have |L0| f + |G0| + |G1| + a(|G0| + |L2|) (f + |G0| + |G1|)f where the sizes of sets are random depending on the initial permutation σ (i.e. counted after permuting by σ). As a result, the problem turns into deriving the distribution of |L0|, |L1|, |L2|, |G0| and |G1| under random permutation σ, and then taking expectation of (22) with respect to this additional randomness. When a = 0, we know that |L0| = |L2| = |G0| = 0, hence the expectation E is trivially 0. Thus, the V ar[ ˆJσ,π] = 0. When f = a, |G1| = 0, and the constraint on the sets becomes |L0| + |G0| = |L0| + |L2| = f, |L2| + |G2| = |G0| + |G2| = D f. Then (22) becomes |L0| f + |G0| + |G0| + |L2| |L0| + |G0| + |L2| Therefore, when f = a, we also have V ar[ ˆJσ,π] = 0. C-Min Hash: Improving Minwise Hashing with Circulant Permutation Next, we will consider the general case where 0 < a < f D. This can be considered as a combinatorial problem where we randomly arrange a type O , (f a) type and (D f) type points in a circle. We are interested in the distribution of the number of {O, O}, {O, }, {O, }, { , O} and { , } pairs of consecutive points in clockwise direction. We consider this procedure in two steps, where we first place and points, and then place O points. Step 1. Randomly place and points on the circle. In this step, four types of pairs may appear: { , }, { , }, { , } and { , }. Denote C1, C2, C3 and C4 as the collections of above pairs. Since |C1| + |C4| = |C1| + |C2| = D f, |C2| + |C3| = |C2| + |C4| = f a, knowing the size of one set gives information on the size of all the sets. Thus, we can characterize the joint distribution by analyzing the distribution of |C1|. First, placing (D f) points on a circle leads to (D f) number of { , } pairs. This (D f) elements can be regarded as the borders that split the circle into (D f) bins. Now, we randomly throw (f a) number of points into these bins. If at least one falls into one bin, then the number of { , } pairs (|C1|) would reduce by 1, while |C2| and |C4| would increase by 1. If z points fall into one bin, then the number of { , } (|C3|) would increase by (z 1). Notice that since s D f and D f s f a, we have max(0, D 2f +a) s D f. Consequently, for s in this range, we have P n |C1| = s o = P n |C1| = s, |C3| = f a (D f s) o D f D f s f a 1 D f s 1 D a 1 D f 1 D f s f a 1 D f s 1 D a 1 D f 1 . (23) The second line is due to the stars and bars problem that the number of ways to place n unlabeled balls in m distinct bins such that each bin has at least one ball is n 1 m 1 . 
For |C1| = s, we need n = f a (number of ) and m = |C2| = D f s. Moreover, the number of ways to place n balls in m distinct bins is n+m 1 m 1 . When counting the total number of possibilities, we have n = f a, m = D f. This gives the denominator. Note that (23) is a hyper-geometric distribution. Step 2. Randomly place O points on the circle. We have the probability mass function P[Ψ] P n |L1| = l1, |L2| = l2, |G0| = g0, |G1| = g1 o s=D 2f+a P n |L1| = l1, |L2| = l2, |G0| = g0, |G1| = g1 |C1| = s o P n |C1| = s o . (24) It remains to compute the distribution conditional on |C1|. Here we drop |L0| since it is intrinsically determined by |L1|, |L2|. Again, given a placement of all and points, each consecutive pair can be regarded as a distinct bin. The problem is hence to randomly throw a type O points into that (D a) bins, given that we have placed type and points on the circle with |C1| = s (thus |C2| = |C3| = D f s and |C4| = f a (D f s) are also determined correspondingly). In the following, we count the number of O points that fall in Ci, i = 1 to 4, to make the event Ψ happen. Note that When at least one O point falls into C1 (between { , }), |L2| and |G0| increase by 1. When at least one O point falls into C2 (between { , }), |L1| and |G0| increase by 1, while |G1| decreases by 1. When at least one O point falls into C3 (between { , }), |L2| increases by 1. When at least one O point falls into C4 (between { , }), |L1| increases by 1. C-Min Hash: Improving Minwise Hashing with Circulant Permutation We denote the number of bins in Ci, i = 1, 2, 3, 4 that contain at least one O point as n1, n2, n3, n4, respectively. As a result of above reasoning, in the event Ψ, we have n1 + n3 = l2, n2 + n4 = l1, n1 + n2 = g0, D f s n2 = g1. Solving the equations gives n1 = g0 (D f s g1), n2 = D f s g1, n3 = l2 g0 + (D f s g1), n4 = l1 (D f s g1). Note that P4 i=1 ni = l1 + l2. Therefore, event Ψ is equivalent to randomly picking n1, n2, n3 and n4 bins in C1,...,C4, and then allocate a type O points in these (l1 + l2) bins such that each bin contains at least one O . Hence, we obtain P n |L1| = l1, |L2| = l2, |G0| = g0, |G1| = g1 |C1| = s o = s n1 D f s n2 D f s n3 f a (D f s) n4 a 1 l1+l2 1 s n1 D f s n2 D f s n3 f a (D f s) n4 a 1 a l1 l2 D 1 a , (25) which is also a multi-variate hyper-geometric distribution. Now combining (23), (24) and (25), we obtain the joint distribution of |L0|, |L1|, |L2|, |G0| and |G1| as P n |L1| = l1, |L2| = l2, |G0| = g0, |G1| = g1 o s=max(0,D 2f+a) s n1 D f s n2 D f s n3 f a (D f s) n4 a 1 a l1 l2 D f s f a 1 D f s 1 D a 1 D f 1 . (26) Now let Ξ be the feasible set of (l0, l1, g0, g1, g2) that satisfies the intrinsic constraints (11). The desired expectation w.r.t. both π and σ can thus be written as l0 f + g0 + g1 + a(g0 + l2) (f + g0 + g1)f s=max(0,D 2f+a) s n1 D f s n2 D f s n3 f a (D f s) n4 a 1 a l1 l2 D f s f a 1 D f s 1 D a 1 D f 1 The desired result then follows from (21). A.4. Proof of Proposition 3.2 Proposition 3.2 (Symmetry). V ar[ ˆJσ,π] is the same for the (D, f, a)-data pair and the (D, f, f a)-data pair, for 0 a f D. Proof. For fixed a, f, D, let E1 be the expectation defined in Theorem 3.1 for (v1, w1), and E2 be that for (v2, w2). From Theorem 3.1 we know that E1 = E(l0,l2,g0,g1) h l0 f + g0 + g1 + a(g0 + l2) (f + g0 + g1)f where (l0, l2, g0, g1) follows the distribution of (|L0|, |L2|, |G0|, |G1|) associated with the location vector x1 of (v1, v2). For data pair (v2, w2), we can consider its location vector x2 as swapping the O and entries of x1. 
A.4. Proof of Proposition 3.2

Proposition 3.2 (Symmetry). $Var[\hat J_{\sigma,\pi}]$ is the same for the $(D, f, a)$-data pair and the $(D, f, f-a)$-data pair, for $0 \leq a \leq f \leq D$.

Proof. For fixed $a, f, D$, let $(v_1, w_1)$ be a $(D, f, a)$-data pair and $(v_2, w_2)$ a $(D, f, f-a)$-data pair. Let $\tilde E_1$ be the expectation defined in Theorem 3.1 for $(v_1, w_1)$, and $\tilde E_2$ be that for $(v_2, w_2)$. From Theorem 3.1 we know that
$$\tilde E_1 = E_{(l_0, l_2, g_0, g_1)}\left[\frac{l_0}{f+g_0+g_1} + \frac{a(g_0+l_2)}{(f+g_0+g_1)f}\right],$$
where $(l_0, l_2, g_0, g_1)$ follows the distribution of $(|L_0|, |L_2|, |G_0|, |G_1|)$ associated with the location vector $x_1$ of $(v_1, w_1)$. For the data pair $(v_2, w_2)$, we can take its location vector $x_2$ to be $x_1$ with the 'O' and '□' entries swapped. Now we denote the sizes of the corresponding sets (Definition 2.2) of $x_2$ as $l_i'$'s, $g_i'$'s, $h_i'$'s, for $i = 0, 1, 2$. Since $\sigma$ is applied before hashing, by symmetry there is a one-to-one correspondence between the two location vectors. More specifically, $l_0'$ corresponds to $h_1$, $g_0'$ corresponds to $g_1$, $g_1'$ corresponds to $g_0$, and $l_2'$ corresponds to $h_2$. Therefore, in probability we can write
$$\tilde E_2 = E_{(l_0', l_2', g_0', g_1')}\left[\frac{l_0'}{f+g_0'+g_1'} + \frac{(f-a)(g_0'+l_2')}{(f+g_0'+g_1')f}\right] = E_{(h_1, h_2, g_0, g_1)}\left[\frac{h_1}{f+g_0+g_1} + \frac{(f-a)(g_1+h_2)}{(f+g_0+g_1)f}\right].$$
Consequently, we have
$$\tilde E_1 - \tilde E_2 = E_{(l_0, l_2, h_1, h_2, g_0, g_1)}\left[\frac{l_0 - h_1}{f+g_0+g_1} + \frac{a(g_0+l_2) - (f-a)(g_1+h_2)}{(f+g_0+g_1)f}\right].$$
In the sequel, the subscript of the expectation is suppressed for conciseness. Exploiting the constraints (11), we deduce that $h_1 = (f-a) - l_1 - g_1$, $h_2 = l_0 + g_0 + l_1 + g_1 - a$ and $l_0 + l_1 = a - l_2$. Using these facts we obtain
$$\tilde E_1 - \tilde E_2 = E\left[\frac{\big(l_0 - (f-a) + l_1 + g_1\big)f + a(g_0+l_2) - (f-a)(l_0+g_0+l_1+2g_1-a)}{(f+g_0+g_1)f}\right] = E\left[\frac{(2a - f + g_1 - l_2)f + a(g_0+l_2) - (f-a)(2g_1+g_0-l_2)}{(f+g_0+g_1)f}\right] = E\left[\frac{2(f+g_0+g_1)a - (f+g_0+g_1)f}{(f+g_0+g_1)f}\right] = 2J - 1.$$
Comparing the variances of $\hat J_{\sigma,\pi}(v_1, w_1)$ and $\hat J_{\sigma,\pi}(v_2, w_2)$, and noting that the Jaccard similarity of the second pair equals $1-J$, we derive
$$Var[\hat J_{\sigma,\pi}(v_1, w_1)] - Var[\hat J_{\sigma,\pi}(v_2, w_2)] = \left(\frac{J}{K} + \frac{K-1}{K}\tilde E_1 - J^2\right) - \left(\frac{1-J}{K} + \frac{K-1}{K}\tilde E_2 - (1-J)^2\right) = -\frac{K-1}{K}(2J-1) + \frac{K-1}{K}(\tilde E_1 - \tilde E_2) = 0.$$
This completes the proof.
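Reusing the hypothetical `var_cminhash` helper from the sketch after Theorem 3.1, the symmetry can also be checked numerically for small parameters; this is our illustration, not part of the paper.

```python
# Assumes var_cminhash(D, f, a, K) from the earlier sketch is in scope.
D, f, K = 24, 10, 16
for a in range(1, f):
    v1 = var_cminhash(D, f, a, K)
    v2 = var_cminhash(D, f, f - a, K)
    assert abs(v1 - v2) < 1e-9, (a, v1, v2)   # Proposition 3.2: equal variances
print("symmetry verified for all 0 < a < f")
```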
A.5. Proof of Lemma 3.3

Lemma 3.3 (Strict Increment). Assume $a > 0$ and $f > a$ are arbitrary and fixed. Denote $\tilde E_D$ as in (20) in Theorem 3.1, with $D$ treated as a parameter. Then we have $\tilde E_{D+1} > \tilde E_D$ for $D \geq f$.

Proof. The lemma says that $\tilde E$ is monotonically increasing when we append more (zero) entries to the data vectors. Let the probability mass function (26), parameterized by $a$, $f$ and the dimensionality $D$, be $P_{a,f,D}(l_0, l_2, g_0, g_1)$. Conditional on $(l_0, l_2, g_0, g_1)$ with $D$ elements, there are several cases for the possible values $(l_0', l_2', g_0', g_1')$ when adding one '▪' point:

- $g_0' = g_0 + 1$, $l_0' = l_0$, $l_2' = l_2$, $g_1' = g_1$. This happens when the new element falls between a (□, O) pair, with probability $(l_1 + l_2 - g_0)/D$.
- $g_1' = g_1 + 1$, $l_0' = l_0$, $l_2' = l_2$, $g_0' = g_0$, when the new element falls between a (□, □) pair, with probability $(f - a - l_1 - g_1)/D$.
- $g_1' = g_1 + 1$, $l_2' = l_2 + 1$, $l_0' = l_0$, $g_0' = g_0$, when the new element falls between an (O, □) pair, with probability $l_1/D$.
- $l_0' = l_0 - 1$, $l_2' = l_2 + 1$, $g_0' = g_0 + 1$, $g_1' = g_1$, when the new element falls between an (O, O) pair. The probability of this event is $l_0/D$.
- All values unchanged, when the new '▪' falls between any other type of pair, with probability $(D - f + g_0 + g_1)/D$.

Denote $\Xi_D$ as the feasible set satisfying (11) with dimension $D \geq f$. The above reasoning builds a correspondence between $\Xi_D$ and $\Xi_{D+1}$. More precisely, we have
$$\tilde E_{D+1} = \sum_{\Xi_{D+1}} \left(\frac{l_0'}{f+g_0'+g_1'} + \frac{a(g_0'+l_2')}{(f+g_0'+g_1')f}\right) P_{a,f,D+1}(l_0', l_2', g_0', g_1')$$
$$= \sum_{\Xi_D} \Bigg[ \left(\frac{l_0}{f+g_0+g_1+1} + \frac{a(g_0+l_2+1)}{(f+g_0+g_1+1)f}\right)\frac{l_1+l_2-g_0}{D} + \left(\frac{l_0}{f+g_0+g_1+1} + \frac{a(g_0+l_2)}{(f+g_0+g_1+1)f}\right)\frac{f-a-l_1-g_1}{D} + \left(\frac{l_0}{f+g_0+g_1+1} + \frac{a(g_0+l_2+1)}{(f+g_0+g_1+1)f}\right)\frac{l_1}{D} + \left(\frac{l_0-1}{f+g_0+g_1+1} + \frac{a(g_0+l_2+2)}{(f+g_0+g_1+1)f}\right)\frac{l_0}{D} + \left(\frac{l_0}{f+g_0+g_1} + \frac{a(g_0+l_2)}{(f+g_0+g_1)f}\right)\frac{D-f+g_0+g_1}{D} \Bigg] P_{a,f,D}(l_0, l_2, g_0, g_1).$$
Therefore, collecting terms and simplifying with $l_0 + l_1 + l_2 = a$, the increment $\delta_D := \tilde E_{D+1} - \tilde E_D$ can be computed as
$$\delta_D = E\left[\frac{2af(l_1+g_1) - 2f(f-a)l_0 - 2a(f-a)(g_0+g_1)}{Df(f+g_0+g_1)(f+g_0+g_1+1)}\right] = E\left[\frac{2af(f-a-h_1) - 2f(f-a)l_0 - 2a(f-a)(g_0+g_1+f-f)}{Df(f+g_0+g_1)(f+g_0+g_1+1)}\right]$$
$$= E\left[\frac{4a(f-a)}{D(f+g_0+g_1)(f+g_0+g_1+1)}\right] - E\left[\frac{2ah_1 + 2(f-a)l_0}{D(f+g_0+g_1)(f+g_0+g_1+1)}\right] - E\left[\frac{2a(f-a)}{Df(f+g_0+g_1+1)}\right]$$
$$= 4a(f-a)E_0 - 2aE_1 - 2(f-a)E_2 - 2a(f-a)E_3, \qquad (27)$$
where
$$E_0 = E\left[\frac{1}{D(f+g_0+g_1)(f+g_0+g_1+1)}\right], \quad E_1 = E\left[\frac{h_1}{D(f+g_0+g_1)(f+g_0+g_1+1)}\right], \quad E_2 = E\left[\frac{l_0}{D(f+g_0+g_1)(f+g_0+g_1+1)}\right], \quad E_3 = E\left[\frac{1}{Df(f+g_0+g_1+1)}\right].$$
Note that here the expectations are taken w.r.t. the set-size distribution under $(a, f, D)$. Expanding the density function (26) into factorials and re-indexing with $a' = a-1$, $f' = f-1$, $D' = D-1$ and $l_0' = l_0 - 1$, we can derive
$$E_2 = \frac{a(a-1)}{D-1}\, E_{a-1,f-1,D-1}\left[\frac{1}{D(f+g_0+g_1)(f+g_0+g_1+1)}\right].$$
Here, the subscript means that we take the expectation w.r.t. the set sizes when the numbers of 'O', '□' and '▪' points are $(a-1, f-1, D-1)$. By symmetry, it can be shown similarly that
$$E_1 = \frac{(f-a)(f-a-1)}{D-1}\, E_{a,f-1,D-1}\left[\frac{1}{D(f+g_0+g_1)(f+g_0+g_1+1)}\right].$$
Since, for fixed $f$ and $D$, the distribution of $(|G_0|+|G_1|)$ does not depend on $a$ (see also the proof of Proposition 3.5), the two expectations on the right-hand sides coincide; denote their common value by $\bar E$. Substituting the above results into (27), we obtain
$$\delta_D = 2a(f-a)\left[2E_0 - \frac{f-2}{D-1}\bar E - E_3\right].$$
To compute $E_0$, note that with $a$, $f$ and $D$ fixed, the variable $g_2$ is distributed as $hyper(D-1, D-f, D-f-1)$; for $\bar E$, the distribution becomes $hyper(D-2, D-f, D-f-1)$. Since $f+g_0+g_1 = D - g_2$, both $E_0$ and $\bar E$ can be written as explicit sums of hypergeometric probabilities weighted by $\frac{1}{D(D-s)(D-s+1)}$, where $s$ runs over the support of $g_2$, i.e., $s \geq \max(0, D-2f)$. Comparing the two sums term by term, for $D \geq f$ we have
$$\frac{f-2}{D-1}\,\bar E = E\left[\frac{(f-2)\big(f-(D-f-g_2)\big)}{Df(f-1)(D-g_2)(D-g_2+1)}\right] < E\left[\frac{f-g_0-g_1}{Df(f+g_0+g_1)(f+g_0+g_1+1)}\right].$$
Consequently, noting that $E_3 = E\left[\frac{f+g_0+g_1}{Df(f+g_0+g_1)(f+g_0+g_1+1)}\right]$, we have
$$\delta_D > 2a(f-a)\, E\left[\frac{2}{D(f+g_0+g_1)(f+g_0+g_1+1)} - \frac{f-g_0-g_1}{Df(f+g_0+g_1)(f+g_0+g_1+1)} - \frac{f+g_0+g_1}{Df(f+g_0+g_1)(f+g_0+g_1+1)}\right] = 0,$$
since $2f - (f-g_0-g_1) - (f+g_0+g_1) = 0$. Note that this holds for all $D \geq f$. The proof is now complete.
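The monotone behavior in Lemma 3.3, together with the limit used in the next proof, can be illustrated with the same hypothetical helper from the earlier sketch:

```python
# Assumes tilde_E(D, f, a) from the sketch after Theorem 3.1 is in scope.
f, a = 10, 4
J = a / f
values = [tilde_E(D, f, a) for D in range(f + 1, f + 15)]
assert all(x < y for x, y in zip(values, values[1:]))   # strictly increasing in D
assert all(v < J * J for v in values)                    # and bounded above by J^2
```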
A.6. Proof of Theorem 3.4

Theorem 3.4 (Uniform Superiority). For any two binary vectors $v, w \in \{0,1\}^D$ with $J \notin \{0, 1\}$, it holds that $Var[\hat J_{\sigma,\pi}(v, w)] < Var[\hat J_{MH}(v, w)]$.

Proof. By assumption we have $0 < a < f$. To compare $Var[\hat J_{\sigma,\pi}]$ with
$$Var[\hat J_{MH}] = \frac{J(1-J)}{K} = \frac{J}{K} + \frac{K-1}{K}J^2 - J^2,$$
it suffices to compare $\tilde E$ with $J^2$.

When $D = f$, the location vector $x$ of $(v, w)$ contains no '▪' elements. It is easy to verify that in this case $|G_0| = |G_1| = |L_2| = 0$, and $|L_0|$ follows $hyper(f-1, a, a-1)$. By Theorem 3.1, it follows that when $D = f$,
$$\tilde E = \frac{E[|L_0|]}{f} = \frac{a(a-1)}{f(f-1)} = J \cdot \frac{a-1}{f-1} < J^2,$$
since $(a-1)/(f-1)$ is always smaller than $J = a/f$. On the other hand, as $D \to \infty$, we have $|L_0| \to 0$, $|L_2| \to a$, $|G_0| \to a$ and $|G_1| \to f-a$, and we can show that $\tilde E_D \to J^2$ as $D \to \infty$. By Lemma 3.3, the sequence $(\tilde E_f, \tilde E_{f+1}, \tilde E_{f+2}, \dots)$ is strictly increasing. Since it is convergent with limit $J^2$, by monotone convergence we know that $\tilde E_D < J^2$ for all $D \geq f$. This completes the proof.

A.7. Proof of Proposition 3.6

Proposition 3.5 (Consistent Improvement). Suppose $f$ is fixed. In terms of $a$, the variance ratio $\rho(a) = Var[\hat J_{MH}(v, w)] / Var[\hat J_{\sigma,\pi}(v, w)]$ is constant for any $0 < a < f$.

Proof. Let $\tilde E$ be defined as in Theorem 3.1. Assume that $D$ and $f$ are fixed and $a$ is variable. First, we can write the variance ratio explicitly as
$$\rho(a) = \frac{J(1-J)/K}{\frac{J}{K} + \frac{K-1}{K}\tilde E - J^2} = \frac{1-J}{(1-J) - (K-1)\big(J - \tilde E/J\big)}.$$
We now show that $J - \tilde E/J = C(1-J)$, where $C$ is some constant independent of $J$ (i.e., of $a$). Then, for fixed $D$ and $f$, by cancellation $\rho(a)$ is constant for all $0 < a < f$. We have
$$J - \frac{\tilde E}{J} = \frac{a}{f} - \frac{f}{a}\, E_{a,f,D}\left[\frac{l_0}{f+g_0+g_1} + \frac{a(g_0+l_2)}{(f+g_0+g_1)f}\right] = E\left[\frac{a^2(f+g_0+g_1) - f^2 l_0 - af(g_0+l_2)}{af(f+g_0+g_1)}\right] = E\left[\frac{a(a-f)(g_0+g_1) + a^2 f + afg_1 - f^2 l_0 - afl_2}{af(f+g_0+g_1)}\right] = E\left[\frac{a(a-f)(g_0+g_1) + af(l_0+l_1) + afg_1 - f^2 l_0}{af(f+g_0+g_1)}\right] = E\left[\frac{a(a-f)(g_0+g_1) + f(a-f)l_0 + af(f-a-h_1)}{af(f+g_0+g_1)}\right], \qquad (28)$$
where we use the constraints (11) that $l_0 + l_1 + l_2 = a$ and $l_1 + g_1 + h_1 = f - a$. We now study the three terms respectively. First,
$$E\left[\frac{a(a-f)(g_0+g_1)}{af(f+g_0+g_1)}\right] = -(1-J)\, E\left[\frac{g_0+g_1}{f+g_0+g_1}\right].$$
We have shown in the proof of Lemma 3.3 that
$$E_{a,f,D}\left[\frac{l_0}{f+g_0+g_1}\right] = \frac{a(a-1)}{D-1}\, E_{a-1,f-1,D-1}\left[\frac{1}{f+g_0+g_1}\right],$$
and by symmetry it holds that
$$E_{a,f,D}\left[\frac{h_1}{f+g_0+g_1}\right] = \frac{(f-a)(f-a-1)}{D-1}\, E_{a,f-1,D-1}\left[\frac{1}{f+g_0+g_1}\right].$$
Since $f$ is fixed, $(|G_0| + |G_1|)$ is distributed independently of $a$. Consequently, $E_{a-1,f-1,D-1}\big[\frac{1}{f+g_0+g_1}\big]$ and $E_{a,f-1,D-1}\big[\frac{1}{f+g_0+g_1}\big]$ are both independent of $a$; denote their common value by $\bar E'$, and write $\bar E = E_{a,f,D}\big[\frac{1}{f+g_0+g_1}\big]$, which is likewise independent of $a$. Thus, it follows that
$$E\left[\frac{f(a-f)l_0}{af(f+g_0+g_1)}\right] = -(1-J)\,\frac{f(a-1)}{D-1}\,\bar E',$$
and additionally,
$$E\left[\frac{af(f-a-h_1)}{af(f+g_0+g_1)}\right] = (1-J)f\,\bar E - (1-J)\,\frac{f(f-a-1)}{D-1}\,\bar E'.$$
Summing up the three terms and substituting back into (28), we derive $J - \tilde E/J = C(1-J)$, where
$$C = f\bar E - E\left[\frac{g_0+g_1}{f+g_0+g_1}\right] - \frac{f(f-2)}{D-1}\,\bar E',$$
which is independent of $a$. Plugging this into $\rho(a)$, we get
$$\rho(a) = \frac{1-J}{(1-J) - (K-1)C(1-J)} = \frac{1}{1 - (K-1)C},$$
which is a constant depending only on $f$, $D$ and $K$. This completes the proof.
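Proposition 3.5 can likewise be illustrated numerically with the hypothetical helpers from the sketch after Theorem 3.1: for fixed $f$, $D$ and $K$, the ratio below comes out the same for every admissible $a$.

```python
# Assumes var_cminhash and var_minhash from the earlier sketch are in scope.
D, f, K = 24, 10, 16
ratios = [var_minhash(f, a, K) / var_cminhash(D, f, a, K) for a in range(1, f)]
assert max(ratios) - min(ratios) < 1e-9      # constant ratio, Proposition 3.5
print(ratios[0])                             # the common value 1 / (1 - (K-1)C) > 1
```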
B. More Numerical Justification on C-Min Hash-(π, π)

The Words dataset (Li and Church, 2005), which is publicly available, contains a large number of word vectors, with the $i$-th entry indicating whether the word appears in the $i$-th document, for a total of $D = 2^{16}$ documents. The key statistics of the 120 selected word pairs are presented in Table 1. Those 120 pairs of words are more or less randomly selected, except that we make sure they cover a wide spectrum of data distributions. Denoting by $d$ the number of non-zero entries in a vector, Table 1 reports the density $d/D$ for each word vector, ranging from 0.0006 to 0.6. The Jaccard similarity $J$ ranges from 0.002 to 0.95. In Figures 8 - 15, we plot the empirical MSE along with the empirical bias$^2$ for $\hat J_{\pi,\pi}$, as well as the empirical MSE for $\hat J_{\sigma,\pi}$. Note that for $D$ this large, it is numerically difficult to evaluate the theoretical variance formulas.

From the results in the figures, we can observe the following:

- For all the data pairs, the empirical MSE of the C-Min Hash-(π, π) estimator overlaps with the empirical MSE of the C-Min Hash-(σ, π) estimator, for all $K$ from 1 up to 4096.
- The bias$^2$ is several orders of magnitude smaller than the MSE for all data pairs. This verifies that the bias of $\hat J_{\pi,\pi}$ is extremely small in practice and can be safely neglected.

We have many more plots on more data pairs. Nevertheless, we believe the current set of experiments on the Words dataset is sufficient to verify that the proposed C-Min Hash-(π, π) gives essentially indistinguishable Jaccard estimation accuracy in practice compared with C-Min Hash-(σ, π).
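For readers who wish to reproduce this type of comparison on their own data, the following Python sketch outlines the Monte Carlo protocol (empirical MSE and bias$^2$ over repeated permutations). It is our illustration, not the authors' code: it assumes the circulant rule $h_k(v) = \min_{i: v_i \neq 0} \pi\big((i+k) \bmod D\big)$ applied after an initial permutation, with the initial permutation being an independent $\sigma$ for C-Min Hash-(σ, π) and $\pi$ itself for C-Min Hash-(π, π); it also uses a small synthetic binary pair in place of the Words vectors. All names are hypothetical.

```python
import numpy as np

def circulant_hashes(nonzeros, pi, K, D, init_perm):
    """K hash values under the assumed circulant rule: pre-permute the
    coordinates by init_perm, then h_k = min over non-zeros of pi[(i + k) % D]."""
    idx = init_perm[nonzeros]
    return np.array([np.min(pi[(idx + k) % D]) for k in range(1, K + 1)])

def empirical_mse_bias2(nz_v, nz_w, J_true, D, K, reps=2000, seed=0):
    rng = np.random.default_rng(seed)
    est_two, est_one = [], []
    for _ in range(reps):
        pi = rng.permutation(D)
        sigma = rng.permutation(D)
        # C-Min Hash-(sigma, pi): independent initial permutation
        hv = circulant_hashes(nz_v, pi, K, D, sigma)
        hw = circulant_hashes(nz_w, pi, K, D, sigma)
        est_two.append(np.mean(hv == hw))
        # C-Min Hash-(pi, pi): pi reused as the initial permutation
        hv = circulant_hashes(nz_v, pi, K, D, pi)
        hw = circulant_hashes(nz_w, pi, K, D, pi)
        est_one.append(np.mean(hv == hw))
    stats = {}
    for name, est in [("2 Perm", np.array(est_two)), ("1 Perm", np.array(est_one))]:
        stats[name] = (np.mean((est - J_true) ** 2), (np.mean(est) - J_true) ** 2)
    return stats  # {variant: (MSE, bias^2)}

if __name__ == "__main__":
    D, K = 1024, 128
    rng = np.random.default_rng(1)
    base = rng.random(D) < 0.05               # synthetic sparse binary pair
    v = base & (rng.random(D) < 0.9)
    w = base & (rng.random(D) < 0.9)
    nz_v, nz_w = np.flatnonzero(v), np.flatnonzero(w)
    J = len(np.intersect1d(nz_v, nz_w)) / len(np.union1d(nz_v, nz_w))
    print(J, empirical_mse_bias2(nz_v, nz_w, J, D, K))
```

For the Words experiments, the analogous computation simply uses $D = 2^{16}$ and the word-document incidence vectors in place of the synthetic pair.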
Figures 9-13. Empirical MSEs of C-Min Hash-(π, π) ("1 Perm", red, solid) vs. C-Min Hash-(σ, π) ("2 Perm", blue, dashed) on data pairs from the Words dataset, with panels labeled by word pair (e.g., CALORIES - MICROSOFT, CREDIT - CARD, FRANCISCO - SAN, HONG - KONG, RIGHTS - RESERVED). The empirical bias$^2$ for C-Min Hash-(π, π) is also reported to show that the bias is so small that it can be safely neglected. The empirical MSE curves for the two estimators essentially overlap for all data pairs.
Figures 14-15. Same comparison as above for additional word pairs (e.g., SEIZE - ANYTHING, SOLD - LIVE, STATES - UNITED, TATTOO - JEANS, TIME - WELL, WEEKS - LONDON); the MSE curves again overlap and the bias$^2$ remains negligible.