You Never Cluster Alone

Yuming Shen¹, Ziyi Shen², Menghan Wang³, Jie Qin⁴, Philip H.S. Torr¹, and Ling Shao⁵
¹University of Oxford  ²University College London  ³eBay  ⁴Nanjing University of Aeronautics and Astronautics  ⁵Inception Institute of Artificial Intelligence
ymcidence@gmail.com

Abstract

Recent advances in self-supervised learning with instance-level contrastive objectives facilitate unsupervised clustering. However, a standalone datum does not perceive the context of the holistic cluster and may undergo sub-optimal assignment. In this paper, we extend the mainstream contrastive learning paradigm to a cluster-level scheme, where all the data subjected to the same cluster contribute to a unified representation that encodes the context of each data group. Contrastive learning with this representation then rewards the assignment of each datum. To implement this vision, we propose twin-contrast clustering (TCC). We define a set of categorical variables as clustering assignment confidence, which links the instance-level learning track with the cluster-level one. On the one hand, with the corresponding assignment variables as weights, a weighted aggregation over the data points implements the set representation of a cluster. We further propose heuristic cluster augmentation equivalents to enable cluster-level contrastive learning. On the other hand, we derive the evidence lower bound of the instance-level contrastive objective with the assignments. By reparametrizing the assignment variables, TCC is trained end-to-end, requiring no alternating steps. Extensive experiments show that TCC outperforms the state-of-the-art on challenging benchmarks.

1 Introduction

Descending from various similarity-based [59] and feature-based [4, 56] approaches, unsupervised deep clustering jointly optimizes data representations and cluster assignments [81]. A recent fashion in this domain takes inspiration from contrastive learning in computer vision [11, 12, 24], leveraging the effectiveness and simplicity of discriminative feature learning. This strategy is experimentally reasonable, as previous research has found that the learnt representations reveal data semantics and locality [34, 80]. Even a simple migration of contrastive learning significantly improves clustering performance; examples include a two-stage clustering pipeline [73] with contrastive pre-training and k-means [56], and a composition of an InfoNCE loss [62] and a clustering loss [81] in [89]. Compared with the deep generative counterparts [16, 36, 54, 86], contrastive clustering is free from decoding and is computationally practical, with guaranteed feature quality.

However, have we been paying too much attention to the representation expressiveness of a single data point? Intuitively, a standalone data point, regardless of its feature quality, cannot tell us much about what the cluster looks like. Fig. 1 illustrates a simple analogy using the Two Moons dataset. Without any context for the crescents, it is difficult to assign a data point to either of the two clusters based on its own representation, as the point can be inside one moon but still close to the other. Accordingly, observing more data reveals more about the holistic distributions of the clusters, e.g., the shapes of the moons in Fig. 1, and thus heuristically benefits clustering.
Although we can implicitly parametrize the context of the clusters by the model itself, e.g., using a Gaussian mixture model (GMM) [4] or encoding this information in the deep model parameters, explicitly representing the context fits the most common deep learning practice. This further opens the door to learning cluster-level representations with all corresponding data points. Namely, you never cluster alone.

A part of this work was done when the author was with eBay.
35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Figure 1: The motivation behind this work. (a): When assessing a standalone data point, cluster assignment can be challenging due to the non-linearity of the feature space and the lack of context information for the data distribution. (b): Our model learns this context by representing each cluster with latent features.

In this paper, motivated by the thought experiment above, we develop a multi-granularity contrastive learning framework, which includes an instance-level granularity and a cluster-level one. The former learning track conveys the conventional functionalities of contrastive learning, i.e., learning compact data representations and preserving underlying semantics. We further introduce a set of latent variables as cluster assignments, and derive an evidence lower bound (ELBO) of the instance-level InfoNCE objective [62]. As for the cluster-level granularity, we leverage these latent variables as weights to aggregate all the corresponding data representations into the set representation [87] of a cluster. We can then apply contrastive losses to all clusters, thereafter rewarding/updating the cluster assignments. Abbreviated as twin-contrast clustering (TCC), our work delivers the following contributions:

- We develop the novel TCC model, which, for the first time, shapes and leverages a unified representation of the cluster semantics in the context of contrastive clustering.
- We define and implement the cluster-level augmentations in a batch-based training and stochastic labelling procedure, which enables on-the-fly contrastive learning on clusters.
- We achieve significant performance gains against the state-of-the-art methods on five benchmark datasets. Moreover, TCC can be trained from scratch, requiring no pre-trained models or auxiliary knowledge from other domains.

2 Preliminaries

2.1 Contrastive Learning

Contrastive learning, as the name suggests, aims to distinguish an instance from all others using embeddings, with the dot-product similarity typically being used as the measurement. Let $X = \{x_i\}_{i=1}^{N}$ be an $N$-point dataset, with the $i$-th observation $x_i \in \mathbb{R}^{d_x}$. An arbitrary transformation $f: \mathbb{R}^{d_x} \to \mathbb{R}^{d_m}$ encodes each data point into a $d_m$-dimensional vector. With the index $i$ being the identifier, the InfoNCE loss [62] discriminates $x_i$ from the others through a softmax-like likelihood:

$$\log p(i \mid x_i) = \log \frac{\exp\left(v_i^\top f(x_i)/\tau\right)}{\sum_{j=1}^{N}\exp\left(v_j^\top f(x_i)/\tau\right)}. \qquad (1)$$

A temperature hyperparameter $\tau$ controls the concentration level [26, 80]. $V = \{v_i\}_{i=1}^{N}$ refers to the vocabulary of the dataset, which is usually based on the data embeddings under different augmentations. As caching the entire $V$ is not practical for large-scale training, existing works propose surrogates of Eq. (1), e.g., replacing $V$ with a memory bank [80], queuing it with a momentum network [12, 24], or training with large batches [11].
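For concreteness, the following is a minimal NumPy sketch of the log-likelihood in Eq. (1); the query embedding, the vocabulary, and the temperature value are illustrative placeholders rather than the implementation used in this paper.

```python
import numpy as np

def info_nce_log_likelihood(f_x, vocab, i, tau=1.0):
    """log p(i | x_i) of Eq. (1): discriminate the i-th datum against the vocabulary.

    f_x   : (d_m,)   query embedding f(x_i), assumed L2-normalized
    vocab : (N, d_m) vocabulary embeddings V = {v_j}, assumed L2-normalized
    i     : index of the positive entry v_i
    tau   : temperature controlling the concentration level
    """
    logits = vocab @ f_x / tau                       # v_j^T f(x_i) / tau for every j
    m = logits.max()                                 # stabilize the log-sum-exp
    log_denominator = m + np.log(np.exp(logits - m).sum())
    return logits[i] - log_denominator               # log-softmax evaluated at the positive index

# toy usage: a 5-entry vocabulary of 8-d embeddings; the query is vocabulary entry 2 itself
rng = np.random.default_rng(0)
V = rng.normal(size=(5, 8))
V /= np.linalg.norm(V, axis=1, keepdims=True)
print(info_nce_log_likelihood(V[2], V, i=2, tau=0.5))
```

In practice the denominator is approximated by the surrogates listed above (memory bank, momentum queue, or large batches) instead of the full vocabulary.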
2.2 Deep Set Representations

To learn the representations of sets, we need to consider permutation-invariant transformations. Zaheer et al. [87] showed that all permutation-invariant functions $T(\cdot)$ applied to a set $X$ generally fall into the following form:

$$T(X) = g\left(\sum_{i=1}^{N} h(x_i)\right), \qquad (2)$$

where $g(\cdot)$ and $h(\cdot)$ are arbitrary continuous transformations.

Figure 2: The schematic of TCC. Here, we demonstrate a 2-way clustering example with three images. For simplicity, we only illustrate the instance-level learning module on the last image, while it is applied to all images.

Note that the aggregation above can be executed on a weighted basis, which is typically achieved by the attention mechanism between the set-level queries and instance-level keys/values [31, 49]. Our design of each cluster representation is partially inspired by this, since back-propagation does not support hard instance-level assignments. Next, we define our cluster representation along with the clustering procedure, and describe how it is trained with contrastive learning.

3 Twin-Contrast Clustering

We consider a $K$-way clustering problem, with $K$ being the number of clusters. Let $k \in \{1,\dots,K\}$ denote the index of the cluster that $x_i$ may belong to, and let the categorical variable $\pi_i = [\pi_i(1),\dots,\pi_i(K)]$ indicate the cluster assignment probabilities of $x_i$. Following common practice, we regard image clustering as our target for simplicity. Fig. 2 provides a schematic of TCC. The cluster-level contrast track reflects our motivation from Sec. 1, while the instance-level one learns the semantics of each image. We bridge these two tracks with the inference model $\pi_i(k) = q_\theta(k \mid x_i)$ so that both losses reward and update the assignment of each $x_i$. In particular, $q_\theta(k \mid x_i)$ is parametrized by a softmax operation:

$$\pi_i(k) = q_\theta(k \mid x_i) = \frac{\exp\left(\mu_k^\top f_\theta(x_i)\right)}{\sum_{k'=1}^{K}\exp\left(\mu_{k'}^\top f_\theta(x_i)\right)}, \qquad (3)$$

where $f_\theta(\cdot)$ is a convolutional neural network (CNN) [47, 48] built upon random data augmentations, producing $d_m$-dimensional features. We denote $\mu_\theta = \{\mu_k\}_{k=1}^{K}$ as a set of trainable cluster prototypes, where $\theta$ refers to the collection of all parameters. In Sec. 3.1, we leverage $q_\theta(k \mid x_i)$ to aggregate cluster features, and in Sec. 3.2 we derive the ELBO of Eq. (1) with $q_\theta(k \mid x_i)$. In the following, we omit the index $i$ for brevity when $x$ and $\pi$ clearly correspond to a single data point.

3.1 Representing and Augmenting the Context for Cluster-Level Contrast

Cluster-Level Representation. We implement Eq. (2) for each cluster using soft aggregation, where $q_\theta(k \mid x)$ weighs the relevance of the data to the given cluster. Denoted as $r_k$, the $k$-th cluster representation is computed by:

$$h_\theta(x;k) = \pi(k)\,f_\theta(x), \qquad r_k = T_\theta(X;k) = \frac{\sum_{i=1}^{N} h_\theta(x_i;k)}{\left\lVert\sum_{i=1}^{N} h_\theta(x_i;k)\right\rVert_2}, \qquad (4)$$

where $\lVert\cdot\rVert_2$ refers to the L2-norm. We adopt L2 normalization here for two main purposes. First, the summation of $\pi(k) = q_\theta(k \mid x)$ along $x$ is not self-normalized. Second, and more importantly, as shown in [11, 12, 24], L2-normalized features benefit contrastive learning.

The Anchored Cluster Semantics. Intuitively, $\pi(k)$ reflects the degree of relevance of a datum to the $k$-th cluster. With it being the aggregation weight, $r_k$ represents the information that is related to the corresponding prototype $\mu_k$. In other words, our design treats each $\mu_k$ as a semantic anchor that queries the incoming batch to form a representation describing a certain latent topic.
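A minimal NumPy sketch of Eqs. (3) and (4) is given below, assuming the backbone features $f_\theta(x_i)$ and the prototypes $\mu_k$ are already available as arrays; all shapes and values are illustrative, not the TensorFlow implementation of Sec. 5.2.

```python
import numpy as np

def cluster_assignments(features, prototypes):
    """Soft assignments pi_i(k) = q_theta(k | x_i) of Eq. (3).

    features   : (N, d_m) backbone outputs f_theta(x_i)
    prototypes : (K, d_m) trainable cluster prototypes mu_k
    returns    : (N, K) matrix whose rows sum to one
    """
    logits = features @ prototypes.T                  # mu_k^T f_theta(x_i)
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

def cluster_representations(features, pi):
    """Weighted, L2-normalized aggregation r_k of Eq. (4); returns a (K, d_m) matrix."""
    agg = pi.T @ features                             # sum_i pi_i(k) f_theta(x_i)
    return agg / np.linalg.norm(agg, axis=1, keepdims=True)

# toy usage: a batch of 6 features with d_m = 4 and K = 2 clusters
rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 4))
mus = rng.normal(size=(2, 4))
pi = cluster_assignments(feats, mus)
r = cluster_representations(feats, pi)
print(pi.sum(axis=1))              # every row of pi sums to 1
print(np.linalg.norm(r, axis=1))   # every r_k has unit L2 norm
```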
Cluster-Level Augmentation Equivalents. Contrastive learning is usually employed alongside random data augmentation [11, 24] to obtain positive candidates. Though defining a uniform augmentation scheme for sets is beyond the scope of this paper, the proposed model reflects cluster-level augmentation in its design through the following heuristics:

(a) Augmentation on elements. TCC implicitly inherits existing image augmentation techniques (such as cropping, color jittering, random flipping, and grayscale conversion), as they are applied within $f_\theta(\cdot)$.

(b) Irrelevant minorities. We consider injecting a small proportion of irrelevant data into a cluster representation, while keeping the main semantics of the cluster unchanged. Eq. (4) turns out to be an equivalent of this. As the softmax product $\pi(k)$ is always positive, data that are not closely related to the given cluster still contribute to the cluster's representation, which accounts for the "irrelevant" part. Meanwhile, these irrelevant data do not dominate the value of the cluster representation, because the small value of $\pi(k)$ scales down their feature magnitude during aggregation, which accounts for the "minorities" part.

(c) Subsetting. Empirically, a subset of a cluster holds the same semantics as the original. Batch-based training samples data at each step, which naturally creates subsets of each cluster.

We experimentally find that the above augmentation equivalents are sufficient for the clustering task. On the other hand, since Eq. (4) is permutation-invariant, reordering the sequence of data does not yield a valid augmentation.

Cluster-Level Contrastive Learning. We define a simple contrastive objective that preserves the identity of each cluster against the rest. Having everything in a batch, e.g., a SimCLR-like framework [11], does not allow augmentation (c) to be fully utilized in the loss, since the two augmented counterparts $r_k$ and $\hat r_k$ from a batch may form part of the same subset of a cluster. Hence, we opt for a MoCo-like solution [24], employing an $L$-sized memory queue $P = \{p_l\}_{l=1}^{L}$ to cache negative samples and a momentum network to produce $\hat r_k = T_{\hat\theta}(X;k)$. $P$ stores each cluster representation under different subsets, and training with it preserves the temporal semantic consistency [45] of the clusters. Our cluster-level objective minimizes the following negative log-likelihood (NLL):

$$L_1 = \mathbb{E}_k\left[-\log p_\theta(k \mid r_k)\right] = -\frac{1}{K}\sum_{k=1}^{K} \log \frac{\exp\left(\hat r_k^\top r_k/\tau\right)}{\exp\left(\hat r_k^\top r_k/\tau\right) + \sum_{l=1}^{L}\exp\left(p_l^\top r_k/\tau\right)\,\mathbb{1}(l \bmod K \neq k)}, \qquad (5)$$

where $\mathbb{1}(\cdot)$ is an indicator function and $\bmod$ is the modulo operator. Since the cluster number $K$ can be smaller than the queue size $L$, we exclude the features that represent the same cluster as $k$ from the negative sample collection $P$ by inserting the indicator function into the loss above.
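The cluster-level objective of Eq. (5) can be sketched as follows, under the assumption that the cluster representations, their momentum counterparts, and the queue $P$ are given as arrays, and that queue slot $l$ caches cluster $l \bmod K$; this is an illustrative NumPy transcription, not the training code described in Sec. 5.2.

```python
import numpy as np

def cluster_level_loss(r, r_hat, queue, tau=1.0):
    """Cluster-level NLL of Eq. (5).

    r     : (K, d_m) cluster representations r_k from the online network
    r_hat : (K, d_m) momentum counterparts (the positives)
    queue : (L, d_m) memory queue P, where slot l is assumed to cache cluster (l mod K)
    """
    K, L = r.shape[0], queue.shape[0]
    loss = 0.0
    for k in range(K):
        pos = np.exp(r_hat[k] @ r[k] / tau)
        mask = (np.arange(L) % K) != k                 # indicator 1(l mod K != k)
        neg = np.exp(queue[mask] @ r[k] / tau).sum()   # negatives exclude the same cluster
        loss += -np.log(pos / (pos + neg))
    return loss / K

# toy usage with K = 3 clusters, d_m = 8, and a queue of length L = 12
rng = np.random.default_rng(0)
def unit(a):
    return a / np.linalg.norm(a, axis=-1, keepdims=True)

r = unit(rng.normal(size=(3, 8)))
r_hat = unit(r + 0.05 * rng.normal(size=(3, 8)))       # slightly perturbed positives
P = unit(rng.normal(size=(12, 8)))
print(cluster_level_loss(r, r_hat, P, tau=1.0))
```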
3.2 Instance-Level Contrast with Cluster Assignments

The ELBO. We propose to reuse the inference model $q_\theta(k \mid x)$ discussed above to compute the instance-level contrastive loss, so that the clustering process can benefit from contrastive learning. Let us start from the following ELBO of $\log p_\theta(i \mid x)$ in Eq. (1):

$$\log p_\theta(i \mid x) \geq \mathbb{E}_{q_\theta(k \mid x)}\left[\log p_\theta(i \mid x, k)\right] - \mathrm{KL}\left(q_\theta(k \mid x)\,\|\,p_\theta(k \mid x)\right), \qquad (6)$$

where $\mathrm{KL}(\cdot\,\|\,\cdot)$ is the Kullback–Leibler (KL) divergence. We derive this ELBO in Appendix A. The true distribution $p_\theta(k \mid x)$ is not available under the unsupervised setting. We follow [40, 69] and use a fixed prior instead. In practice, we employ the uniform distribution, i.e., $p_\theta(k \mid x) := p_\theta(k) = 1/K$. Then, the KL term above reduces to the simple form $\mathrm{KL}\left(q_\theta(k \mid x)\,\|\,p_\theta(k \mid x)\right) = \log K - H\left(q_\theta(k \mid x)\right)$. Empirically, this encourages an evenly distributed cluster assignment across the dataset.

Regarding the expectation term in Eq. (6), back-propagation through the discrete entry $k$ is not feasible. We resort to the Gumbel softmax trick [32, 57] as a relaxation. Specifically, a latent variable $c \in (0,1)^K$ is assigned to each $x$ as a replacement. Each entry $c(k)$ yields the reparametrization $c(k) = \mathrm{Softmax}_k\left((\log\pi(k) + \epsilon(k))/\lambda\right)$, where $\epsilon \sim \mathrm{Gumbel}(0,1)$ and $\lambda$ is another temperature hyperparameter. Hence, we obtain the surrogate $\mathbb{E}_{q_\theta(k \mid x)}\left[\log p_\theta(i \mid x, k)\right] \approx \mathbb{E}_{\epsilon}\left[\log p_\theta(i \mid x, c)\right]$, and the gradients can be estimated with Monte Carlo sampling.

Instance-Level Contrastive Learning. In alignment with Eq. (5), $\log p_\theta(i \mid x, c)$ learns the representation of $x$ on a momentum-contrast basis [24] by defining the following transformation:

$$e = \frac{f_\theta(x) + \mathrm{NN}_\theta(c)}{\left\lVert f_\theta(x) + \mathrm{NN}_\theta(c)\right\rVert_2}, \qquad \log p_\theta(i \mid x, c) = \log \frac{\exp\left(\hat e^\top e/\tau\right)}{\exp\left(\hat e^\top e/\tau\right) + \sum_{j=1}^{J}\exp\left(q_j^\top e/\tau\right)}, \qquad (7)$$

where $\mathrm{NN}_\theta(\cdot)$ denotes a single fully-connected network. We accordingly use $\hat e$ to indicate the representation of $x$ processed by the momentum networks $f_{\hat\theta}(\cdot)$ and $\mathrm{NN}_{\hat\theta}(\cdot)$ under different augmentations and Gumbel samplings [32, 57]. A $J$-sized memory queue $Q = \{q_j\}_{j=1}^{J}$ is also introduced to cache negative samples and is updated with $\hat e$. In this way, we obtain the instance-level loss:

$$L_2 = \mathbb{E}_i\left[-\mathbb{E}_{\epsilon_i}\left[\log p_\theta(i \mid x_i, c_i)\right] - H\left(q_\theta(k \mid x_i)\right) + \log K\right]. \qquad (8)$$

3.3 Training and Inference

Algorithm 1: Training Algorithm of TCC
Input: dataset $X = \{x_i\}_{i=1}^{N}$. Output: network parameters $\theta$.
1. Initialize $\hat\theta = \theta$.
2. Repeat until convergence or reaching the maximum number of iterations:
   (a) Randomly select a mini-batch from $X$.
   (b) For each $x_i$ in the batch, randomly augment $x_i$ twice and sample $c_i$ with the Gumbel softmax.
   (c) Compute $L$ by Eq. (9) and update $\theta \leftarrow \theta - \Gamma(\nabla_\theta L)$.
   (d) Update the queue $P$ with $\hat r$ and the queue $Q$ with $\hat e$.
   (e) Update $\hat\theta$ with the momentum moving average.

TCC enables end-to-end training from scratch. Our learning objective is a simple convex combination of Eqs. (5) and (8), i.e.,

$$L = \alpha L_1 + (1-\alpha) L_2. \qquad (9)$$

The hyperparameter $\alpha \in (0,1)$ controls the contributions of the two contrastive learning tracks. As discussed in Sec. 3.1, $L$ is computed following a batch-based routine. For each data point $x$, we obtain only one sample $c$ from the Gumbel distribution at each step, since this is usually sufficient for long-term training [39, 40, 69]. One may also regard this stochasticity as an alternative to data augmentation. The overall training algorithm is shown in Alg. 1, where $\Gamma(\cdot)$ indicates an arbitrary stochastic gradient descent (SGD) optimizer. All trainable components are subscripted by $\theta$, while those marked with $\hat\theta$ are the momentum counterparts of the networks, updated with the momentum moving average. Inference with TCC only requires disabling random data augmentation and then computing $\arg\max_k q_\theta(k \mid x)$.

Complexity. When sampling once for each $x$ during training, the time complexity of Eq. (9) is $O(L+J)$, and the memory complexity of the memory banks is the same. Here we omit the complexity introduced by the CNN backbone and the dot-product computation, as it is orthogonal to the design. Compared with the recent mixture-of-experts approach [73], which requires a time and memory complexity of $O(KJ)$, TCC is trained in a more efficient way.
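For reference, a minimal sketch of the Gumbel softmax sampling used in Alg. 1 and of the instance-level terms in Eq. (7) is given below; the single linear map standing in for $\mathrm{NN}_\theta(\cdot)$ and all array values are illustrative assumptions, not the trained model.

```python
import numpy as np

def gumbel_softmax_sample(pi, lam=0.8, rng=None):
    """Relaxed sample c of Sec. 3.2: c(k) = softmax_k((log pi(k) + eps(k)) / lam)."""
    if rng is None:
        rng = np.random.default_rng()
    u = rng.uniform(1e-8, 1.0, size=pi.shape)
    eps = -np.log(-np.log(u))                          # Gumbel(0, 1) noise
    logits = (np.log(pi) + eps) / lam
    logits -= logits.max()
    e = np.exp(logits)
    return e / e.sum()

def instance_embedding(f_x, c, W):
    """e of Eq. (7); NN_theta(c) is replaced here by a single linear map W (an assumption)."""
    z = f_x + W @ c
    return z / np.linalg.norm(z)

def instance_log_likelihood(e, e_hat, queue, tau=1.0):
    """log p_theta(i | x, c) of Eq. (7) with a J-sized queue Q of cached negatives."""
    pos = np.exp(e_hat @ e / tau)
    neg = np.exp(queue @ e / tau).sum()
    return np.log(pos / (pos + neg))

# toy usage: K = 3, d_m = 8, and J = 16 cached negatives
rng = np.random.default_rng(0)
pi = np.array([0.7, 0.2, 0.1])
W = rng.normal(size=(8, 3))
f_x = rng.normal(size=8)
e = instance_embedding(f_x, gumbel_softmax_sample(pi, rng=rng), W)
e_hat = instance_embedding(f_x + 0.05 * rng.normal(size=8), gumbel_softmax_sample(pi, rng=rng), W)
Q = rng.normal(size=(16, 8))
Q /= np.linalg.norm(Q, axis=1, keepdims=True)
print(instance_log_likelihood(e, e_hat, Q, tau=1.0))
```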
3.4 Relations to Existing Works

MiCE [73] also proposes a lower bound for the instance-level contrastive objective. However, it does not directly reparametrize the variational model $q_\theta(k \mid x)$ for lower-bound computation and inference, but instead employs a $K$-expert solution with EM. This design is less efficient than TCC, since each data point needs to be processed by all $K$ experts. Moreover, MiCE [73] does not consider cluster-level discriminability. SCL [28] follows a similar motivation to TCC regarding cluster-level discriminability, but it implements this with an instance-to-set similarity, while our model learns a unified representation for each cluster. Furthermore, in SCL [28], the clustering inference model is disentangled from the instance-level contrastive objective. In contrast, the inference model $q_\theta(k \mid x)$ of TCC contributes to instance-level discrimination (Eq. (6)). We recently found that CC [53] also comes with a cluster-level contrastive loss. It utilizes the in-batch inference results $[\pi_1(k),\dots,\pi_n(k)]$ to describe the $k$-th cluster. However, this procedure does not literally learn a cluster representation, since it is not permutation-invariant: re-ordering the batch may shift the semantics of the produced feature. We mitigate this issue with deep sets [87] and the empirical cluster-level augmentations for temporal consistency [45]. A similar problem is witnessed in [66]. In addition, our instance-level discrimination model yields a more general case than that of CC [53]: when removing the stochasticity and enforcing $p_\theta(i \mid x, c) := p_\theta(i \mid c)$ in our model, $L_2$ reduces to the loss of [53]. We experimentally show that our design preserves more data semantics and thus benefits clustering. As this is not central to our main contribution, we provide more elaboration in Appendix B under the framework of the variational information bottleneck [2].

4 Related Work

Deep Clustering. In addition to the classic approaches [4, 13, 18, 20, 43, 56, 59, 78, 91], the concept of simultaneous feature learning and clustering with deep models can be traced back to [81, 83]. The successors, including [9, 23, 61, 68, 77, 79, 85], have continuously improved the performance since. As a conventional option for unsupervised learning, deep generative models are also widely adopted in clustering [10, 16, 35, 36, 42, 51, 58, 86, 90], usually backboned by VAEs [39] and GANs [21]. However, generators are computationally expensive for end-to-end training, and are often less effective than discriminative models [15, 27, 33] in feature learning [11]. Recent research has considered contrastive learning in clustering [28, 53, 73, 74, 89]. We discuss their drawbacks and their relations to TCC in Sec. 3.4.

Contrastive Learning. Contrastive learning learns compact image representations in a self-supervised manner [11, 12, 24, 62, 72]. There are various applications of this paradigm [34, 41, 82, 93]. We note that several contrastive learning approaches [7, 52] conceptually involve a clustering procedure. Nevertheless, they are based on a unified pre-training framework to benefit downstream tasks, instead of delivering a specific clustering model.

Set Representations. Our cluster-level representation (Eq. (4)) is a realization of deep sets [17, 87]. Existing research in this area mainly focuses on set-level tasks [19, 31, 37, 76, 84]. It is also notable that, though we leverage cluster-level representation learning, TCC is still an instance-level clustering model, which is different from the set-level clustering models [49, 50, 63].
5 Experiments

5.1 Settings

Table 1: Dataset settings for our experiments.

| Dataset | Images | Clusters (K) | Input Size |
|---|---|---|---|
| CIFAR-10 [44] | 60,000 | 10 | 32 × 32 |
| CIFAR-100 [44] | 60,000 | 20 | 32 × 32 |
| STL-10 [14] | 13,000 | 10 | 96 × 96 |
| ImageNet-10 [9] | 13,000 | 10 | 96 × 96 |
| ImageNet-Dog [9] | 19,500 | 15 | 96 × 96 |

We follow the recent works [29, 33] and report the performance of TCC in terms of clustering accuracy (ACC) [81], normalized mutual information (NMI) [70], and the adjusted Rand index (ARI) [30]. For fair comparison with existing works, we do not use any supervised pre-trained models. The experiments are conducted on five benchmark datasets, including CIFAR-10/100 [44], ImageNet-10/Dog [9], and STL-10 [14]. Note that ImageNet-10/Dog [9] are subsets of the original ImageNet dataset [67]. Since most existing works use pre-defined cluster numbers, we adopt this practice and follow their training/test protocols [29, 33, 61, 73]. Tab. 1 lists the details of these settings.

5.2 Implementation Details

TCC is implemented with the deep learning toolbox TensorFlow [1]. We choose the MoCo-style random image augmentations [24] for fair comparison with the recent works [28, 73]; we further link this choice of augmentations to the cluster representation temporal consistency in Sec. 3.1. Specifically, each image is successively processed by random cropping, gray-scaling, color jittering, and horizontal flipping, followed by mean-std standardization. We refer to [24, 73] for more details. We employ ResNet-34 [25] as the default CNN backbone $f_\theta(\cdot)$, identical to [28, 73]. Appendix C gives a full illustration of the CNN structure. The image size $d_x$ and the cluster number $K$ are fixed for each dataset, as shown in Tab. 1. The feature dimensionality produced by the CNN is $d_m = 128$. Following common practice [11, 12, 24], we fix the contrastive temperature $\tau = 1$, while using a slightly lower $\lambda = 0.8$ for the Gumbel softmax trick [32, 57] to encourage concrete assignments. We implement a fixed-length instance-level memory bank $Q$ with a size of $J = 12{,}800$ to match the smallest dataset in our experiments. The size of the cluster-level memory bank $P$ is set to $L = 100 \cdot K$, varying across datasets. We set $\alpha = 0.5$ so that $L_1$ and $L_2$ contribute equally to training. The choice of batch size is important for TCC when computing the cluster-level representations; we set it to $32 \cdot K$ by default to ensure that sufficient images can be assigned to each cluster at every step.
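The metrics listed in Sec. 5.1 are standard; a common way to compute them, e.g., with scikit-learn and the Hungarian matching from SciPy, is sketched below. This mirrors the usual definitions rather than the authors' evaluation code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def clustering_accuracy(y_true, y_pred):
    """ACC: best one-to-one mapping between predicted clusters and labels (Hungarian matching)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    k = max(y_true.max(), y_pred.max()) + 1
    counts = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        counts[p, t] += 1                                # confusion counts
    row, col = linear_sum_assignment(-counts)            # maximize the matched counts
    return counts[row, col].sum() / y_true.size

# toy usage: predicted cluster ids are a permutation of the ground-truth labels
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([1, 1, 0, 0, 2, 2])
print(clustering_accuracy(y_true, y_pred))               # 1.0
print(normalized_mutual_info_score(y_true, y_pred))      # 1.0
print(adjusted_rand_score(y_true, y_pred))               # 1.0
```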
Table 2: Unsupervised clustering performance comparison with existing methods (in percentage %). We provide additional results on Tiny ImageNet [46] and comparisons with more contrastive baselines such as SwAV [7] in Appendix D. Each cell reports NMI / ACC / ARI.

| Method | CIFAR-10 | CIFAR-100 | STL-10 | ImageNet-10 | ImageNet-Dog |
|---|---|---|---|---|---|
| AC [22] | 10.5 / 22.8 / 6.5 | 9.8 / 13.8 / 3.4 | 23.9 / 33.2 / 14.0 | 13.8 / 24.2 / 6.7 | 3.7 / 13.9 / 2.1 |
| NMF [5] | 8.1 / 19.0 / 3.4 | 7.9 / 11.8 / 2.6 | 9.6 / 18.0 / 4.6 | 13.2 / 23.0 / 6.5 | 4.4 / 11.8 / 1.6 |
| AE [3] | 23.9 / 31.4 / 16.9 | 10.0 / 16.5 / 4.8 | 25.0 / 30.3 / 16.1 | 21.0 / 31.7 / 15.2 | 10.4 / 18.5 / 7.3 |
| DAE [75] | 25.1 / 29.7 / 16.3 | 11.1 / 15.1 / 4.6 | 22.4 / 30.2 / 15.2 | 20.6 / 30.4 / 13.8 | 10.4 / 19.0 / 7.8 |
| DCGAN [65] | 26.5 / 31.5 / 17.6 | 12.0 / 15.1 / 4.5 | 21.0 / 29.8 / 13.9 | 22.5 / 34.6 / 15.7 | 12.1 / 17.4 / 7.8 |
| DeCNN [88] | 24.0 / 28.2 / 17.4 | 9.2 / 13.3 / 3.8 | 22.7 / 29.9 / 16.2 | 18.6 / 31.3 / 14.2 | 9.8 / 17.5 / 7.3 |
| VAE [39] | 24.5 / 29.1 / 16.7 | 10.8 / 15.2 / 4.0 | 20.0 / 28.2 / 14.6 | 19.3 / 33.4 / 16.8 | 10.7 / 17.9 / 7.9 |
| JULE [85] | 19.2 / 27.2 / 13.8 | 10.3 / 13.7 / 3.3 | 18.2 / 27.7 / 16.4 | 17.5 / 30.0 / 13.8 | 5.4 / 13.8 / 2.8 |
| DEC [81] | 25.7 / 30.1 / 16.1 | 13.6 / 18.5 / 5.0 | 27.6 / 35.9 / 18.6 | 28.2 / 38.1 / 20.3 | 12.2 / 19.5 / 7.9 |
| DAC [9] | 39.6 / 52.2 / 30.6 | 18.5 / 23.8 / 8.8 | 36.6 / 47.0 / 25.7 | 39.4 / 52.7 / 30.2 | 21.9 / 27.5 / 11.1 |
| ADC [23] | - / 32.5 / - | - / 16.0 / - | - / 53.0 / - | - / - / - | - / - / - |
| DDC [8] | 42.4 / 52.4 / 32.9 | - / - / - | 37.1 / 48.9 / 26.7 | 43.3 / 57.7 / 34.5 | - / - / - |
| DCCM [79] | 49.6 / 62.3 / 40.8 | 28.5 / 32.7 / 17.3 | 37.6 / 48.2 / 26.2 | 60.8 / 71.0 / 55.5 | 32.1 / 38.3 / 18.2 |
| IIC [33] | 51.3 / 61.7 / 41.1 | - / 25.7 / - | 43.1 / 49.9 / 29.5 | - / - / - | - / - / - |
| MMDC [68] | 57.2 / 70.0 / - | 25.9 / 31.2 / - | 49.8 / 61.1 / - | 71.9 / 81.1 / - | 27.4 / 11.9 / - |
| PICA [29] | 56.1 / 64.5 / 46.7 | 29.6 / 32.2 / 15.9 | - / - / - | 78.2 / 85.0 / 73.3 | 33.6 / 32.4 / 17.9 |
| DCCS [92] | 56.9 / 65.6 / 46.9 | - / - / - | 37.6 / 48.2 / 26.2 | 60.8 / 71.0 / 55.5 | - / - / - |
| DHOG [15] | 58.5 / 66.6 / 49.2 | 25.8 / 26.1 / 11.8 | 41.3 / 48.3 / 27.2 | - / - / - | - / - / - |
| GATCluster [61] | 47.5 / 61.0 / 40.2 | 21.5 / 28.1 / 11.6 | 44.6 / 58.3 / 36.3 | 59.4 / 73.9 / 55.2 | 28.1 / 32.2 / 16.3 |
| IDFD [71] | 71.4 / 81.5 / 66.3 | 42.6 / 42.5 / 26.4 | 64.3 / 75.6 / 57.5 | 89.8 / 95.4 / 90.1 | 54.6 / 59.1 / 41.3 |
| CC [53] | 70.5 / 79.0 / 63.7 | 43.1 / 42.9 / 26.6 | 76.4 / 85.0 / 72.6 | 85.9 / 89.3 / 82.2 | 44.5 / 42.9 / 27.4 |
| MoCo baseline [73] | 66.9 / 77.6 / 60.8 | 39.0 / 39.7 / 24.2 | 61.5 / 72.8 / 52.4 | - / - / - | 34.7 / 33.8 / 19.7 |
| MiCE [73] | 73.7 / 83.5 / 69.8 | 43.6 / 44.0 / 28.0 | 63.5 / 75.2 / 57.5 | - / - / - | 42.3 / 43.9 / 28.6 |
| TCC | 79.0 / 90.6 / 73.3 | 47.9 / 49.1 / 31.2 | 73.2 / 81.4 / 68.9 | 84.8 / 89.7 / 82.5 | 55.4 / 59.5 / 41.7 |

Training TCC only requires SGD w.r.t. $\theta$ and the momentum update w.r.t. $\hat\theta$. We employ the Adam optimizer [38] with a default learning rate of $3 \times 10^{-3}$, without learning rate scheduling. The momentum network is updated by $\hat\theta \leftarrow 0.999\,\hat\theta + 0.001\,\theta$, where all modules subscripted by $\hat\theta$ are involved in this procedure. We train TCC for at least 1,000 epochs on a single NVIDIA V100 GPU.
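A minimal sketch of the momentum moving average and the FIFO queue updates described above is given below; the parameter shapes and queue sizes are illustrative assumptions.

```python
import numpy as np

def momentum_update(theta_hat, theta, m=0.999):
    """Momentum moving average: theta_hat <- m * theta_hat + (1 - m) * theta, per parameter tensor."""
    return [m * p_hat + (1.0 - m) * p for p_hat, p in zip(theta_hat, theta)]

def enqueue(queue, new_keys):
    """FIFO memory queue: append the newest keys and drop the oldest, keeping a fixed length."""
    return np.concatenate([queue[new_keys.shape[0]:], new_keys], axis=0)

# toy usage: two parameter tensors and a queue of 8 cached 4-d embeddings
rng = np.random.default_rng(0)
theta = [rng.normal(size=(3, 3)), rng.normal(size=(3,))]
theta_hat = [np.zeros((3, 3)), np.zeros(3)]
theta_hat = momentum_update(theta_hat, theta)              # the momentum copy slowly tracks theta

Q = rng.normal(size=(8, 4))
Q = enqueue(Q, rng.normal(size=(2, 4)))                    # two new keys replace the two oldest
print(theta_hat[1], Q.shape)
```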
5.3 Comparison with the State-of-the-Art

Figure 3: t-SNE visualization on CIFAR-10 [44] and STL-10 [14].

Baselines. Both deep clustering and traditional models are compared, including a MoCo-based two-stage baseline introduced by [73]. Similar to the recent works [29, 33, 73, 79], we pick deep models that can be trained from scratch and do not require supervised pre-training parameters, for a fair and reasonable comparison. For this reason, baselines such as VaDE [36] and SPICE [60] are not included here. We also exclude the clustering refinement approaches [64] from our comparison, as they are orthogonal to our design.

Results. The clustering performance (in percentage %) is shown in Tab. 2. For the baselines that are not designed for clustering [3, 21, 39], we report the results of k-means on the produced features. TCC outperforms existing works on most benchmarks. In particular, on CIFAR-10 [44], TCC outperforms the state-of-the-art methods by large margins, e.g., 7% higher ACC than the second-best method (i.e., MiCE [73], which even uses stronger augmentations [11]). As a closely related work, MiCE [73] only considers instance-level representation learning. The performance gain of TCC over MiCE endorses our motivation to introduce cluster-level representations. We also observe that TCC underperforms [53] on STL-10 [14], due to its exceptionally high performance on this dataset. However, TCC is still the runner-up on this dataset by a significant margin, and is superior to [53] on the other four datasets. We argue that performance on larger datasets matters more when comparing contrastive deep clustering methods, as contrastive learning is originally designed for large-scale tasks. Fig. 3 illustrates the t-SNE [55] scattering results of TCC on CIFAR-10 [44] and STL-10 [14].

Figure 4: Hyperparameter analysis results on CIFAR-10 [44], reporting NMI, ACC, and ARI (in %) w.r.t. (a) the loss weight α, (b) the Gumbel softmax temperature λ, (c) the cluster-level queue length L (to be multiplied by K), and (d) the batch size (to be multiplied by K).

5.4 Ablation Study

We conduct an ablation study to validate our motivation and design, with the following baselines.

(i) Without L1. As a key component of TCC, the cluster-level contrastive learning objective $L_1$ reflects our main motivation. We first assess the model performance when removing this loss, which reduces TCC to a simple instance-level contrastive clustering model.

(ii) Without L2. We can also remove $L_2$ to see whether $L_1$ alone still yields a valid baseline.

(iii) Multiple Sampling. As described in Sec. 3.3, we take only a single sample $c$ each time to compute the lower bound of the instance-level loss (Eq. (8)). We also consider applying the Gumbel softmax trick multiple times for each image. In particular, we sample 10 groups of latent variables each time to compute the expectation term $\mathbb{E}_{\epsilon}[\log p_\theta(i \mid x, c)]$ of $L_2$. On each batch, we enqueue the mean of $\hat e$ w.r.t. all 10 sampled $c$ for each image.

(iv) Without P. Since we usually have a small cluster number $K$, computing the cluster-level InfoNCE loss does not necessarily require a memory bank to cache the negative sample surrogates. In this baseline, we remove the cluster-level memory bank $P$ and use the remaining $K-1$ cluster representations as negative samples when computing Eq. (5).

(v) Without Augmentation (a) for L1. We validate our cluster-level augmentation strategies by removing image augmentations when computing the cluster-level objective (Eq. (5)). Note that this baseline does not influence the instance-level objective, as both the augmented and the original images are rendered to $f_\theta(\cdot)$.

(vi) Without Augmentation (b) for L1. This baseline requires hard assignments at each step, so that the cluster-level aggregation only involves images that are assigned to the corresponding clusters. This modification does not affect $L_2$.

(vii) Without Augmentation (c) for L1. The final baseline changes the training pipeline. Since we do not subset any clusters here, cluster representation aggregation (Eq. (4)) runs on the whole training set after each epoch. We apply an alternating training procedure as follows: first, $L_2$ optimizes the model for a full epoch; then we descend $L_1$ with the aggregated cluster features, and repeat.
Table 3: Ablation study results (in percentage %).

| Baseline (CIFAR-10) | NMI | ACC | ARI |
|---|---|---|---|
| (i) Without L1 | 68.9 | 78.7 | 57.9 |
| (ii) Without L2 | 37.1 | 45.4 | 24.5 |
| (iii) Multiple Sampling | 78.5 | 90.1 | 74.2 |
| (iv) Without P | 72.0 | 82.9 | 68.8 |
| (v) Without Augmentation (a) for L1 | 73.5 | 85.3 | 69.1 |
| (vi) Without Augmentation (b) for L1 | 68.5 | 79.2 | 60.6 |
| (vii) Without Augmentation (c) for L1 | 69.4 | 80.0 | 62.7 |
| TCC Full Model | 79.0 | 90.6 | 73.3 |

Baseline Comparison Results. We show the ablation study results in Tab. 3. Without $L_1$, TCC performs similarly to the two-stage baseline with MoCo [24] and k-means [56] reported in [73], which is slightly lower than MiCE [73]. Interestingly, $L_1$ does not provide instance-level discriminative information. Although it still serves as a valid baseline, it does not perform very well (Baseline (ii)). Specifically, we experience strong degeneracy [6, 33] with this baseline, but it still produces better results than the traditional models. We also observe that drawing multiple samples with the Gumbel softmax trick [32, 57] for gradient estimation does not make much difference compared with the single-sample solution. Baseline (iv) also underperforms the original model. As discussed in previous sections, having a memory bank for cluster representations provides a way to acquire more negative samples for contrastive learning, given that the cluster number is usually limited.

Hyperparameters. We evaluate the hyperparameters most essential to our design, including the loss weight $\alpha$, the temperature of the Gumbel softmax $\lambda$, the cluster-level memory queue length $L$, and the batch size. The InfoNCE temperature $\tau$ and the instance-level memory queue length $J$ are not included here, since they are not relevant to our key motivation and have been employed and evaluated in the recent works [24, 28, 73]. The corresponding results are plotted in Fig. 4. Though $L_1$ plays an essential role in the proposed model, large values of $\alpha$ do not improve the performance, as the key instance-level semantics still need to be learnt by $L_2$. Only a reasonable proportion of $L_1$ in the overall learning objective, e.g., $\alpha = 0.25$ or $0.5$, improves the performance of our model. Further, we find that TCC is not very sensitive to the Gumbel softmax temperature $\lambda$, while a moderate hardness of the softmax produces the best results. Empirically, a large batch size benefits TCC, since more data can be involved in the subset of each cluster, and hence the aggregated features on each batch can be more representative. Fig. 4 (d) endorses this intuition. However, training with extremely large batch sizes may lead to out-of-memory problems with large images. To enable training on a single device, we opt for a fixed batch size of $32 \cdot K$ in all experiments.

Figure 5: (a) and (b): On-batch ACC comparison between TCC and MiCE [73] w.r.t. training epochs and training time, respectively, on CIFAR-10 [44]. (c): The values of the DEC loss [81] during training. Note that TCC is not trained with DEC; we record the values for illustration only.

5.5 More Results

Figure 6: Histograms of cluster assignments during training on CIFAR-10 [44].

Training Time. We compare the training epochs (Fig. 5 (a)) and the training time (Fig. 5 (b)) of TCC and a re-implementation of MiCE [73] with the same optimizer setting. As discussed in Sec. 3.3, MiCE has a higher training time complexity than TCC. This is reflected in Fig. 5 (b), though not in a linearly proportional manner.
In addition, TCC requires fewer training steps than MiCE to reach its best-performing results.

Conventional Clustering Losses. During training, we also cache and observe the DEC loss [81], although we do not optimize the model with it. In Fig. 5 (c), we show that by minimizing $L$ (Eq. (9)), the traditional DEC loss [81] also decreases. This implicitly endorses our design.

Assumptions in Design. One merit of contrastive learning is that one does not need to assume any empirical prior distribution over the feature space, which benefits TCC when learning the cluster-level representations. The only assumption we employ is that the true posterior $p_\theta(k \mid x)$ is uniform, which simplifies the computation of the KL divergence in Eq. (6). As previously discussed, this conventional relaxation [69] is intuitively valid, since we generally expect evenly assigned clusters. Fig. 6 illustrates that TCC achieves this during training by minimizing $\mathrm{KL}\left(q_\theta(k \mid x)\,\|\,p_\theta(k \mid x)\right) = \log K - H\left(q_\theta(k \mid x)\right)$.

6 Conclusion

Inspired by the recent success of self-supervised learning, this paper proposes a multi-granularity contrastive clustering framework to exploit the holistic context of a cluster in an unsupervised manner. The proposed TCC simultaneously learns instance- and cluster-level representations by leveraging cluster assignment variables. Cluster-level augmentation equivalents are derived to enable on-the-fly contrastive learning on clusters. Moreover, by reparametrizing the assignment variables, TCC can be trained end-to-end without auxiliary steps. Extensive experiments validate the superiority of TCC, which consistently outperforms competitors on the five benchmarks, often by large margins, echoing our major motivation, i.e., we are not clustering alone.

Acknowledgments

This work is supported by the UKRI grant: Turing AI Fellowship EP/W002981/1 and the EPSRC/MURI grant: EP/N019474/1. We also acknowledge the philanthropic support of the donors to the University of Oxford's COVID-19 Research Response Fund: BRD00230. We would like to thank the Royal Academy of Engineering and FiveAI.

References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467, 2016. [2] Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. In ICLR, 2017. [3] Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle, et al. Greedy layer-wise training of deep networks. In NeurIPS, 2007. [4] Christopher M Bishop. Pattern recognition and machine learning. Springer, 2006. [5] Deng Cai, Xiaofei He, Xuanhui Wang, Hujun Bao, and Jiawei Han. Locality preserving nonnegative matrix factorization. In IJCAI, 2009. [6] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018. [7] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. In NeurIPS, 2020. [8] Jianlong Chang, Yiwen Guo, Lingfeng Wang, Gaofeng Meng, Shiming Xiang, and Chunhong Pan. Deep discriminative clustering analysis. In CVPR, 2019. [9] Jianlong Chang, Lingfeng Wang, Gaofeng Meng, Shiming Xiang, and Chunhong Pan. Deep adaptive image clustering. In ICCV, 2017. [10] Shlomo E Chazan, Sharon Gannot, and Jacob Goldberger.
Deep clustering based on a mixture of autoencoders. In MLSP, 2019. 6 [11] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020. 1, 2, 3, 4, 6, 7 [12] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. ar Xiv preprint ar Xiv:2003.04297, 2020. 1, 2, 3, 6 [13] Yizong Cheng. Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8):790 799, 1995. 6 [14] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In AISTATS, 2011. 6, 7 [15] Luke Nicholas Darlow and Amos Storkey. Dhog: Deep hierarchical object grouping. ar Xiv preprint ar Xiv:2003.08821, 2020. 6, 7 [16] Nat Dilokthanakul, Pedro AM Mediano, Marta Garnelo, Matthew CH Lee, Hugh Salimbeni, Kai Arulkumaran, and Murray Shanahan. Deep unsupervised clustering with gaussian mixture variational autoencoders. ar Xiv preprint ar Xiv:1611.02648, 2016. 1, 6 [17] Harrison Edwards and Amos Storkey. Towards a neural statistician. In ICML, 2017. 6 [18] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In SIGKDD, 1996. 6 [19] Ji Feng and Zhi-Hua Zhou. Deep miml network. In AAAI, 2017. 6 [20] Brendan J Frey and Delbert Dueck. Clustering by passing messages between data points. science, 315(5814):972 976, 2007. 6 [21] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Neur IPS, 2014. 6, 7 [22] K Chidananda Gowda and G Krishna. Agglomerative clustering using the concept of mutual nearest neighbourhood. Pattern recognition, 10(2):105 112, 1978. 7 [23] Philip Haeusser, Johannes Plapp, Vladimir Golkov, Elie Aljalbout, and Daniel Cremers. Associative deep clustering: Training a classification network with no labels. In German Conference on Pattern Recognition, 2018. 6, 7 [24] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020. 1, 2, 3, 4, 5, 6, 8 [25] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016. 6 [26] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. ar Xiv preprint ar Xiv:1503.02531, 2015. 2 [27] Weihua Hu, Takeru Miyato, Seiya Tokui, Eiichi Matsumoto, and Masashi Sugiyama. Learning discrete representations via information maximizing self-augmented training. In ICML, 2017. 6 [28] Jiabo Huang and Shaogang Gong. Deep clustering by semantic contrastive learning. ar Xiv preprint ar Xiv:2103.02662, 2021. 5, 6, 8 [29] Jiabo Huang, Shaogang Gong, and Xiatian Zhu. Deep semantic clustering by partition confidence maximisation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8849 8858, 2020. 6, 7 [30] Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of classification, 2(1):193 218, 1985. 6 [31] Maximilian Ilse, Jakub Tomczak, and Max Welling. Attention-based deep multiple instance learning. In ICML, 2018. 3, 6 [32] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. In ICLR, 2017. 4, 5, 6, 8 [33] Xu Ji, João F Henriques, and Andrea Vedaldi. 
Invariant information clustering for unsupervised image classification and segmentation. In ICCV, 2019. 6, 7, 8 [34] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. ar Xiv preprint ar Xiv:2102.05918, 2021. 1, 6 [35] Zhuxi Jiang, Yin Zheng, Huachun Tan, Bangsheng Tang, and Hanning Zhou. Variational deep embedding: An unsupervised and generative approach to clustering. ar Xiv preprint ar Xiv:1611.05148, 2016. 6 [36] Zhuxi Jiang, Yin Zheng, Huachun Tan, Bangsheng Tang, and Hanning Zhou. Variational deep embedding: An unsupervised and generative approach to clustering. In IJCAI, 2017. 1, 6, 7 [37] Hyunjik Kim, Andriy Mnih, Jonathan Schwarz, Marta Garnelo, Ali Eslami, Dan Rosenbaum, Oriol Vinyals, and Yee Whye Teh. Attentive neural processes. In ICLR, 2019. 6 [38] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015. 7 [39] Diederik Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014. 5, 6, 7 [40] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Neur IPS, 2014. 4, 5 [41] Alexander Kolesnikov, Xiaohua Zhai, and Lucas Beyer. Revisiting self-supervised visual representation learning. In CVPR, 2019. 6 [42] Andreas Kopf, Vincent Fortuin, Vignesh Ram Somnath, and Manfred Claassen. Mixture-of-experts variational autoencoder for clustering and generating from similarity-based representations. ar Xiv preprint ar Xiv:1910.07763, 2019. 6 [43] Hans-Peter Kriegel, Peer Kröger, Jörg Sander, and Arthur Zimek. Density-based clustering. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(3):231 240, 2011. 6 [44] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical Report, University of Toronto, 2009. 6, 7, 8, 9 [45] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. In ICLR, 2017. 4, 5 [46] Ya Le and Xuan Yang. Tiny imagenet visual recognition challenge. CS 231N, 7(7):3, 2015. 7 [47] Yann Le Cun, Yoshua Bengio, et al. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10):1995, 1995. 3 [48] Yann Le Cun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998. 3 [49] Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. Set transformer: A framework for attention-based permutation-invariant neural networks. In ICML, 2019. 3, 6 [50] Juho Lee, Yoonho Lee, and Yee Whye Teh. Deep amortized clustering. ar Xiv preprint ar Xiv:1909.13433, 2019. 6 [51] Chongxuan Li, Max Welling, Jun Zhu, and Bo Zhang. Graphical generative adversarial networks. In Neur IPS, 2018. 6 [52] Junnan Li, Pan Zhou, Caiming Xiong, Richard Socher, and Steven CH Hoi. Prototypical contrastive learning of unsupervised representations. ar Xiv preprint ar Xiv:2005.04966, 2020. 6 [53] Yunfan Li, Peng Hu, Zitao Liu, Dezhong Peng, Joey Tianyi Zhou, and Xi Peng. Contrastive clustering. ar Xiv preprint ar Xiv:2009.09687, 2020. 5, 6, 7 [54] Zhihan Li, Youjian Zhao, Haowen Xu, Wenxiao Chen, Shangqing Xu, Yilin Li, and Dan Pei. Unsupervised clustering through gaussian mixture variational autoencoder with non-reparameterized variational inference and std annealing. In IJCNN, 2020. 
1 [55] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of Machine Learning Research, 9(Nov):2579 2605, 2008. 7 [56] James Mac Queen et al. Some methods for classification and analysis of multivariate observations. In Berkeley symposium on mathematical statistics and probability, 1967. 1, 6, 8 [57] Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. In ICLR, 2017. 4, 5, 6, 8 [58] Sudipto Mukherjee, Himanshu Asnani, Eugene Lin, and Sreeram Kannan. Clustergan: Latent space clustering in generative adversarial networks. In AAAI, 2019. 6 [59] Andrew Ng, Michael Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Neur IPS, 2001. 1, 6 [60] Chuang Niu and Ge Wang. Spice: Semantic pseudo-labeling for image clustering. ar Xiv preprint ar Xiv:2103.09382, 2021. 7 [61] Chuang Niu, Jun Zhang, Ge Wang, and Jimin Liang. Gatcluster: Self-supervised gaussian-attention network for image clustering. In ECCV, 2020. 6, 7 [62] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. ar Xiv preprint ar Xiv:1807.03748, 2018. 1, 2, 6 [63] Ari Pakman, Yueqi Wang, Catalin Mitelut, Jin Hyung Lee, and Liam Paninski. Neural clustering processes. In ICML, 2020. 6 [64] Sungwon Park, Sungwon Han, Sundong Kim, Danu Kim, Sungkyu Park, Seunghoon Hong, and Meeyoung Cha. Improving unsupervised image clustering with robust learning. ar Xiv preprint ar Xiv:2012.11150, 2020. 7 [65] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016. 7 [66] Jayanth Reddy Regatti, Aniket Anand Deshmukh, Eren Manavoglu, and Urun Dogan. Consensus clustering with unsupervised representation learning. In IJCNN, 2021. 5 [67] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211 252, 2015. 6 [68] Guy Shiran and Daphna Weinshall. Multi-modal deep clustering: Unsupervised partitioning of images. In ICPR, 2021. 6, 7 [69] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In Neur IPS, 2015. 4, 5, 9 [70] Alexander Strehl and Joydeep Ghosh. Cluster ensembles a knowledge reuse framework for combining multiple partitions. Journal of machine learning research, 3(Dec):583 617, 2002. 6 [71] Yaling Tao, Kentaro Takagi, and Kouta Nakata. Clustering-friendly representation learning via instance discrimination and feature decorrelation. In ICLR, 2021. 7 [72] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. ar Xiv preprint ar Xiv:1906.05849, 2019. 6 [73] Tsung Wei Tsai, Chongxuan Li, and Jun Zhu. Mice: Mixture of contrastive experts for unsupervised image clustering. In ICLR, 2021. 1, 5, 6, 7, 8, 9 [74] Wouter Van Gansbeke, Simon Vandenhende, Stamatios Georgoulis, Marc Proesmans, and Luc Van Gool. Scan: Learning to classify images without labels. In ECCV, 2020. 6 [75] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, Pierre-Antoine Manzagol, and Léon Bottou. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research, 11(12), 2010. 
7 [76] Edward Wagstaff, Fabian Fuchs, Martin Engelcke, Ingmar Posner, and Michael A Osborne. On the limitations of representing functions on sets. In ICML, 2019. 6 [77] Jingyu Wang, Zhenyu Ma, Feiping Nie, and Xuelong Li. Progressive self-supervised clustering with novel category discovery. IEEE Transactions on Cybernetics, 2021. 6 [78] Joe H Ward Jr. Hierarchical grouping to optimize an objective function. Journal of the American statistical association, 58(301):236 244, 1963. 6 [79] Jianlong Wu, Keyu Long, Fei Wang, Chen Qian, Cheng Li, Zhouchen Lin, and Hongbin Zha. Deep comprehensive correlation mining for image clustering. In ICCV, 2019. 6, 7 [80] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, 2018. 1, 2 [81] Junyuan Xie, Ross Girshick, and Ali Farhadi. Unsupervised deep embedding for clustering analysis. In ICML, 2016. 1, 6, 7, 9 [82] Yuwen Xiong, Mengye Ren, and Raquel Urtasun. Loco: Local contrastive representation learning. ar Xiv preprint ar Xiv:2008.01342, 2020. 6 [83] Bo Yang, Xiao Fu, Nicholas D Sidiropoulos, and Mingyi Hong. Towards k-means-friendly spaces: Simultaneous deep learning and clustering. In ICML, 2017. 6 [84] Bo Yang, Sen Wang, Andrew Markham, and Niki Trigoni. Robust attentional aggregation of deep feature sets for multi-view 3d reconstruction. International Journal of Computer Vision, 128(1):53 73, 2020. 6 [85] Jianwei Yang, Devi Parikh, and Dhruv Batra. Joint unsupervised learning of deep representations and image clusters. In CVPR, 2016. 6, 7 [86] Linxiao Yang, Ngai-Man Cheung, Jiaying Li, and Jun Fang. Deep clustering by gaussian mixture variational autoencoders with graph embedding. In ICCV, 2019. 1, 6 [87] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov, and Alexander J Smola. Deep sets. In Neur IPS, 2017. 2, 5, 6 [88] Matthew D Zeiler, Dilip Krishnan, Graham W Taylor, and Rob Fergus. Deconvolutional networks. In CVPR, 2010. 7 [89] Dejiao Zhang, Feng Nan, Xiaokai Wei, Shangwen Li, Henghui Zhu, Kathleen Mc Keown, Ramesh Nallapati, Andrew Arnold, and Bing Xiang. Supporting clustering with contrastive learning. In NAACL, 2021. 1, 6 [90] Dejiao Zhang, Yifan Sun, Brian Eriksson, and Laura Balzano. Deep unsupervised clustering using mixture of autoencoders. ar Xiv preprint ar Xiv:1712.07788, 2017. 6 [91] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. Birch: an efficient data clustering method for very large databases. SIGMOD, 1996. 6 [92] Junjie Zhao, Donghuan Lu, Kai Ma, Yu Zhang, and Yefeng Zheng. Deep image clustering with categorystyle representation. In ECCV, 2020. 7 [93] Chengxu Zhuang, Alex Lin Zhai, and Daniel Yamins. Local aggregation for unsupervised learning of visual embeddings. In ICCV, 2019. 6