# Unsupervised Deep Learning via Affinity Diffusion

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Jiabo Huang,1 Qi Dong,1 Shaogang Gong,1 Xiatian Zhu2
1Queen Mary University of London, 2Vision Semantics Limited
{jiabo.huang, q.dong, s.gong}@qmul.ac.uk, eddy.zhuxt@gmail.com

Abstract

Convolutional neural networks (CNNs) have achieved unprecedented success in a variety of computer vision tasks. However, they usually rely on supervised model learning with the need for massive labelled training data, which dramatically limits their usability and deployability in real-world scenarios without any labelling budget. In this work, we introduce a general-purpose unsupervised deep learning approach to deriving discriminative feature representations. It is based on self-discovering semantically consistent groups of unlabelled training samples with the same class concepts through a progressive affinity diffusion process. Extensive experiments on object image classification and clustering show the performance superiority of the proposed method over the state-of-the-art unsupervised learning models on six common image recognition benchmarks: MNIST, SVHN, STL10, CIFAR10, CIFAR100 and ImageNet.

Introduction

Convolutional neural networks (CNNs) trained in a supervised fashion have significantly boosted the state-of-the-art performance in computer vision (Krizhevsky, Sutskever, and Hinton 2012; Girshick 2015; Long, Shelhamer, and Darrell 2015). Moreover, the feature representations of a supervised CNN (e.g. trained for classification on ImageNet (Russakovsky et al. 2015)) generalise to new tasks. Despite such remarkable success, this approach is limited by a number of stringent assumptions. First, supervised model learning requires enormous labelled datasets to be collected manually and exhaustively (Dong, Gong, and Zhu 2018; Dong, Zhu, and Gong 2019). This is not always feasible due to high annotation costs. Besides, the benefits of enlarging labelled datasets may have diminishing returns (Sun et al. 2017). Second, transfer learning becomes less effective when the target tasks differ significantly from the source task (Goodfellow, Bengio, and Courville 2016; Dong, Gong, and Zhu 2017). Since it requires only access to unlabelled data, typically available at scale, unsupervised deep learning provides a conceptually generic and scalable solution to these limitations (Caron et al. 2018).

There have been an increasing number of recent attempts at unsupervised deep learning (Wu et al. 2018; Huang et al. 2019; Xie, Girshick, and Farhadi 2016; Caron et al. 2018; Ji, Henriques, and Vedaldi 2019; Haeusser et al. 2018; Yang, Parikh, and Batra 2016; Yang et al. 2017; Gidaris, Singh, and Komodakis 2018; Zhang, Isola, and Efros 2017; Noroozi and Favaro 2016; Doersch, Gupta, and Efros 2015). One intuitive strategy is joint learning of feature representations and data clustering (Xie, Girshick, and Farhadi 2016; Ji, Henriques, and Vedaldi 2019; Caron et al. 2018; Yang et al. 2017). It aims to automatically discover clustering solutions during training, with each cluster hopefully capturing a specific class concept. This objective is extremely hard due to the numerous combinatorial configurations of unlabelled data alongside highly complex inter-class decision boundaries.
To avoid clustering errors and their subsequent propagation, instance learning was proposed (Wu et al. 2018; Bojanowski and Joulin 2017), whereby every single sample is treated as an independent class. However, this simplified supervision is often rather ambiguous, particularly around class centres, therefore leading to weak class discrimination. As an intermediate representation, tiny neighbourhoods have been leveraged to preserve the advantages of both data clustering and instance learning (Huang et al. 2019). But this method is restricted by the small size of local neighbourhoods. Another representative approach is designing effective pretext tasks with supervision labels produced automatically from unlabelled data (Gidaris, Singh, and Komodakis 2018; Zhang, Isola, and Efros 2017; Noroozi and Favaro 2016; Doersch, Gupta, and Efros 2015). Due to insufficient correlation with the target class supervision, these methods often result in less competitive models.

In this work, we aim to solve the algorithmic limitations of existing unsupervised deep learning methods. To that end, we propose a general-purpose Progressive Affinity Diffusion (PAD) method for training unsupervised models. Requiring no prior knowledge of the class number, PAD performs model-matureness-adaptive data group inference during training to more reliably reveal the underlying sample-to-class memberships. This is achieved by progressively self-discovering strongly connected subgraphs on a neighbourhood affinity graph via faithful affinity diffusion and formulating a group structure aware objective loss function.

The contributions of this work are summarised as follows. (1) We propose a novel idea of leveraging strongly connected subgraphs as a self-supervision structure for more reliable unsupervised deep learning. (2) We formulate a Progressive Affinity Diffusion (PAD) method for model-matureness-adaptive discovery of strongly connected subgraphs during training through affinity diffusion across adjacent neighbourhoods. This strategy maximises the class consistency of self-discovered subgraphs, therefore enhancing the unsupervised model learning capability. PAD is end-to-end trainable. (3) We design a group structure aware objective loss formulation for more discriminative exploitation of strongly connected subgraphs in model representation learning.

Comparative experiments have been extensively conducted on both image classification and image clustering tasks using popular benchmarks: MNIST (LeCun et al. 1998), SVHN (Netzer et al. 2011), STL10 (Coates, Ng, and Lee 2011), CIFAR10 and CIFAR100 (Krizhevsky and Hinton 2009), as well as ImageNet (Russakovsky et al. 2015). The results show that our method outperforms a wide variety of existing state-of-the-art unsupervised learning models, often by large margins.

Related Work

In the literature, most existing unsupervised deep learning methods fall into four groups: (i) joint clustering (Caron et al. 2018; Xie, Girshick, and Farhadi 2016; Yang et al. 2017; Haeusser et al. 2018; Ji, Henriques, and Vedaldi 2019; Huang et al. 2019), (ii) instance learning (Wu et al. 2018; Bojanowski and Joulin 2017), (iii) pretext task design (Gidaris, Singh, and Komodakis 2018; Zhang, Isola, and Efros 2017; Noroozi and Favaro 2016; Wang, He, and Gupta 2017; Kim, Cho, and Kweon 2018; Doersch, Gupta, and Efros 2015; Zhang, Isola, and Efros 2016), and (iv) generative model formulation (Radford, Metz, and Chintala 2016; Donahue, Krähenbühl, and Darrell 2016).
Joint clustering methods aim to integrate the classical idea of data grouping (Aggarwal and Reddy 2013) into the end-to-end optimisation of unsupervised learning models. Often, the self-formed cluster labels are treated as concept annotations, and supervised learning techniques such as the softmax cross-entropy criterion are then adopted for model optimisation. Typical methods require access to the ground-truth class number, which is usually unavailable in realistic scenarios. The training process may involve multiple stages continuously (Haeusser et al. 2018) or iteratively (Caron et al. 2018; Xie, Girshick, and Farhadi 2016). This approach is rather challenging due to the error-prone division of samples, especially when dealing with the complex structures and distributions typically encountered in images and videos. Our PAD method also aims to discover label consistent groups. However, unlike the clustering operation that forces each and every sample into a specific cluster, we take a more robust grouping scheme which associates only strongly similar samples, so that erroneous memberships can be minimised.

Instance learning leaps to the other extreme by regarding every single sample as a unique class (Wu et al. 2018; Bojanowski and Joulin 2017). It is motivated by the observation that supervised learning can naturally encode global manifold information over different classes through end-to-end optimisation. However, by assigning samples from the same class different labels, this approach pushes their features apart, which tends to dilate per-class regions and yields less discriminative class boundaries. This problem is addressed in our method by its ability to search for label consistent groups of samples with the same class concept. Besides, instance learning can be considered a special case of our method with the subgraph size set to one.

Pretext task design has been shown to be a strong alternative for unsupervised deep learning. In general, existing methods mainly differ in the auxiliary supervision formulation, typically hand-crafted, in order to exploit information intrinsic to the training data. Representative examples include spatial context (Gidaris, Singh, and Komodakis 2018; Doersch, Gupta, and Efros 2015; Noroozi and Favaro 2016), spatio-temporal continuity (Wang, He, and Gupta 2017; Kim, Cho, and Kweon 2018) and colour distributions (Zhang, Isola, and Efros 2016; 2017). Conceptually, existing methods present only weak linkages between the auxiliary supervision and class concepts, yielding inferior models. Whilst the potential of this approach is great, how to design more discriminative pretext tasks remains unsolved and lacks sufficient guidance. In design, our method is highly complementary as no pretext task is involved.

Generative model formulation is another important approach in unsupervised learning. Fundamental methods include Restricted Boltzmann Machines (RBMs) (Tang, Salakhutdinov, and Hinton 2012; Lee et al. 2009), auto-encoders (Vincent et al. 2010; Ng 2011) and Generative Adversarial Networks (GANs) (Radford, Metz, and Chintala 2016; Donahue, Krähenbühl, and Darrell 2016). Compared to discriminative models like ours, generative methods generally present weaker discrimination power due to the lack of class discriminative learning. However, it is also this difference that may make their representations potentially highly complementary.
Progressive Affinity Diffusion

In unsupervised learning, we have access to a set of N unlabelled training image samples I = {I1, I2, ..., IN}. We aim to train a CNN model fθ for extracting class discriminative features x, i.e. fθ : I → x, where θ denotes the model parameters. The key in unsupervised learning is to bridge the semantic gap between low-level imagery appearance (e.g. pixel intensity) and high-level semantic concepts (e.g. object class). It is intrinsically challenging due to large intra-class variety and inter-class similarity, in addition to unknown variations (e.g. background, pose, illumination) and high-dimensional data representations. Besides, variation factors are often confounded arbitrarily (Fig 2(b)), further increasing the modelling difficulty drastically. This complexity exhibited in typical image collections implies the necessity of inferring highly non-linear class decision boundaries in unsupervised learning.

Figure 1: Overview of our Progressive Affinity Diffusion (PAD) method. (a) Affinity graph construction by k-nearest neighbourhoods. (b) Affinity diffusion across neighbourhoods to discover label consistent strongly connected subgraphs. (c) Progressive model update with self-discovered subgraphs, leading to (d) improved feature representations. The model is trained iteratively.

To overcome these challenges, we formulate Progressive Affinity Diffusion (PAD). PAD aims to discover sample groups, each representing one of the underlying semantic concepts, as self-supervision, hence facilitating class boundary inference. To avoid clustering errors, it is particularly designed to exploit individual pairwise relationships with capricious variations in a purely data-driven manner.

Approach overview. An overview of our PAD method is depicted in Fig 1. Specifically, PAD is an iterative unsupervised model learning process including three components: (1) Affinity graph construction for representing the global structure of the training data; (2) Affinity diffusion across neighbourhoods for self-discovering groups of samples with the same semantics; (3) Progressive model update by formulating a group structure aware objective loss function. They are integrated into a multi-stage procedure. In each stage, rather than clustering all the samples, the model mines only the reliable data groups that have emerged thus far in the affinity graph (i.e. model-matureness-adaptive), which then feed into the subsequent model training stages sequentially. We describe the model training details at the end of this section.

Affinity Graph Construction

Progressive affinity diffusion is carried out on graphs G of unlabelled training samples. To construct the graph, we leverage per-sample k-nearest neighbourhoods Nk defined as:

Nk(x) = {xi | S(xi, x) is ranked top-k among X}   (1)

where X denotes the feature set of all training samples extracted by the up-to-date model and S represents the cosine similarity function. The neighbourhoods represent a kind of data grouping (Huang et al. 2019). However, without class label supervision, such structures convey either limited (small k) or noisy (large k) affinity information between training samples/vertices (Fig 2(a)). Using Nk directly for self-supervision is therefore restricted.
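Concretely, the k-NN affinity graph of Eq (1) can be built from L2-normalised features so that dot products equal cosine similarities. The sketch below is a minimal illustration, not the authors' released code; the function name and the dense similarity matrix (rather than chunked computation against the memory bank) are simplifying assumptions.

```python
import torch

def build_affinity_graph(features: torch.Tensor, k: int = 5):
    """Construct the directed k-NN affinity graph of Eq (1).

    features: (N, D) L2-normalised feature vectors, so the dot product
    equals the cosine similarity S(., .).
    Returns, for every sample, the indices and similarities of its
    top-k neighbours (self-matches excluded), i.e. the edges i -> j.
    """
    sims = features @ features.t()              # (N, N) pairwise cosine similarities
    sims.fill_diagonal_(-float("inf"))          # a sample is not its own neighbour
    topk_sims, topk_idx = sims.topk(k, dim=1)   # top-k per row (Eq (1))
    return topk_idx, topk_sims

# Usage (toy scale; for large N the similarities would be computed in
# chunks against the memory bank rather than as a full N x N matrix):
# feats = torch.nn.functional.normalize(torch.randn(1000, 128), dim=1)
# neighbours, affinities = build_affinity_graph(feats, k=5)
```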
Affinity Diffusion across Neighbourhoods

To address the aforementioned problem, we propose affinity diffusion across neighbourhoods. This aims to model the correlation of different Nk in order to break through their barriers and spread the class identity of one sample across them. Conceptually, this leverages the global data manifold (Belkin, Niyogi, and Sindhwani 2006) formed collectively by all neighbourhoods. As each neighbourhood may encode visual affinity information from a distinct perspective, linking them is equivalent to joining different types of variations of a concept, therefore enabling model learning to capture more comprehensive concept boundaries.

Figure 2: (a) k-nearest neighbours vs. (b) strongly connected subgraph. Blue box: the anchor. Green box: same class as the anchor. Red box: a different class from the anchor.

Figure 3: Illustration of searching strongly connected subgraphs in a neighbourhood affinity graph. In this example, the scale threshold s is set to 3.

For discriminative diffusion, it is critical that error introduction and propagation are minimised. To this end, motivated by graph theory (Duda, Hart, and Stork 2012; Raghavan and Yu 1981), we propose to search strongly connected subgraphs (SCS) of the affinity graph G for revealing underlying semantic concepts. An SCS structure is defined as a set of vertices (images) in which each vertex can be reached from any other through neighbouring edges. This means that all vertices of an SCS are highly similar w.r.t. some variation criteria.

SCS structures are formed as shown in Fig 3. Starting with a random vertex xs, we construct a tree-like structure based on the edges. Then, we traverse the vertices along edges in a depth-first search strategy. We give nearest neighbours the highest priority so as to maximise the average pairwise affinity of an SCS; depth-first search can easily satisfy this condition. Each vertex is constrained to be visited only once to avoid repeated traversal. To ensure reachability between any two vertices, we enforce a cyclic constraint (Fig 2(b)). In the tree-like structure, one cycle that loops back to the start vertex xs is necessary, whilst other cycles may exclude xs as long as they partly overlap with the former. In the search process, we aim to find disjoint SCS structures. Hence, once an SCS is found, we remove all of its vertices from G to simplify and accelerate the subsequent search. This process repeats until no SCS exists. Note that at the end, a number of isolated vertices are likely left (outside any SCS).

With the above search method, however, we find that the resulting SCS tends to be overly large even when the graph is not dense. This ends up mixing samples of different class concepts. To address this issue, we further impose an operational size (scale) threshold s on SCS. This simple constraint works in our case because we are not seeking a complete group of samples per class, which can be extremely challenging and risky, as in clustering methods, due to complex observation variations. Even when a single class is split into multiple SCS structures, we can still obtain a large amount of intra-class variation information, whereas the risk of class mixture is significantly reduced.
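As a rough illustration of this step, the sketch below extracts strongly connected components of the directed k-NN graph with Tarjan's algorithm (Tarjan 1972, the framework the paper builds on) and applies the scale threshold s. It is a simplification of the authors' procedure: their search prioritises nearest neighbours and grows subgraphs under the constraint, whereas here oversized components are simply discarded, and the recursive form would need an iterative rewrite for very large graphs.

```python
def strongly_connected_subgraphs(neighbours, s=10):
    """Simplified SCS discovery via Tarjan's strongly connected components.

    neighbours: list of lists; neighbours[i] holds the k nearest
    neighbours of sample i (directed edges i -> j from Eq (1)).
    Returns components with 2..s members; singletons stay isolated and
    oversized components are dropped (a simplification of the paper's
    scale-constrained search).
    """
    n = len(neighbours)
    index, lowlink = {}, {}
    on_stack = [False] * n
    stack, subgraphs, counter = [], [], [0]

    def visit(v):
        index[v] = lowlink[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack[v] = True
        for w in neighbours[v]:
            if w not in index:
                visit(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif on_stack[w]:
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:          # v roots a strongly connected component
            comp = []
            while True:
                w = stack.pop()
                on_stack[w] = False
                comp.append(w)
                if w == v:
                    break
            if 2 <= len(comp) <= s:         # scale constraint on SCS size
                subgraphs.append(comp)

    for v in range(n):
        if v not in index:
            visit(v)
    return subgraphs
```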
Progressive Model Updating

Next, we describe how the self-discovered SCS structures are used for progressive model update. For formulation ease, we treat individual samples as special SCS structures, each with one sample. We propose to further convert affinity measurements into probability distributions, so that maximum likelihood based learning objective functions (Goodfellow, Bengio, and Courville 2016) can be adopted. Inspired by (Huang et al. 2019; Wu et al. 2018), we define the probability that any two samples xi and xj are drawn from the same class as:

$$p_{i,j} = \frac{\exp(\mathbf{x}_i^\top \mathbf{x}_j / \tau)}{\sum_{k=1}^{N} \exp(\mathbf{x}_i^\top \mathbf{x}_k / \tau)} \quad (2)$$

where τ is a temperature parameter controlling the concentration degree of the distribution (Hinton, Vinyals, and Dean 2014). This quantity is naturally compatible with SCS formation, as both rely on pairwise affinity.

To reinforce the SCS structural information in model learning, we maximise the same-class probability of samples per SCS. We therefore formulate a group (subgraph) structure aware objective function as:

$$\mathcal{L}_{\mathrm{scs}} = -\frac{1}{n_{bs}} \sum_{i=1}^{n_{bs}} \log \sum_{j \in \mathcal{C}(\mathbf{x}_i)} p_{i,j} \quad (3)$$

where $n_{bs}$ specifies the mini-batch size and $\mathcal{C}(\mathbf{x}_i)$ the samples of the SCS structure including xi. This encourages affinity maximisation of samples within an SCS. For isolated samples, $\mathcal{C}(\mathbf{x}_i) = \{\mathbf{x}_i\}$ and $\mathcal{L}_{\mathrm{scs}}$ reduces to the instance loss function (Wu et al. 2018). Eq (3) is also analogous to Neighbourhood Component Analysis (NCA) (Goldberger et al. 2005) when $\mathcal{C}(\mathbf{x}_i)$ is replaced with labelled sets.

With the summation $\sum_{j \in \mathcal{C}(\mathbf{x}_i)} p_{i,j}$, one possible weakness of Eq (3) is that less similar neighbours can be overwhelmed due to their overly small contributions. To calibrate their importance, we introduce a hard positive enhancement strategy. Given an SCS, for a sample xi we define its hard positive xhp as the member with the smallest affinity. For an isolated sample, we use a randomly transformed variant as its hard positive. For calibration, we minimise the Kullback-Leibler (KL) divergence between the model predictions of xi and xhp as:

$$\mathcal{L}_{\mathrm{hpe}} = \frac{1}{n_{bs}} \sum_{i=1}^{n_{bs}} \sum_{j=1}^{N} p_{i,j} \log \frac{p_{i,j}}{p_{hp,j}} \quad (4)$$

where $p_{hp,j}$ denotes the prediction (Eq (2)) of the hard positive xhp of xi. We obtain the final loss function of our model by weighted summation:

$$\mathcal{L}_{\mathrm{pad}} = \mathcal{L}_{\mathrm{scs}} + \lambda \mathcal{L}_{\mathrm{hpe}} \quad (5)$$

where the weight λ modulates the balance between the two terms.

Model Training

PAD starts with a randomly initialised CNN. For efficiency, we update the affinity graph and SCS once per epoch, and use a memory to keep track of the per-sample representations required by the loss computation (Eq (2)). The memory is updated for mini-batch samples by an exponential moving average (Lucas and Saccucci 1990):

$$\bar{\mathbf{x}}_i \leftarrow (1-\eta)\,\bar{\mathbf{x}}_i + \eta\,\mathbf{x}_i \quad (6)$$

where η denotes the update momentum, and $\mathbf{x}_i$ and $\bar{\mathbf{x}}_i$ the up-to-date and memory feature vectors respectively. This summarises the historical knowledge learned in past training iterations, alleviating the incomplete memory limitation of neural networks (Weston, Chopra, and Bordes 2014). The whole model training procedure is summarised in Algorithm 1.

Algorithm 1: Unsupervised deep learning via progressive affinity diffusion.
Input: Training data I, training epochs Nep, iterations per epoch Nit.
Output: A class discriminative CNN feature representation model.
for epoch = 1 to Nep do
    Construct the kNN based affinity graph (Eq (1));
    Search strongly connected subgraphs (Fig 3);
    for iter = 1 to Nit do
        Mini-batch feed-forward through the network;
        Objective loss computation (Eq (5));
        Network back-propagation and model weights update;
        Memory feature refreshing (Eq (6)).
    end for
end for
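The following PyTorch-style sketch puts Eqs (2)-(6) together for one mini-batch. It is our reading of the text rather than the released implementation: the tensor names, the treatment of Eq (4) as KL(p_i || p_hp) over softmax predictions against the whole memory bank, and the re-normalisation of memory features after the update in Eq (6) are assumptions.

```python
import torch
import torch.nn.functional as F

def pad_step(feat, hp_feat, idx, memory, scs_members, tau=0.1, lam=0.8, eta=0.5):
    """One PAD loss computation and memory update for a mini-batch.

    feat:        (B, D) L2-normalised features of the mini-batch samples.
    hp_feat:     (B, D) features of each sample's hard positive x_hp.
    idx:         (B,)   dataset indices of the mini-batch samples.
    memory:      (N, D) memory bank of per-sample features (no gradient).
    scs_members: list of B index tensors; scs_members[b] = C(x_b).
    """
    # Eq (2): same-class probabilities of each sample against all N samples.
    p = F.softmax(feat @ memory.t() / tau, dim=1)        # (B, N)
    p_hp = F.softmax(hp_feat @ memory.t() / tau, dim=1)  # (B, N)

    # Eq (3): maximise the total same-class probability within each SCS.
    l_scs = -torch.stack(
        [p[b, scs_members[b]].sum().clamp_min(1e-12).log() for b in range(feat.size(0))]
    ).mean()

    # Eq (4): hard positive enhancement as KL(p_i || p_hp).
    l_hpe = (p * (p.clamp_min(1e-12).log() - p_hp.clamp_min(1e-12).log())).sum(1).mean()

    # Eq (6): exponential moving average update of the memory bank
    # (re-normalisation keeps cosine similarities valid; an assumption here).
    with torch.no_grad():
        memory[idx] = F.normalize((1 - eta) * memory[idx] + eta * feat, dim=1)

    # Eq (5): weighted combination of the two terms.
    return l_scs + lam * l_hpe
```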
Experiments

Datasets. CIFAR10(/100): a natural image dataset containing 50,000/10,000 train/test images from 10 (/100) object classes. ImageNet: a large-scale object dataset with 1,000 classes, 1.2 million training images and 50,000 test images. SVHN: a Street View House Numbers dataset including 10 classes of digit images. STL10: an ImageNet-adapted dataset containing 500/800 train/test images per class from 10 classes, as well as 100,000 unlabelled images from auxiliary unknown classes. MNIST: a hand-written digits dataset with 60,000/10,000 train/test images from 10 digit classes.

Evaluation protocol. We considered two test protocols for unsupervised learning of discriminative representations. (1) Image classification (Caron et al. 2018; Wu et al. 2018), where the ground-truth class labels of the training images are used after unsupervised model learning to enable image categorisation. We tested two classification models: a Linear Classifier (LC) on conv5 features and a weighted kNN classifier on FC features. LC was realised by a fully connected layer. The kNN classifier is based on weighted voting over the top-k neighbours Nk as $s_c = \sum_{i \in \mathcal{N}_k} \delta(c, c_i)\, w_i$, where $\delta(c, c_i)$ is a Dirac function returning 1 if label $c = c_i$ and 0 otherwise. We computed the weight $w_i$ as $w_i = \exp(s_i/\tau)$ with τ = 0.1 the temperature parameter and $s_i$ the cosine similarity. Without an extra classifier learning post-process, kNN directly reflects the discriminative capability of the learned feature representations. (2) Image clustering (Ji, Henriques, and Vedaldi 2019; Xie, Girshick, and Farhadi 2016), where k-means is applied, if needed, to cluster the samples represented by any unsupervised model into the ground-truth number of clusters. To measure performance, we adopted the clustering accuracy (Xie, Girshick, and Farhadi 2016), which measures the proportion of samples correctly grouped. Note that in the standard clustering setting both the training and test datasets are utilised for model learning, unlike the standard classification setting where only the training dataset is used.

Implementation details. We used AlexNet as the backbone (Krizhevsky, Sutskever, and Hinton 2012). We adopted SGD with Nesterov momentum of 0.9, 200 training epochs, and an initial learning rate of 0.03 decayed by 0.1 every 40 epochs after the first 80. We set k = 5 (Eq (1)) for graph construction and the maximum SCS size s to 10. We set the weight λ = 0.8 (Eq (5)) and the memory update rate η = 0.5 (Eq (6)). Data augmentation includes horizontal flipping, cropping and colour jittering. In practice, the SCS search is conducted on memory features for efficiency, and the memory bank takes around 600MB for ImageNet. Implemented in the Tarjan framework (Tarjan 1972), the worst-case time complexity of our SCS search is O(N^2). All experiments were run on a Tesla P100 GPU.

Hyper-parameter selection. All parameters were tuned on CIFAR10 and applied to all the other datasets; there is no per-dataset tuning. In particular, since no validation set is available for cross-validation, we used common parameter settings: (1) we set k (Eq (1)) and s (maximum SCS size) so that there are neither too many isolated samples nor overly large SCS units, for error control; we set λ (the weight of hard positive enhancement) so that neither loss dominates; and we set η (the memory update rate) following (Wu et al. 2018; Huang et al. 2019). (2) For the SGD optimiser, we set the parameters so that the training loss drops properly. This tuning process is easy to conduct in practice. Instead of exhaustively tuning the parameters, we used a single setting for all the experiments to test its scalability and generalisation.
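The weighted kNN protocol above ($s_c = \sum_{i \in \mathcal{N}_k} \delta(c, c_i) w_i$ with $w_i = \exp(s_i/\tau)$) can be sketched as follows. This is an illustrative implementation only; the number of voting neighbours is left as a parameter since its value is not restated here.

```python
import torch

def weighted_knn_predict(query, train_feat, train_labels, num_classes, k=200, tau=0.1):
    """Weighted kNN classification over L2-normalised features.

    query: (Q, D), train_feat: (N, D), train_labels: (N,) int64 class ids.
    Each of the top-k training neighbours votes for its label c_i with
    weight w_i = exp(s_i / tau), s_i being the cosine similarity.
    """
    sims = query @ train_feat.t()               # (Q, N) cosine similarities
    topk_sims, topk_idx = sims.topk(k, dim=1)   # top-k neighbours per query
    weights = (topk_sims / tau).exp()           # w_i = exp(s_i / tau)
    votes = torch.zeros(query.size(0), num_classes, device=query.device)
    # Accumulate s_c = sum_i delta(c, c_i) * w_i for every class c.
    votes.scatter_add_(1, train_labels[topk_idx], weights)
    return votes.argmax(dim=1)
```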
Image Classification

In Table 1, we compared our PAD method with representative works on clustering analysis (DeepCluster (Caron et al. 2018), AND (Huang et al. 2019)), self-supervised learning (RotNet (Gidaris, Singh, and Komodakis 2018)), and sample specificity learning (Instance (Wu et al. 2018)) on four benchmarks. We make three observations: (1) PAD surpasses all competitors under either classification model, often by a large margin. This suggests the performance superiority of our method, thanks to its strong capability of discovering underlying semantic boundaries. (2) Compared with the kNN classifier, the linear classifier tends to yield better results due to its extra parameters. This is particularly so for the pretext task based model RotNet, because the pretext task is less relevant to classification, leading to weaker representations compared to grouping based methods such as AND and PAD. (3) As an intermediate representation between clusters and instances, tiny neighbourhoods are exploited in AND for revealing class boundaries and improve the results in most cases. However, this method is restricted by the small size of neighbourhoods. PAD addresses this limitation by affinity diffusion across adjacent neighbourhoods.

Table 1: Image classification results of unsupervised learning models (accuracy, %).

Weighted kNN classifier on FC features:

| Method | CIFAR10 | CIFAR100 | SVHN | ImageNet |
|---|---|---|---|---|
| Random | 34.5 | 12.1 | 56.8 | 3.5 |
| DeepCluster | 62.3 | 22.7 | 84.9 | 26.8 |
| Instance | 60.3 | 32.7 | 79.8 | 31.3 |
| RotNet | 72.5 | 32.1 | 77.5 | 9.2 |
| AND | 74.8 | 41.5 | 90.9 | 31.3 |
| PAD | 81.5 | 48.7 | 91.2 | 35.1 |

Linear Classifier on conv5 features:

| Method | CIFAR10 | CIFAR100 | SVHN | ImageNet |
|---|---|---|---|---|
| Random | 67.3 | 32.7 | 79.2 | 14.1 |
| DeepCluster | 77.9 | 41.9 | 92.0 | 38.2 |
| Instance | 70.1 | 39.4 | 89.3 | 35.6 |
| RotNet | 84.1 | 57.4 | 92.3 | 36.5 |
| AND | 77.6 | 47.9 | 93.7 | 37.8 |
| PAD | 84.7 | 58.6 | 93.2 | 38.6 |

We also examined the training dynamics of SCS size and precision. Fig 4 shows that PAD starts by finding small SCS structures due to weak representation power, then explores larger ones at decreasing precision, and finally converges in size whilst increasing precision before both level off. High SCS precision is key to enabling more discriminative unsupervised learning by PAD.

Figure 4: Statistics of SCS dynamics during training on CIFAR10: average size and precision over training epochs.

Image Clustering

Apart from sample-wise image classification, we further tested the performance of our model on image clustering, which reflects the representation quality in describing global data structures. We compared PAD with two groups of alternative methods: (1) clustering methods: JULE (Yang, Parikh, and Batra 2016), DEC (Xie, Girshick, and Farhadi 2016), DAC (Chang et al. 2017), ADC (Haeusser et al. 2018) and IIC (Ji, Henriques, and Vedaldi 2019); (2) generic representation learning methods: Triplets (Schultz and Joachims 2004), AE (Bengio et al. 2007), Sparse AE (Ng 2011), Denoising AE (Vincent et al. 2010), Variational Bayes AE (Kingma and Welling 2013), SWWAE (Zhao et al. 2015), DCGAN (Radford, Metz, and Chintala 2016) and DeepCluster (Caron et al. 2018). For the latter group, including PAD, we further applied k-means to generate their clustering solutions. For PAD, we reported the average result over 10 runs. For competitors, we used the results from (Ji, Henriques, and Vedaldi 2019). Despite different modelling purposes, we performed both within-group and cross-group comparisons.
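For this second group, clustering solutions come from running k-means on the learned features and are scored by clustering accuracy, i.e. the best one-to-one matching between predicted clusters and ground-truth classes. A possible realisation with scikit-learn and the Hungarian assignment is sketched below; the paper does not detail its own implementation, so this is only illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.optimize import linear_sum_assignment

def kmeans_clustering_accuracy(features, labels, num_classes):
    """Cluster learned features with k-means and compute clustering accuracy:
    the fraction of samples correctly grouped under the best one-to-one
    mapping between clusters and ground-truth classes."""
    preds = KMeans(n_clusters=num_classes, n_init=10).fit_predict(features)
    # Contingency matrix between predicted clusters and ground-truth classes.
    contingency = np.zeros((num_classes, num_classes), dtype=np.int64)
    for p, y in zip(preds, labels):
        contingency[p, y] += 1
    # The Hungarian algorithm maximises the number of matched samples.
    rows, cols = linear_sum_assignment(contingency.max() - contingency)
    return contingency[rows, cols].sum() / len(labels)
```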
We made a couple of observations from Table 2: (1) The first group of methods tends to produce higher clustering results, thanks to their joint learning of feature representations and clustering using the ground-truth class number as a prior in end-to-end model training, i.e. consistent training and test objectives. Among them, IIC achieves the best results. (2) Without taking clustering as its objective, the second group of methods is relatively inferior in modelling data group structures. However, PAD again reaches the best performance in this group consistently. Crucially, our model is on par with all the dedicated clustering methods and even surpasses them on CIFAR10/100 by significant margins, despite the disadvantage on STL10 which, we conjecture, is due to distracting effects from the auxiliary unknown categories. This indicates the efficacy of our unsupervised learning method in capturing the holistic data distribution. We attribute this advantage to the favourable ability of our method to seek the latent class consistent groups with high variations of individual concepts.

Table 2: Comparing image clustering results of unsupervised learning methods. Methods in the second group (including PAD) used k-means. The results of existing methods are adopted from (Ji, Henriques, and Vedaldi 2019).

| Method | MNIST | STL10 | CIFAR10 | CIFAR100 |
|---|---|---|---|---|
| JULE | 96.4 | 27.7 | 27.2 | 13.7 |
| DEC | 84.3 | 35.9 | 30.1 | 18.5 |
| DAC | 97.8 | 47.0 | 52.2 | 23.8 |
| ADC | 99.2 | 53.0 | 32.5 | 16.0 |
| IIC | 98.4 | 59.8 | 57.6 | 25.5 |
| Random CNN | 48.1 | 20.1 | 18.6 | 10.3 |
| Triplets | 52.5 | 24.4 | 20.5 | 9.9 |
| AE | 81.2 | 30.3 | 31.4 | 16.5 |
| Sparse AE | 82.7 | 32.0 | 29.7 | 15.7 |
| Denoising AE | 83.2 | 30.2 | 29.7 | 15.1 |
| Variational Bayes AE | 83.2 | 28.2 | 29.1 | 15.2 |
| SWWAE | 82.5 | 27.0 | 28.4 | 14.7 |
| DCGAN | 82.8 | 29.8 | 31.5 | 15.1 |
| DeepCluster | 65.6 | 33.4 | 37.4 | 18.9 |
| PAD (Ours) | 98.2 | 46.5 | 62.6 | 28.8 |

Component Evaluations and Further Analysis

To provide insights into our model, we conducted a sequence of detailed component evaluations and performance analyses on the image classification task with kNN as the classifier.

Effect of affinity diffusion. To investigate the effect of affinity diffusion, we tested models in which C(x) for each sample was replaced by its k-nearest neighbours Nk(x), as shown in Fig 2(a), with the size k set to 10. According to Table 3, discovering consistent sample groups by affinity diffusion across adjacent neighbourhoods with the necessary constraints clearly benefits the discriminative learning of models in all cases.

Table 3: Effect of affinity diffusion (kNN accuracy, %).

| Diffusion | CIFAR10 | CIFAR100 | SVHN |
|---|---|---|---|
| w/o | 77.5 | 34.5 | 89.5 |
| w/ | 81.5 | 48.7 | 91.2 |

Cyclic and scale constraints. We examined the effect of the cyclic and scale constraints and observed from Table 4 that: (1) The cyclic constraint brings consistent performance gains, particularly on the most challenging CIFAR100 test. This is because of the subtle visual discrepancies between fine-grained classes, which lead to more wrong associations in affinity diffusion. (2) The scale constraint is clearly necessary for ensuring the effectiveness of our model in all cases. Without it, different classes would be mixed up in diffusion due to the complex visual patterns exhibited in images.

Table 4: Effect of the cyclic (top) and scale (bottom) constraints in forming SCS structures (kNN accuracy, %).

| Cyclic | CIFAR10 | CIFAR100 | SVHN |
|---|---|---|---|
| w/o | 73.3 | 30.6 | 91.1 |
| w/ | 81.5 | 48.7 | 91.2 |

| Scale | CIFAR10 | CIFAR100 | SVHN |
|---|---|---|---|
| w/o | 20.2 | 1.8 | 20.3 |
| w/ | 81.5 | 48.7 | 91.2 |

To provide a visual interpretation, we conducted a case study of affinity diffusion on STL10.
Fig 5 shows that when the model is immature, wrong cross-class diffusion may happen frequently; the cyclic constraint can help detect this and stop error accumulation early. Besides, it shows that SCS can better capture semantic similarity beyond pairwise affinity measurements (see the dashed curve). However, as expected, not all hard positive pairs are discovered, due to extreme viewing condition discrepancies, as shown in Fig 5(c).

Figure 5: Case study on STL10. (a) vs. (b): the cyclic constraint helps detect erroneous affinity diffusion. (c) Extremely hard positive pairs undiscovered by PAD. Red solid box: a different class from the anchor; green solid box: the same class as the anchor. The numbers above the arrows are the corresponding pairwise affinity scores.

Hard positive enhancement. Table 5 (top) shows that constraining the predictions between hard positive pairs is clearly beneficial for discriminative model learning. This confirms the overwhelming effect among within-SCS samples when using normalised affinity measurements (Eq (2)) in the loss function (Eq (3)), and suggests the efficacy of our enhancement strategy. We also compared two loss designs: feature cosine similarity (FCS) vs. Kullback-Leibler (KL) divergence. Table 5 (bottom) suggests the superiority of KL over FCS. A plausible explanation is that KL integrates with the SCS loss Lscs (Eq (3)) in a more harmonious manner, as both are based on class posterior probability measurements.

Table 5: Effect of hard positive enhancement (HPE) (top) and comparison of the HPE loss designs, feature cosine similarity (FCS) vs. Kullback-Leibler (KL) divergence (bottom).

| HPE | CIFAR10 | CIFAR100 | SVHN |
|---|---|---|---|
| w/o | 69.8 | 30.9 | 80.9 |
| w/ | 81.5 | 48.7 | 91.2 |

| Design | CIFAR10 | CIFAR100 | SVHN |
|---|---|---|---|
| FCS | 72.7 | 39.1 | 90.4 |
| KL | 81.5 | 48.7 | 91.2 |

Parameter analysis. We evaluated three parameters of PAD on CIFAR10: (1) the affinity graph density k, (2) the SCS scale threshold s, and (3) the weight λ of the hard positive enhancement loss. Table 6 shows that the model is insensitive to these parameters within a wide range of values, indicating training robustness.

Table 6: Model parameter analysis on CIFAR10. Top: affinity graph sparsity k; middle: SCS scale threshold s; bottom: weight λ of the hard positive enhancement (HPE) loss.

| k | 1 | 3 | 5 | 10 |
|---|---|---|---|---|
| Accuracy | 78.3 | 79.3 | 81.5 | 79.7 |

| s | 5 | 10 | 50 | 100 |
|---|---|---|---|---|
| Accuracy | 80.2 | 81.5 | 80.7 | 73.5 |

| λ | 0.2 | 0.5 | 0.8 | 1.0 |
|---|---|---|---|---|
| Accuracy | 78.4 | 78.5 | 81.5 | 78.9 |

Computation cost of SCS searching. To validate the complexity of SCS searching, we measured the search time on ImageNet: 4 minutes per epoch, i.e. 800 minutes out of the overall 6-day training.

Conclusion

In this work, we presented a novel Progressive Affinity Diffusion (PAD) method for discriminative unsupervised deep learning. It is achieved by self-discovering class consistent strongly connected subgraphs in neighbourhood affinity graphs and formulating a group structure aware objective loss function. This model can be trained end-to-end in a progressive multi-stage manner. Critically, PAD overcomes the notorious error propagation of clustering and the small-locality limitation of neighbourhoods, whilst preserving and integrating their intrinsic strengths for more effective discriminative learning. Extensive experiments on image classification and image clustering tasks validate the superiority of PAD over a wide spectrum of state-of-the-art unsupervised deep learning methods.
Acknowledgements

This work was supported by the China Scholarship Council, Vision Semantics Limited, the Alan Turing Institute Turing Fellowship, and the Innovate UK Industrial Challenge Project on Developing and Commercialising Intelligent Video Analytics Solutions for Public Safety (98111571149).

References

Aggarwal, C. C., and Reddy, C. K. 2013. Data clustering: algorithms and applications. CRC Press.
Belkin, M.; Niyogi, P.; and Sindhwani, V. 2006. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. JMLR 7(Nov):2399-2434.
Bengio, Y.; Lamblin, P.; Popovici, D.; and Larochelle, H. 2007. Greedy layer-wise training of deep networks. In Proc. NeurIPS, 153-160.
Bojanowski, P., and Joulin, A. 2017. Unsupervised learning by predicting noise. In Proc. ICML, 1-10.
Caron, M.; Bojanowski, P.; Joulin, A.; and Douze, M. 2018. Deep clustering for unsupervised learning of visual features. In Proc. ECCV, 1-18.
Chang, J.; Wang, L.; Meng, G.; Xiang, S.; and Pan, C. 2017. Deep adaptive image clustering. In Proc. ICCV, 5879-5887.
Coates, A.; Ng, A.; and Lee, H. 2011. An analysis of single-layer networks in unsupervised feature learning. In Proc. AISTATS, 215-223.
Doersch, C.; Gupta, A.; and Efros, A. A. 2015. Unsupervised visual representation learning by context prediction. In Proc. ICCV, 1422-1430.
Donahue, J.; Krähenbühl, P.; and Darrell, T. 2016. Adversarial feature learning. In Proc. ICLR, 1-18.
Dong, Q.; Gong, S.; and Zhu, X. 2017. Multi-task curriculum transfer deep learning of clothing attributes. In Proc. WACV.
Dong, Q.; Gong, S.; and Zhu, X. 2018. Imbalanced deep learning by minority class incremental rectification. TPAMI.
Dong, Q.; Zhu, X.; and Gong, S. 2019. Single-label multi-class image classification by deep logistic regression. In Proc. AAAI.
Duda, R. O.; Hart, P. E.; and Stork, D. G. 2012. Pattern classification. John Wiley & Sons.
Gidaris, S.; Singh, P.; and Komodakis, N. 2018. Unsupervised representation learning by predicting image rotations. In Proc. ICLR, 1-16.
Girshick, R. 2015. Fast R-CNN. In Proc. ICCV, 1440-1448.
Goldberger, J.; Hinton, G. E.; Roweis, S. T.; and Salakhutdinov, R. R. 2005. Neighbourhood components analysis. In Proc. NeurIPS, 513-520.
Goodfellow, I.; Bengio, Y.; and Courville, A. 2016. Deep learning. MIT Press.
Haeusser, P.; Plapp, J.; Golkov, V.; Aljalbout, E.; and Cremers, D. 2018. Associative deep clustering: Training a classification network with no labels. In Proc. GCPR, 18-32. Springer.
Hinton, G.; Vinyals, O.; and Dean, J. 2014. Distilling the knowledge in a neural network. In Proc. NeurIPS.
Huang, J.; Dong, Q.; Gong, S.; and Zhu, X. 2019. Unsupervised deep learning by neighbourhood discovery. In Proc. ICML.
Ji, X.; Henriques, J. F.; and Vedaldi, A. 2019. Invariant information distillation for unsupervised image segmentation and clustering. In Proc. ICCV, 1-10.
Kim, D.; Cho, D.; and Kweon, I. S. 2018. Self-supervised video representation learning with space-time cubic puzzles. In Proc. AAAI.
Kingma, D. P., and Welling, M. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
Krizhevsky, A., and Hinton, G. 2009. Learning multiple layers of features from tiny images. Technical report, Citeseer.
Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In Proc. NeurIPS, 1097-1105.
LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P.; et al. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278-2324.
Lee, H.; Grosse, R.; Ranganath, R.; and Ng, A. Y. 2009. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proc. ICML, 609-616.
Long, J.; Shelhamer, E.; and Darrell, T. 2015. Fully convolutional networks for semantic segmentation. In Proc. CVPR, 3431-3440.
Lucas, J. M., and Saccucci, M. S. 1990. Exponentially weighted moving average control schemes: properties and enhancements. Technometrics 32(1):1-12.
Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; and Ng, A. Y. 2011. Reading digits in natural images with unsupervised feature learning. In NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning.
Ng, A. 2011. Sparse autoencoder, 1-19.
Noroozi, M., and Favaro, P. 2016. Unsupervised learning of visual representations by solving jigsaw puzzles. In Proc. ECCV, 69-84.
Radford, A.; Metz, L.; and Chintala, S. 2016. Unsupervised representation learning with deep convolutional generative adversarial networks. In Proc. ICLR.
Raghavan, V. V., and Yu, C. 1981. A comparison of the stability characteristics of some graph theoretic clustering methods. TPAMI (4):393-402.
Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. 2015. ImageNet large scale visual recognition challenge. IJCV 115(3):211-252.
Schultz, M., and Joachims, T. 2004. Learning a distance metric from relative comparisons. In Proc. NeurIPS, 41-48.
Sun, C.; Shrivastava, A.; Singh, S.; and Gupta, A. 2017. Revisiting unreasonable effectiveness of data in deep learning era. In Proc. ICCV, 843-852.
Tang, Y.; Salakhutdinov, R.; and Hinton, G. 2012. Robust Boltzmann machines for recognition and denoising. In Proc. CVPR.
Tarjan, R. 1972. Depth-first search and linear graph algorithms. SIAM Journal on Computing 1(2):146-160.
Vincent, P.; Larochelle, H.; Lajoie, I.; Bengio, Y.; and Manzagol, P.-A. 2010. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. JMLR 11(Dec):3371-3408.
Wang, X.; He, K.; and Gupta, A. 2017. Transitive invariance for self-supervised visual representation learning. In Proc. ICCV, 1329-1338.
Weston, J.; Chopra, S.; and Bordes, A. 2014. Memory networks. In Proc. ICLR.
Wu, Z.; Xiong, Y.; Yu, S. X.; and Lin, D. 2018. Unsupervised feature learning via non-parametric instance discrimination. In Proc. CVPR.
Xie, J.; Girshick, R.; and Farhadi, A. 2016. Unsupervised deep embedding for clustering analysis. In Proc. ICML, 478-487.
Yang, B.; Fu, X.; Sidiropoulos, N. D.; and Hong, M. 2017. Towards k-means-friendly spaces: Simultaneous deep learning and clustering. In Proc. ICML, 1-14.
Yang, J.; Parikh, D.; and Batra, D. 2016. Joint unsupervised learning of deep representations and image clusters. In Proc. CVPR, 5147-5156.
Zhang, R.; Isola, P.; and Efros, A. A. 2016. Colorful image colorization. In Proc. ECCV, 649-666.
Zhang, R.; Isola, P.; and Efros, A. A. 2017. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In Proc. CVPR, 1-11.
Zhao, J.; Mathieu, M.; Goroshin, R.; and LeCun, Y. 2015. Stacked what-where auto-encoders. arXiv preprint arXiv:1506.02351.