# Top-Down Deep Clustering with Multi-Generator GANs

Daniel P. M. de Mello¹, Renato M. Assunção²,¹, Fabricio Murai¹ — ¹Universidade Federal de Minas Gerais, ²Esri Inc.
dani-dmello@hotmail.com, rassuncao@esri.com, murai@dcc.ufmg.br

## Abstract

Deep clustering (DC) leverages the representation power of deep architectures to learn embedding spaces that are optimal for cluster analysis. This approach filters out low-level information irrelevant for clustering and has proven remarkably successful for high-dimensional data spaces. Some DC methods employ Generative Adversarial Networks (GANs), motivated by the powerful latent representations these models are able to learn implicitly. In this work, we propose HC-MGAN, a new technique based on GANs with multiple generators (MGANs), which have not been explored for clustering. Our method is inspired by the observation that each generator of an MGAN tends to generate data that correlates with a sub-region of the real data distribution. We use this clustered generation to train a classifier that infers which generator a given image came from, thus providing a semantically meaningful clustering of the real distribution. Additionally, we design our method so that it is performed in a top-down hierarchical clustering tree, thus proposing the first hierarchical DC method, to the best of our knowledge. We conduct several experiments to evaluate the proposed method against recent DC methods, obtaining competitive results. Last, we perform an exploratory analysis of the hierarchical clustering tree that highlights how accurately it organizes the data in a hierarchy of semantically coherent patterns.

## Introduction

Cluster analysis is a fundamental problem in unsupervised learning, with a wide range of applications, especially in computer vision (Shi and Malik 2000; Achanta and Susstrunk 2017; Joulin, Bach, and Ponce 2010; Liu et al. 2018). Its goal is to assign similar points of the data space to the same cluster, while ensuring that dissimilar points are placed in different clusters. One of the main challenges in this approach is to quantify the similarity between objects. For low-dimensional data spaces, similarity might be straightforwardly defined as the minimization of some geometric distance (e.g., Euclidean distance, squared Euclidean distance, Manhattan distance). On the other hand, choosing the distance metric becomes unfeasible for high-dimensional data distributions. Images are a clear example of this problem, since any distance metric based on raw pixel spaces is subject to all sorts of low-level noisy disturbances irrelevant for determining similarity and suffers from a lack of translation or rotation invariance. This motivates the need for some dimensionality reduction technique, by which the fundamental relationships between objects projected onto the resulting embedding space become more consistent with geometrical distances. In recent years, deep clustering techniques have spearheaded the dimensionality reduction approach to clustering by employing highly non-linear latent representations learned by deep learning models (Krizhevsky, Sutskever, and Hinton 2012).
Considering the unsupervised nature of cluster analysis, the models that naturally arise as candidates for deep clustering are unsupervised deep generative models, since these must learn highly abstract representations of the data as a requirement for realistic and diverse generated samples. One such model is the Generative Adversarial Network (GAN) (Goodfellow et al. 2014), whose extremely realistic results in image generation, semantic interpolation and interpretability in the latent space are evidence of its capacity to learn a powerful latent representation that captures the essential components of the data distribution. Nonetheless, very few works have proposed GAN architectures designed for clustering. Some of these works are (Mukherjee et al. 2019; Chen et al. 2016), where the authors showed that, by manipulating the generator's architecture in a specific way, it is possible to control the class of the training data to which a generated sample belongs, even when class labels are not available during training.

Some recent works employed a GAN architecture with multiple generators (Ghosh et al. 2018; Hoang et al. 2018; Zhang et al. 2018) to achieve greater diversity in image generation, as well as an alternative way of stabilizing the training. In these works, the authors have observed that each generator tends to specialize in generating examples belonging to a specific class of the dataset. This suggests the organic emergence of clusters in the generators' representation of the data, and it is one of the main motivations of our work. The rationale is that clustering would be possible by employing a classifier in charge of distinguishing between the generators, and this classifier could later be applied to the real dataset in order to classify real examples without the use of labels. Additionally, the problem of setting the number of generators to be used for representing the classes of the training set is not addressed in these works, which could be an issue in a real clustering task, where the number of clusters is assumed to be unknown.

In this work, we propose Hierarchical Clustering MGAN (HC-MGAN), a method that leverages the multi-generator GAN for the clustering task. We employ multiple generators, each of them specializing in representing a particular cluster of the training distribution. This should lead to a stronger representation capacity and more meaningful clusters than what a single generator covering multiple clusters can provide. MGANs have not been used in the previous works exploring GANs for clustering. Additionally, we design our method so that it performs the clustering of the training data in a top-down hierarchical way, creating new generators as divisions of subsequent clusters become necessary, i.e., it permits the user to control different clustering granularity levels according to the task at hand. The main contributions of this work are as follows:

- We propose HC-MGAN, a novel deep clustering method that employs GANs with multiple generators.
- We design HC-MGAN as a top-down hierarchical clustering tree, introducing the first hierarchical deep clustering algorithm. Hierarchical clustering allows the user to control different levels of cluster granularity and favors interpretability by taxonomically organizing the clusters.
- We conduct experiments with three image datasets used as benchmarks in other deep clustering works, namely MNIST, Fashion MNIST and Stanford Online Products. We obtain competitive results against recent deep clustering methods, all of which are horizontal (non-hierarchical) and therefore lack the advantages of the hierarchical approach.
- We explore the clustering pattern obtained throughout the tree, displaying how HC-MGAN is able to organize the data in a semantically coherent hierarchy of clusters.

## Related Work

We review some recent work on topics related to Deep Clustering and GANs. Autoencoders (AEs) have been the most prevalent choice in the deep clustering literature (Xie, Girshick, and Farhadi 2016; Yang et al. 2017, 2019; Zhang et al. 2019b), where the clustering objective is usually optimized on the feature representation $\mathcal{Z}$ resulting from the mapping $E: \mathcal{X} \to \mathcal{Z}$ learned by the encoder component of the AE, for a data space $\mathcal{X}$ and a feature space $\mathcal{Z}$. Deep Embedded Clustering (DEC) (Xie, Girshick, and Farhadi 2016) pioneered this approach, obtaining state-of-the-art clustering results. It works by pretraining an AE with a standard reconstruction loss function, and then optimizing it with a regularizer based on the clustering assignments modeled by a target Student's t-distribution, with the cluster centers iteratively updated. DEC's results were surpassed by (Yang et al. 2017), which converted DEC's objective function into an alternating joint optimization with a K-Means loss, thus obtaining a clustering more suited for K-Means.

The use of GANs for clustering tasks has been influenced by InfoGAN (Chen et al. 2016), a type of GAN whose latent variable consists of, besides the usual multidimensional variable $z$, a set $c$ of one-dimensional variables $c_1, c_2, \ldots, c_N$ that are expected to capture semantic information in a disentangled manner without supervision (i.e., with each variable encoding isolated interpretable features of the real data). To obtain this, the authors of InfoGAN proposed an additional term in the generator's loss function that maximizes the mutual information $I(c; G(z, c))$ between a generated image $G(z, c)$ and the latent variable $c$ that originated it. The variables $c$ can be chosen to represent both categorical and continuous features. ClusterGAN (Mukherjee et al. 2019) is an architecture based on InfoGAN and designed for clustering tasks. The generator learns to generate a certain class of the real distribution correlated with a given one-hot format for the latent variables $c$. To obtain this, the authors use an inference encoder network that performs the mapping $E: \mathcal{X} \to \mathcal{Z}$, which is the inverse of the generator's mapping and similar to an encoder's mapping in an autoencoder architecture. After training, the encoder can be employed to classify real data samples according to the latent variable to which each is most correlated, thus providing the clustering. Two key differences of our work are that (i) we use a separate generator (not a discrete latent variable) to encode each cluster, enabling our method to discover clusters with more representation capacity, and that (ii) we obtain hierarchical clusters, unlike the previous methods.

Kundu et al. (2019) propose the GAN-Tree framework, which slightly resembles our approach, since it also involves a hierarchical structure of independent nodes containing GANs capable of generating samples related to different levels of a similarity hierarchy. There are several differences, however.
The most important is that GAN-Tree's main motivation was to provide a framework capable of addressing the trade-off between quality and diversity when generating samples from multi-modal data. The authors claimed that their approach could be readily adapted for clustering tasks, but no definitive experiments with clustering benchmarks were provided. Another important difference lies in their splitting procedure, which is performed with a latent $\hat{z}$ inferred by an encoder $E$ for a sample image $x$, that is, $\hat{z} = E(x)$. For each node of the tree, they decompose their prior for $\hat{z}$ into a mixture of two Gaussians with shifted means. They determine the prior Gaussian component to which $\hat{z}$ is more likely related, and then train the encoder to maximize the likelihood with respect to this prior. For a clustering task, this approach would heavily rely on the assumption that the inference made by $E$, as well as the cluster encoding with the decomposed Gaussians in $\mathcal{Z}$, is sufficient to capture semantically meaningful clustering patterns. The split in our approach, on the other hand, is directly embedded into the GAN training, with each generator automatically learning to represent one cluster. Therefore, in our work the semantic quality of the clustering is directly tied to the GAN's well-known representation learning capacity and, in particular, to the tendency of different generators in MGANs to cover different areas of the training distribution with high semantic discrepancy.

## Proposed Method: HC-MGAN

First, we provide a bird's-eye view of the proposed method. Then we describe in detail how its two phases, Raw Split and Refinement, work.

*Figure 1: Hierarchical clustering tree overview on MNIST. Nodes are represented by membership vectors $s_k$. Grids show images sampled from each distribution $P_k$. Each split divides the probability masses in $s_k$ into two new nodes.*

```
Algorithm 1: Split
Input: X_Data, s_k
1: s_l^(0), s_m^(0) ← raw_split(X_Data, s_k)
2: for t = 1, ..., T do
3:     s_l^(t), s_m^(t) ← refinement(X_Data, s_l^(t-1), s_m^(t-1))
4: return s_l = s_l^(T), s_m = s_m^(T)
```

### Overview of the Hierarchical Scheme

Given a collection of $N$ training examples $X_{Data} = \{x_1, \ldots, x_N\}$, our method constructs a binary tree of hierarchical clusters, iteratively creating nodes from top to bottom. The $k$-th created node is represented by a vector $s_k \in \mathbb{R}^N$, referred to as a membership vector, where each $s_{k,i} = p(Z_i = k \mid x_i, \theta)$ measures the probability of example $i$ belonging to cluster $k$ given $x_i$ and our model's parameters $\theta$, i.e., a soft clustering approach. The initial $s_0$ consists of an all-ones vector. Figure 1 depicts the initial growth of the tree on the MNIST dataset, for the first 7 nodes. The tree grows via a split: a soft clustering operation that takes as input a node $s_k$ and divides its probability masses into two new nodes represented by membership vectors $s_l$ and $s_m$, with $s_l + s_m = s_k$. We decide which leaf to split next by taking the node $s_k$ with the largest total mass. The split mechanism consists of two phases: (i) raw split and (ii) refinement, as shown in Algorithm 1. Before we obtain the final $s_l$ and $s_m$ vectors, vector $s_k$ undergoes a raw split, which outputs the initial membership vectors $s_l^{(0)}$ and $s_m^{(0)}$. In most cases, $s_l^{(0)}$ and $s_m^{(0)}$ are just rough estimates of how to split $s_k$ into two clusters. Hence, we use the refinement phase to get them progressively closer to what we expect the ideal soft clustering assignment to be.
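As an illustration of the hierarchical scheme, the sketch below implements the tree-growth loop implied by Algorithm 1 and the leaf-selection rule above in plain Python/numpy. The functions `raw_split` and `refinement` are hypothetical stand-ins for the GAN-based procedures described in the next sections, assumed to return two membership vectors whose sum equals the parent's; this is a minimal sketch under those assumptions, not the authors' implementation.

```python
import numpy as np

def split(x_data, s_k, raw_split, refinement, T=2):
    """One split of node s_k into two children (Algorithm 1)."""
    s_l, s_m = raw_split(x_data, s_k)            # rough initial split
    for _ in range(T):                           # T refinement iterations
        s_l, s_m = refinement(x_data, s_l, s_m)
    return s_l, s_m

def grow_tree(x_data, n_leaves, raw_split, refinement, T=2):
    """Grow the binary clustering tree until it has n_leaves leaves.

    The root membership vector s_0 is all ones; at each step the leaf with
    the largest total probability mass is split next, as described above.
    """
    leaves = [np.ones(len(x_data))]
    while len(leaves) < n_leaves:
        k = int(np.argmax([s.sum() for s in leaves]))   # heaviest leaf
        s_k = leaves.pop(k)
        leaves.extend(split(x_data, s_k, raw_split, refinement, T))
    return leaves                                # each leaf defines a soft cluster
```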
In Algorithm 1, we can see how the refinement transforms $s_l^{(0)}$ and $s_m^{(0)}$ into two new membership vectors $s_l^{(1)}$ and $s_m^{(1)}$. This process is repeated for $T$ refinement operations to yield the final result $s_l = s_l^{(T)}$ and $s_m = s_m^{(T)}$. Note that $s_l^{(t)} + s_m^{(t)} = s_k$ for every $t$. Figure 2 shows the progressive improvement resulting from the refinement phase with a grid of 25 MNIST samples, for an $s_k$ used as a running example. Samples of each membership vector are shown below its label and color-coded (grayscale) according to their probability mass. In this example, 3's and 5's are the classes mostly associated with $s_k$.

*Figure 2: (Best viewed in color) Split mechanism applied to the running example $s_k$ from MNIST. Grids show samples color-coded (grayscale) according to their probability mass in each membership vector (white: 100%). $s_k$ is mostly associated with 3's and 5's. The raw split roughly separates both classes. Refinement iterations enhance the separation.*

We expect the final split result ($s_l$ and $s_m$) to be a separation of the probability mass of 3's and 5's. After the raw split of $s_k$, the probability mass of 3's and 5's is roughly divided between $s_l^{(0)}$ and $s_m^{(0)}$, but the most ambiguous examples received low probability mass in the membership vector mostly associated with their class. After the first refinement, we observe that $s_l^{(1)}$ and $s_m^{(1)}$ provide a better separation of 3's and 5's. After another iteration, most of the probability mass of the 3's (5's) samples is in $s_l^{(2)}$ ($s_m^{(2)}$).

For the raw split, we use a two-generator MGAN architecture, which we adapt for binary clustering by leveraging the fact that each generator learns to specialize in generating samples from one sub-region of the real data distribution, typically correlated with a specific set of classes of the data. Figure 3 (Top) depicts the training of this MGAN for our running example. The MGAN components are: generators $G_{\alpha k}$, $G_{\beta k}$, discriminator $D_k$ and classifier $C_k$. We need each generator to specialize in sub-regions of $s_k$. Hence, the real data samples used for training them must reflect the sample distribution given by $s_k$ (in the example, these are mostly 3's and 5's). We define a distribution $P_{s_k}$ over the real examples by (sum-to-one) normalizing $s_k$, and use it to sample the training batches. We train the MGAN with the usual adversarial game between generators $G_{\alpha k}$, $G_{\beta k}$ and the discriminator $D_k$, while $C_k$ is trained to distinguish between generators. After a few epochs, we observe that one generator is generating mostly 3's while the other, mostly 5's. An important detail is that we also train the generators to minimize $C_k$'s classification loss, increasing the incentive for $G_{\alpha k}$ and $G_{\beta k}$ to generate samples from different sub-regions. Moreover, we share some parameters between $D_k$ and $C_k$, since it is more desirable for $C_k$ to perform its classification in a higher-level feature space, such as the one learned by $D_k$. After training the MGAN for enough epochs, we can cluster the data as shown in Figure 3 (Bottom). We use $C_k$ to perform the clustering on the real data according to the two generated distributions it learned to identify, which we expect to correlate with different classes of the dataset.

*Figure 3: (Top) Training the raw split components: generators $G_{\alpha k}$, $G_{\beta k}$, discriminator $D_k$ and classifier $C_k$. $P_{s_k}$ draws real samples in proportion to their mass in $s_k$. (Bottom) Clustering the dataset with the raw split classifier.*
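As a small, hedged illustration (not the authors' code), the following numpy sketch shows how real minibatches could be drawn from $P_{s_k}$, i.e., in proportion to each example's mass in $s_k$:

```python
import numpy as np

def sample_real_batch(x_data, s_k, batch_size, rng=None):
    """Draw a real minibatch from P_{s_k}: the distribution obtained by
    sum-to-one normalizing the membership vector s_k, so that examples with
    more probability mass in the node are sampled more often.

    x_data is assumed to be a numpy array of shape (N, ...)."""
    rng = rng or np.random.default_rng()
    p = s_k / s_k.sum()                          # normalize s_k into P_{s_k}
    idx = rng.choice(len(x_data), size=batch_size, p=p)
    return x_data[idx]
```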
In this example, the first generated distribution resembled 3's, while the other resembled 5's. Hence, we expect $C_k$ to mostly assign the probability mass of real 3's to the first soft cluster and that of real 5's to the second. As seen in Figure 3 (Bottom), $C_k$ does manage to roughly assign the 3's (resp. 5's) mass to the cluster related to $G_{\alpha k}$ (resp. $G_{\beta k}$). Note that $C_k$ must be used to perform inference for all examples of every class in the dataset, regardless of whether they were present in the generated distributions, with $p(G = G_{\alpha k} \mid x) + p(G = G_{\beta k} \mid x) = 1$. While $C_k$ can be seen as classifying examples conditioned on them belonging to node $s_k$'s subtree, membership vectors actually correspond to estimates of unconditional probabilities. To enforce the condition that, for each example, the sum of its probability masses in the membership vectors of $s_k$'s children equals its mass in $s_k$, we multiply the probabilities predicted by $C_k$ for each example by its probability in $s_k$. Finally, by noting that some samples in the running example were not well separated into $s_l^{(0)}$ and $s_m^{(0)}$ by classifier $C_k$, we conclude that the two distributions obtained from a raw split may not be sufficiently diverse, or of sufficient quality, to account for the entire set of 3's and 5's the MGAN had access to. This underlines the need for the refinement phase, which is covered in the next section.

We now formalize the MGAN game in the raw split phase. Dropping the subscript $k$ to avoid clutter, we define the objective function of the two-generator MGAN as a sum of two cost functions, $\mathcal{L}_{adv}$ and $\mathcal{L}_{cls}$:

$$\min_{G_\alpha, G_\beta, C} \, \max_{D} \; \mathcal{L} = \mathcal{L}_{adv}(G_\alpha, G_\beta, D) + \lambda \mathcal{L}_{cls}(G_\alpha, G_\beta, C), \tag{1}$$

where $\mathcal{L}_{adv}$ is the cost for the adversarial minimax game between the generators and the discriminator, given by

$$\mathcal{L}_{adv}(G_\alpha, G_\beta, D) = \mathbb{E}_{x \sim P_s}[\log D(x)] + \mathbb{E}_{x \sim P_{G_\alpha}}[\log(1 - D(x))] + \mathbb{E}_{x \sim P_{G_\beta}}[\log(1 - D(x))], \tag{2}$$

and $\mathcal{L}_{cls}$ is the classification cost that is minimized by both generators and the classifier, given by

$$\mathcal{L}_{cls}(G_\alpha, G_\beta, C) = \mathbb{E}_{x \sim P_{G_\alpha}}[\log(C(x))] + \mathbb{E}_{x \sim P_{G_\beta}}[\log(C(x))]. \tag{3}$$

Note that we multiply $\mathcal{L}_{cls}$ by a regularization parameter $\lambda$ to weight its impact on the total cost. Algorithm 2 lists the steps involved in training the MGAN for the raw split and the clustering performed with $C_k$.

```
Algorithm 2: Raw Split
Input: X_Data, s_k
 1: Create components G_αk, G_βk, C_k, D_k
 2: for training iterations do
 3:     sample x, x̂_α, x̂_β from P_sk, P_Gαk, P_Gβk
 4:     Get L_Dk^(real) using D_k criterion on x with real labels
 5:     Get L_Dk^(fake) using D_k criterion on x̂_α, x̂_β with fake labels
 6:     Update θ_Dk with Adam(∇_θDk (L_Dk^(real) + L_Dk^(fake)))
 7:     Get L_Ck using C_k criterion on x̂_α, x̂_β, with categorical labels for G_α, G_β
 8:     Update θ_Ck with Adam(∇_θCk (L_Ck))
 9:     Get L_Gk^(disc) using D_k criterion on x̂_α, x̂_β with real labels
10:     Get L_Gk^(clasf) using C_k criterion on x̂_α, x̂_β with categorical labels for G_α, G_β
11:     Update θ_Gαk,Gβk with Adam(∇_θGk (L_Gk^(disc) + λ L_Gk^(clasf)))
12: for x_i in X_Data do
13:     s_l,i ← C_k^(α out)(x_i) · s_k,i ;   s_m,i ← C_k^(β out)(x_i) · s_k,i
14: return s_l, s_m
```
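To make the raw-split game concrete, below is a hedged PyTorch-style sketch of one training iteration plus the final clustering step of Algorithm 2. It uses binary/categorical cross-entropy criteria; the module definitions, their output shapes (a single real/fake logit for $D$, two generator logits for $C$) and the parameter sharing between $D$ and $C$ are assumptions made for illustration, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def raw_split_step(G_alpha, G_beta, D, C, x_real, z_dim, lam, opt_d, opt_c, opt_g):
    """One raw-split iteration (lines 3-11 of Algorithm 2). D(x) is assumed to
    return a (batch, 1) real/fake logit and C(x) a (batch, 2) logit over generators."""
    b = x_real.size(0)
    x_a = G_alpha(torch.randn(b, z_dim))
    x_b = G_beta(torch.randn(b, z_dim))

    # Discriminator: real samples (drawn from P_{s_k}) vs. both generators' fakes.
    d_loss = (F.binary_cross_entropy_with_logits(D(x_real), torch.ones(b, 1))
              + F.binary_cross_entropy_with_logits(D(x_a.detach()), torch.zeros(b, 1))
              + F.binary_cross_entropy_with_logits(D(x_b.detach()), torch.zeros(b, 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Classifier: which generator produced each fake sample?
    y_a = torch.zeros(b, dtype=torch.long)   # label 0 = G_alpha
    y_b = torch.ones(b, dtype=torch.long)    # label 1 = G_beta
    c_loss = F.cross_entropy(C(x_a.detach()), y_a) + F.cross_entropy(C(x_b.detach()), y_b)
    opt_c.zero_grad(); c_loss.backward(); opt_c.step()

    # Generators: fool D and keep their samples separable by C (weighted by lambda).
    g_adv = (F.binary_cross_entropy_with_logits(D(x_a), torch.ones(b, 1))
             + F.binary_cross_entropy_with_logits(D(x_b), torch.ones(b, 1)))
    g_cls = F.cross_entropy(C(x_a), y_a) + F.cross_entropy(C(x_b), y_b)
    opt_g.zero_grad(); (g_adv + lam * g_cls).backward(); opt_g.step()

def raw_split_membership(C, x_all, s_k):
    """Lines 12-13 of Algorithm 2: scale C's per-generator probabilities by the
    parent mass so that the two child membership vectors sum to s_k."""
    with torch.no_grad():
        p = torch.softmax(C(x_all), dim=1)       # p[:, 0] ~ G_alpha, p[:, 1] ~ G_beta
    return p[:, 0] * s_k, p[:, 1] * s_k
```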
The refinement phase comes right after the raw split. During this phase, some of the probability mass in the two membership vectors obtained by the raw split, $s_l^{(0)}$ and $s_m^{(0)}$, is exchanged so as to iteratively improve the clustering quality (see Figure 2). Each iteration is referred to as a refinement sub-block. Without loss of generality, consider the first refinement sub-block, whose components are depicted in Figure 4 (Top) for our running example. The components are divided into two refinement groups, $l$ and $m$, each formed by a generator $G_j$, a discriminator $D_j$ and a classifier $C_j$, $j \in \{l, m\}$. Each group $j$ takes $s_j^{(0)}$ as input and has its own independent GAN game occurring between $G_j$ and $D_j$. This scheme with two separate GANs is designed to obtain a more focused generative representation of each sub-region of $s_k$ than we were able to obtain in the raw split phase, where a single MGAN discriminator had to learn to discriminate the entire region described by $s_k$. By providing a more focused view of one sub-region to each discriminator, it encounters less variance among the real examples it receives, and thus its discriminative task becomes easier. We expect its adversarial generator's response to be a more diverse and convincing generation of examples associated with that particular sub-region.

*Figure 4: (Top) Refinement training: generators $G_l$, $G_m$, discriminators $D_l$, $D_m$ and classifiers $C_l$, $C_m$. (Bottom) Clustering the dataset with the refinement classifiers.*

As shown in Figure 4 (Top), each GAN in group $j \in \{l, m\}$ draws real samples from its corresponding distribution $P_{s_j}$ over $X_{Data}$, which is equal to the (sum-to-one) normalized vector $s_j$ (analogously to $P_{s_k}$). As a result, the minibatches passed to each group's GAN reflect the probability mass in their respective membership vectors, e.g., in our running example, $P_{s_l}$ draws mostly 3's, since 3's have more probability mass in $s_l^{(0)}$, but it might eventually draw some 5's as well, since there is still some mass for 5's in $s_l^{(0)}$. The role of classifiers $C_l$ and $C_m$ in these two separate GAN games is similar to that of the single classifier $C_k$ used in the raw split phase: learn to distinguish samples from $G_l$ and $G_m$, thereby providing a way to cluster the real data. However, instead of using a single classifier (as in the raw split), we found that using two separate classifiers that share parameters with the respective discriminators forces the classification to occur in a higher-level feature space, achieving better clustering. Furthermore, we also train $G_l$ and $G_m$ to minimize both classifiers' losses, thus providing incentive for each generator to generate samples more strongly correlated with its sub-region (we did something similar for the two generators in the MGAN of the raw split).
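To illustrate the alternating scheme just described, here is a hedged PyTorch-style sketch of one update for a single refinement group (say group $l$; group $m$ is symmetric). Module definitions, output shapes and the parameter sharing between each discriminator and its classifier are assumptions for illustration rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def refinement_group_step(G_int, D_int, C_int, C_ext, x_real, x_fake_ext,
                          z_dim, lam, opt_d, opt_c, opt_g, label_int=0):
    """One update of a refinement group. 'Internal' components belong to the
    group being updated; C_ext and x_fake_ext are the other group's classifier
    and generated samples. Classifiers output 2-way logits over the groups,
    and label_int is the group's index (0 for l, 1 for m)."""
    b = x_real.size(0)
    y_int = torch.full((b,), label_int, dtype=torch.long)
    y_ext = torch.full((b,), 1 - label_int, dtype=torch.long)
    x_fake_int = G_int(torch.randn(b, z_dim))

    # Discriminator: real samples drawn from P_{s_j} vs. this group's fakes.
    d_loss = (F.binary_cross_entropy_with_logits(D_int(x_real), torch.ones(b, 1))
              + F.binary_cross_entropy_with_logits(D_int(x_fake_int.detach()),
                                                   torch.zeros(b, 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Classifier: tell this group's fakes from the other group's fakes.
    c_loss = (F.cross_entropy(C_int(x_fake_int.detach()), y_int)
              + F.cross_entropy(C_int(x_fake_ext.detach()), y_ext))
    opt_c.zero_grad(); c_loss.backward(); opt_c.step()

    # Generator: fool its discriminator and have its samples classified as its
    # own group by both classifiers (weighted by lambda), as in the text above.
    g_adv = F.binary_cross_entropy_with_logits(D_int(x_fake_int), torch.ones(b, 1))
    g_cls = (F.cross_entropy(C_int(x_fake_int), y_int)
             + F.cross_entropy(C_ext(x_fake_int), y_int))
    opt_g.zero_grad(); (g_adv + lam * g_cls).backward(); opt_g.step()
```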
After training the two refinement groups alternately for enough epochs, we perform a clustering re-estimation, as shown in Figure 4 (Bottom). This is similar to the way $C_k$ was used to cluster the real data in the raw split (Figure 3 (Bottom)). Now, both classifiers estimate the probability that a sample $x_i \in X_{Data}$ came from $G_l$ (or its complement, $G_m$), denoted by $C_j^{(l\,out)}(x_i)$ for $j \in \{l, m\}$. We take the average probability $(C_l^{(l\,out)}(x_i) + C_m^{(l\,out)}(x_i))/2$ as the proportion of the mass associated with $x_i$ in $s_k$ that should go into $s_l^{(1)}$ (the complement goes to $s_m^{(1)}$). By using this two-classifier ensemble, we aim to increase the quality of the clustering, since the generators' distributions are expected to be more representative of each sub-region of the dataset than the two generated distributions obtained during the raw split. In the running example, $G_l$'s ($G_m$'s) distribution resembled 3's (5's), so the classifiers assign more of the 3's (5's) probability mass to the cluster associated with $G_l$ ($G_m$). For the subsequent refinement, we expect that using $s_l^{(1)}$ and $s_m^{(1)}$ to train new refinement groups $l$ and $m$ can yield generated distributions that are even more representative of the sub-regions encoded by these membership vectors, providing, in turn, even more information for the classifiers to perform the clustering and to obtain improved membership vectors $s_l^{(2)}$ and $s_m^{(2)}$ (recall Figure 2). Therefore, by repeating the process over $T$ refinements, it is expected that the initial sub-regions captured by $s_k$ become increasingly associated with either of the refinement groups.

We now formalize the two simultaneous GAN games for the refinement training. From the perspective of refinement group $l$, the training can be defined as the optimization of a sum of two cost functions, $\mathcal{L}_{adv}$ and $\mathcal{L}_{cls}$:

$$\min_{G_l, C_l} \, \max_{D_l} \; \mathcal{L}(G_l, D_l, C_l) = \mathcal{L}_{adv}(G_l, D_l) + \lambda \mathcal{L}_{cls}(G_l, C_l), \tag{4}$$

where $\mathcal{L}_{adv}$ describes the cost function for the adversarial minimax game between generator $G_l$ and discriminator $D_l$, which only involves group $l$ components, and is given by

$$\mathcal{L}_{adv}(G_l, D_l) = \mathbb{E}_{x \sim P_{data}}[\log D_l(x)] + \mathbb{E}_{x \sim P_{G_l}}[\log(1 - D_l(x))], \tag{5}$$

and $\mathcal{L}_{cls}$ is the classification cost that is minimized with respect to $G_l$'s and $C_l$'s parameters, but also involves $G_m$ and $C_m$ for computing the cost, given by

$$\mathcal{L}_{cls}(G_l, C_l) = \mathbb{E}_{x \sim P_{G_l}}[\log C_l(x)] + \mathbb{E}_{x \sim P_{G_l}}[\log C_m(x)] + \mathbb{E}_{x \sim P_{G_m}}[\log C_l(x)]. \tag{6}$$

We multiply $\mathcal{L}_{cls}$ by a regularization parameter $\lambda$ to weight its impact on the total cost. The corresponding equations for refinement group $m$ follow analogously. Algorithms 3 and 4 respectively list the steps involved in the external training loop that coordinates the alternating training of refinement groups $l$ and $m$, and in the function called by Algorithm 3 that performs an update for the components of a given refinement group in isolation.

```
Algorithm 3: Refinement
Input: X_Data, s_l^(t), s_m^(t)
1: Create G_l, D_l, C_l for group l and G_m, D_m, C_m for group m
2: for T iterations do
3:     sample x_l, x_m, x̂_l, x̂_m from P_{s_l^(t)}, P_{s_m^(t)}, P_Gl, P_Gm
4:     TrainRefinGroup(G_int = {G_l, D_l, C_l, x_l, x̂_l}, G_ext = {C_m, x̂_m})
       {trains l with the needed external data/components from m}
5:     TrainRefinGroup(G_int = {G_m, D_m, C_m, x_m, x̂_m}, G_ext = {C_l, x̂_l})
       {trains m with the needed external data/components from l}
6: for x_i in X_Data do
7:     s_i,l^(t+1) ← (C_l^(l out)(x_i) + C_m^(l out)(x_i)) · (s_i,l^(t) + s_i,m^(t)) / 2
8:     s_i,m^(t+1) ← (C_l^(m out)(x_i) + C_m^(m out)(x_i)) · (s_i,l^(t) + s_i,m^(t)) / 2
9: return s_l^(t+1), s_m^(t+1)
```

```
Algorithm 4: TrainRefinGroup
Input: G_int = {G_int, D_int, C_int, x, x̂_int}, G_ext = {C_ext, x̂_ext}
# G_int receives the internal data/components of the refinement group currently
# being trained; G_ext receives the external data/components from the neighboring
# refinement group that are needed to train the current group
1: Get L_Dint^(real) using D_int criterion on x with real labels
2: Get L_Dint^(fake) using D_int criterion on x̂_int with fake labels
3: Update θ_Dint with Adam(∇_θDint (L_Dint^(real) + L_Dint^(fake)))
4: Get L_Cint using C_int criterion on x̂_int, x̂_ext, with categorical labels for internal and external generated data
5: Update θ_Cint with Adam(∇_θCint (L_Cint))
6: Get L_Gint^(disc) using D_int criterion on x̂_int with real labels
7: Get L_Gint^(clasf) using C_int criterion on x̂_int, x̂_ext with categorical labels for internal and external generated data
8: Update θ_Gint with Adam(∇_θGint (L_Gint^(disc) + λ L_Gint^(clasf)))
```

## Experiments

As unsupervised tasks forbid hyperparameter tuning, we used only slightly different tunings for each dataset, none of which required labeled supervision, merely aiming to stabilize the generators and avoid overfitting the classification of generators. Code is available at github.com/dmdmello/HC-MGAN, and supplementary/implementation details at arxiv.org/abs/2112.03398. We consider three datasets: MNIST (LeCun et al. 1998), Fashion MNIST (FMNIST) (Xiao, Rasul, and Vollgraf 2017) and Stanford Online Products (SOP) (Oh Song et al. 2016). Following a common practice in DC works, we used all available images for each dataset.
- MNIST: This dataset consists of grayscale images of hand-written digits at 28×28 resolution, with 10 classes and approximately 7k images available for each class.
- FMNIST: This dataset consists of grayscale pictures of clothing-related objects at 28×28 resolution, with 10 classes and exactly 7k images available for each class.
- SOP: This dataset consists of color pictures of products with varying resolutions, with 12 classes and a varied number of examples per class, roughly ranging from 6k to 13k. SOP, in particular, is designed for supervised tasks and is very hard for clustering due to its high intra-class variance. We follow the practice adopted for SOP in (Zhang et al. 2019b), i.e., we convert the images to grayscale, resize them to 32×32, and remove the classes "kettle" and "lamp".

### Evaluation Metrics

We consider two of the most common clustering metrics for benchmarks on each dataset: clustering accuracy (ACC) and normalized mutual information (NMI). As usual, these metrics are computed on the results obtained when setting the number of clusters $C$ to the number of classes in the data. For a direct comparison, we let HC-MGAN's tree grow until it reaches $C$ leaves. Additionally, we convert the final soft clustering outputs to hard assignments by attributing each example to its highest-probability group, and then compute the evaluation metrics.
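For reference, the sketch below shows one common way to compute these metrics from hard assignments: ACC via Hungarian matching of cluster labels to classes, and NMI via scikit-learn. It assumes scipy and scikit-learn and is an illustration, not the paper's evaluation code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """ACC: best one-to-one matching between cluster indices and class labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = int(max(y_true.max(), y_pred.max())) + 1
    count = np.zeros((n, n), dtype=np.int64)          # count[c, t] = co-occurrences
    for t, c in zip(y_true, y_pred):
        count[c, t] += 1
    rows, cols = linear_sum_assignment(-count)        # maximize matched examples
    return count[rows, cols].sum() / len(y_true)

def evaluate(leaf_memberships, y_true):
    """Hard-assign each example to its highest-probability leaf, then score.
    leaf_memberships: array of shape (n_leaves, N) holding the final s vectors."""
    y_pred = np.argmax(leaf_memberships, axis=0)
    return clustering_accuracy(y_true, y_pred), normalized_mutual_info_score(y_true, y_pred)
```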
### Baselines and SOTA Methods

We present the baselines and state-of-the-art methods used in the comparison. We consider five groups of methods: (i) classical, non-Deep-Learning (DL) based: K-Means (MacQueen 1967), SC (Zelnik-Manor and Perona 2004), AC (Gowda and Krishna 1978), NMF (Cai et al. 2009); (ii) varied DC: DEC (Xie, Girshick, and Farhadi 2016), DCN (Yang et al. 2017), JULE (Yang, Parikh, and Batra 2016), VaDE (Jiang et al. 2017), DEPICT (Ghasedi Dizaji et al. 2017), SpectralNet (Shaham et al. 2018), DAC (Chang et al. 2017), DualAE (Yang et al. 2019); (iii) subspace clustering (either DL-based or not): NCSC (Zhang et al. 2019b), SSC (Elhamifar and Vidal 2013), LLR (Liu et al. 2013), KSSC (Patel and Vidal 2014), DSC-Net (Ji et al. 2017), k-SCN (Zhang et al. 2019a); (iv) GAN-based DC: ClusterGAN (Mukherjee et al. 2019), InfoGAN (Chen et al. 2016), DLS-Clustering (Ding and Luo 2019); (v) DC with data augmentation: (Zhang et al. 2019b), IIC (Ji, Henriques, and Vedaldi 2019), DCCM (Wu et al. 2019) and DCCS (Zhao et al. 2020). None of the DC methods we found in the literature are hierarchical. Most results are transcribed from either (Zhao et al. 2020), (Zhang et al. 2019b) or (Mukherjee et al. 2019).

## Results

Table 1 shows the performance comparison between traditional baselines, state-of-the-art DC methods without and with data augmentation, and our method. In order to check the effectiveness of the refinements, we also display the results obtained with raw splits only.

**MNIST.** Our method outperforms the traditional baselines by a large margin. In terms of ACC, it is not among the top 5 presented methods, but it performs reasonably close to them, even outperforming, in either NMI or ACC, some DC methods such as ClusterGAN, InfoGAN, DEC and DCN. This result was obtained by employing the same architecture we used for FMNIST, which might have excessive capacity for MNIST and may even harm the clustering result by making it trivial for a single generator to represent all the data, instead of a cluster of similar data points, during the raw split. This dataset was the one for which the refinements showed the greatest improvement over the raw-split-only experiment.

**FMNIST.** Only DCCS is able to surpass our method's ACC and NMI performance. We emphasize that using data augmentation in DCCS causes a significant improvement, since selecting the right type of augmentations for a specific dataset can reduce much of the intra-class variance. However, augmentation with GANs is challenging (Karras et al. 2020), so we leave this for future work. The refinements still had a positive impact over the raw-split-only experiment.

**SOP.** The results for existing methods were transcribed from the NCSC work (Zhang et al. 2019b). There, the authors state that they handpicked 1k examples per class to create a manageable dataset for clustering, but they did not specify how or which examples were selected. Their choice was made in the context of competing subspace clustering methods to which their model was being compared, many of which are unable to scale to larger datasets due to the memory constraints involved in computing the similarity matrix their methods require. We tried, without success, to contact the authors to obtain the same subset of the data.
Therefore, we decided to evaluate our results on the entire data. Due to the class imbalance, we compute both the mean ACC over classes (reported in the table) and the overall ACC achieved by HC-MGAN, which are, respectively, 0.229 and 0.221. NMI is invariant to class imbalance. On this dataset, our model does not perform as well as the subspace clustering group, especially NCSC, KSSC and DSC-Net, even though it was on par with NCSC and greatly surpassed KSSC and DSC-Net on the other datasets. Compared with the other DC methods reported in (Zhang et al. 2019b), our method performs closely, even outperforming DCN and InfoGAN in accuracy.

| Method | MNIST ACC | MNIST NMI | FMNIST ACC | FMNIST NMI | SOP ACC | SOP NMI |
|---|---|---|---|---|---|---|
| K-means (i) | .572 | .500 | .474 | .512 | - | - |
| SC (i) | .696 | .663 | .508 | .575 | - | - |
| AC (i) | .695 | .609 | .500 | .564 | - | - |
| NMF (i) | .545 | .608 | .434 | .425 | - | - |
| DEC (ii) | .843 | .772 | .590 | .601 | .229 | .121 |
| DCN (ii) | .833 | .809 | .587 | .594 | .213 | .084 |
| JULE (ii) | .964 | .913 | .563 | .608 | - | - |
| VaDE (ii) | .945 | .876 | .578 | .630 | - | - |
| SpectralNet (ii) | .971 | .924 | .533 | .552 | - | - |
| DualAE (ii) | .978 | .941 | .662 | .645 | - | - |
| DAC (ii) | .966 | .967 | .615 | .632 | .231 | .098 |
| SSC (iii) | .430 | .568 | .359 | .181 | .127 | .007 |
| LLR (iii) | .552 | .665 | .345 | .254 | .224 | .173 |
| DSC-Net (iii) | .659 | .730 | .606 | .617 | .269 | .146 |
| KSSC (iii) | .585 | .677 | .382 | .197 | .268 | .152 |
| k-SCN (iii) | .871 | .782 | .638 | .620 | .229 | .166 |
| NCSC (iii) | .941 | .861 | .721 | .686 | .275 | .138 |
| ClusterGAN (iv) | .950 | .890 | .630 | .640 | - | - |
| InfoGAN (iv) | .890 | .860 | .610 | .590 | .198 | .082 |
| DLS-Clst (iv) | .975 | .936 | .693 | .669 | - | - |
| IIC (v) | .992 | .978 | .657 | .637 | - | - |
| DCCM (v) | - | - | .657 | .637 | - | - |
| DCCS (v) | .989 | .970 | .756 | .704 | - | - |
| Ours | .943 | .905 | .721 | .691 | .229 | .072 |
| Ours (raw) | .877 | .856 | .704 | .690 | .204 | .063 |

*Table 1: Clustering performance on the three datasets w.r.t. ACC and NMI (in the original, the top 5 per column are highlighted in bold). Ours indicates HC-MGAN, with (raw) indicating that no refinement operations were performed. (i): non-deep; (ii): varied DC methods; (iii): subspace clustering; (iv): GAN-based DC; (v): DC with data augmentation.*

### Qualitative Analysis

Figure 5 shows a visualization of the clustering tree for FMNIST. For each node $k$ in the tree, it displays a grid of 25 real examples sampled from a multinomial distribution over the dataset with weights for the $i$-th example given by $s_{k,i}$, yielding a representative view of $s_k$'s soft cluster. Moreover, a color-coded bar graph beside each grid displays the percentage of the global probability mass for each class $A$ in $s_k$, given by $100 \cdot \sum_{i=1}^{N} s_{k,i}\,\mathbb{1}(c_i = A) / \sum_{i=1}^{N} \mathbb{1}(c_i = A)$, where $\mathbb{1}$ is an indicator function and $c_i$ is the class of the $i$-th example.

*Figure 5: FMNIST cluster tree. Grids show sampled images; bar graphs beside the $k$-th node's grid represent the % of total probability mass for each class in that node (best viewed in color).*
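As a small aside, the per-class mass percentage shown in the bar graphs could be computed as in the following numpy sketch (an illustration of the formula above, not the authors' plotting code):

```python
import numpy as np

def class_mass_percentage(s_k, classes, target_class):
    """100 * sum_i s_{k,i} * 1(c_i = target) / sum_i 1(c_i = target):
    the share of class `target_class`'s global probability mass held by node k."""
    mask = np.asarray(classes) == target_class
    return 100.0 * np.sum(np.asarray(s_k)[mask]) / np.sum(mask)
```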
Through a quick inspection, we observe that the 1st split ($s_0$) occurred with almost perfect precision, with examples of the same class having their probability mass nearly entirely allocated to either $s_1$ or $s_2$. We begin to notice some imprecision at the 2nd split ($s_2$), with a small portion of the probability mass of the classes coat and t-shirt being assigned to $s_3$ while the largest portion went to $s_4$, and at the 3rd split ($s_1$), with a small portion of the sneakers mass being assigned to $s_5$ and the largest portion going to $s_6$. The most imprecise splits occurred during the 4th, 7th and 9th splits (i.e., $s_4$, $s_7$ and $s_8$), but the classes involved in these splits (t-shirt, pullover, coat and shirt) are the most visually similar in the dataset, thus being the hardest to separate accurately into clusters. The other classes are relatively well separated. One fact that can explain why our method performed very well w.r.t. NMI on FMNIST is that most of the probability mass of each class for which our method exhibited low accuracy was assigned to at most two clusters. For the NMI metric, having the wrongly assigned classes concentrated on fewer clusters provides better mutual information than having them spread throughout more clusters.

## Conclusion

We proposed a method for hierarchical clustering that leverages the representation capacity of GANs/MGANs. To the best of our knowledge, this is the first hierarchical DC method. It creates a tree of clusters from top to bottom, where each leaf node defines a cluster. Clusters are divided via binary splits, performed in two phases: a raw split and a refinement. We have shown how well our method compares to other DC methods, which lack the advantages of hierarchical clustering, obtaining competitive benchmark results.

## Acknowledgments

The authors thank the partial support from Brazilian research supporting agencies: CNPq (grant PQ 313582/20181), FAPEMIG (grant CEX - PPM-00598-17, grant APQ02337-21), and CAPES (grant 88887.506739/2020-00).

## References

- Achanta, R.; and Susstrunk, S. 2017. Superpixels and Polygons Using Simple Non-Iterative Clustering. In Conference on Computer Vision and Pattern Recognition (CVPR).
- Cai, D.; He, X.; Wang, X.; Bao, H.; and Han, J. 2009. Locality Preserving Nonnegative Matrix Factorization. In International Joint Conference on Artificial Intelligence (IJCAI).
- Chang, J.; Wang, L.; Meng, G.; Xiang, S.; and Pan, C. 2017. Deep Adaptive Image Clustering. In IEEE International Conference on Computer Vision (ICCV).
- Chen, X.; Duan, Y.; Houthooft, R.; Schulman, J.; Sutskever, I.; and Abbeel, P. 2016. InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets. In Advances in Neural Information Processing Systems (NeurIPS).
- Ding, F.; and Luo, F. 2019. Clustering by Directly Disentangling Latent Space. arXiv:1911.05210.
- Elhamifar, E.; and Vidal, R. 2013. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI).
- Ghasedi Dizaji, K.; Herandi, A.; Deng, C.; Cai, W.; and Huang, H. 2017. Deep Clustering via Joint Convolutional Autoencoder Embedding and Relative Entropy Minimization. In IEEE International Conference on Computer Vision (ICCV).
- Ghosh, A.; Kulharia, V.; Namboodiri, V.; Torr, P. H. S.; and Dokania, P. K. 2018. Multi-agent Diverse Generative Adversarial Networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative Adversarial Nets. In Advances in Neural Information Processing Systems (NeurIPS).
- Gowda, K. C.; and Krishna, G. 1978. Agglomerative clustering using the concept of mutual nearest neighbourhood. Pattern Recognition.
- Hoang, Q.; Nguyen, T.; Le, T.; and Phung, D. 2018. MGAN: training generative adversarial nets with multiple generators. In International Conference on Learning Representations (ICLR).
- Ji, P.; Zhang, T.; Li, H.; Salzmann, M.; and Reid, I. 2017. Deep Subspace Clustering Networks. In International Conference on Neural Information Processing Systems (NeurIPS).
- Ji, X.; Henriques, J. F.; and Vedaldi, A. 2019. Invariant Information Clustering for Unsupervised Image Classification and Segmentation. In IEEE/CVF International Conference on Computer Vision (ICCV).
- Jiang, Z.; Zheng, Y.; Tan, H.; Tang, B.; and Zhou, H. 2017. Variational Deep Embedding: An Unsupervised and Generative Approach to Clustering. In International Joint Conference on Artificial Intelligence (IJCAI).
- Joulin, A.; Bach, F.; and Ponce, J. 2010. Discriminative clustering for image co-segmentation. In Conference on Computer Vision and Pattern Recognition (CVPR).
- Karras, T.; Aittala, M.; Hellsten, J.; Laine, S.; Lehtinen, J.; and Aila, T. 2020. Training Generative Adversarial Networks with Limited Data. In Advances in Neural Information Processing Systems (NeurIPS).
- Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS).
- Kundu, J. N.; Gor, M.; Agrawal, D.; and Babu, R. V. 2019. GAN-Tree: An Incrementally Learned Hierarchical Generative Framework for Multi-Modal Data Distributions. In IEEE/CVF International Conference on Computer Vision (ICCV).
- LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11): 2278-2324.
- Liu, G.; Lin, Z.; Yan, S.; Sun, J.; Yu, Y.; and Ma, Y. 2013. Robust Recovery of Subspace Structures by Low-Rank Representation. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI).
- Liu, Z.; Lin, G.; Yang, S.; Feng, J.; Lin, W.; and Goh, W. L. 2018. Learning Markov Clustering Networks for Scene Text Detection. In Conference on Computer Vision and Pattern Recognition (CVPR).
- MacQueen, J. 1967. Some methods for classification and analysis of multivariate observations. In Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics.
- Mukherjee, S.; Asnani, H.; Lin, E.; and Kannan, S. 2019. ClusterGAN: Latent Space Clustering in Generative Adversarial Networks. AAAI Conference on Artificial Intelligence (AAAI).
- Oh Song, H.; Xiang, Y.; Jegelka, S.; and Savarese, S. 2016. Deep Metric Learning via Lifted Structured Feature Embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Patel, V. M.; and Vidal, R. 2014. Kernel sparse subspace clustering. In IEEE International Conference on Image Processing (ICIP).
- Shaham, U.; Stanton, K. P.; Li, H.; Basri, R.; Nadler, B.; and Kluger, Y. 2018. SpectralNet: Spectral Clustering using Deep Neural Networks. In International Conference on Learning Representations (ICLR).
- Shi, J.; and Malik, J. 2000. Normalized cuts and image segmentation. Pattern Analysis and Machine Intelligence (PAMI).
- Wu, J.; Long, K.; Wang, F.; Qian, C.; Li, C.; Lin, Z.; and Zha, H. 2019. Deep Comprehensive Correlation Mining for Image Clustering. In IEEE/CVF International Conference on Computer Vision (ICCV).
- Xiao, H.; Rasul, K.; and Vollgraf, R. 2017. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.
- Xie, J.; Girshick, R.; and Farhadi, A. 2016. Unsupervised Deep Embedding for Clustering Analysis. In International Conference on Machine Learning (ICML).
- Yang, B.; Fu, X.; Sidiropoulos, N. D.; and Hong, M. 2017. Towards K-Means-Friendly Spaces: Simultaneous Deep Learning and Clustering. In International Conference on Machine Learning (ICML).
- Yang, J.; Parikh, D.; and Batra, D. 2016. Joint Unsupervised Learning of Deep Representations and Image Clusters. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Yang, X.; Deng, C.; Zheng, F.; Yan, J.; and Liu, W. 2019. Deep Spectral Clustering Using Dual Autoencoder Network. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- Zelnik-Manor, L.; and Perona, P. 2004. Self-Tuning Spectral Clustering. In Advances in Neural Information Processing Systems (NeurIPS).
- Zhang, H.; Xu, S.; Jiao, J.; Xie, P.; Salakhutdinov, R.; and Xing, E. P. 2018. Stackelberg GAN: Towards Provable Minimax Equilibrium via Multi-Generator Architectures. arXiv:1811.08010.
- Zhang, T.; Ji, P.; Harandi, M.; Hartley, R.; and Reid, I. 2019a. Scalable Deep k-Subspace Clustering. In Asian Conference on Computer Vision (ACCV) 2018.
- Zhang, T.; Ji, P.; Harandi, M.; Huang, W.; and Li, H. 2019b. Neural Collaborative Subspace Clustering. In International Conference on Machine Learning (ICML).
- Zhao, J.; Lu, D.; Ma, K.; Zhang, Y.; and Zheng, Y. 2020. Deep Image Clustering with Category-Style Representation. In European Conference on Computer Vision (ECCV).