Multi-Level Cross-Modal Alignment for Image Clustering

Liping Qiu*, Qin Zhang*, Xiaojun Chen†, Shaotian Cai
Shenzhen University, Shenzhen, China
qiuliping2021@email.szu.edu.cn, {qinzhang, xjchen}@szu.edu.cn, cai.st@foxmail.com

*These authors contributed equally. †Corresponding author.

Abstract

Recently, cross-modal pretraining models have been employed to produce meaningful pseudo-labels to supervise the training of image clustering models. However, numerous erroneous alignments in a cross-modal pretraining model can produce poor-quality pseudo-labels and degrade clustering performance. To solve this issue, we propose a novel Multi-Level Cross-Modal Alignment method that improves the alignments in a cross-modal pretraining model for downstream tasks by building a smaller but better semantic space and aligning images and texts at three levels, i.e., instance level, prototype level, and semantic level. Theoretical results show that our proposed method converges and suggest effective means to reduce its expected clustering risk. Experimental results on five benchmark datasets clearly show the superiority of our new method.

Introduction

Image clustering, which groups images into different clusters without labels, is an essential task in unsupervised learning. Many methods utilize large-scale pre-training models such as ResNet (He et al. 2016) or ViT (Dosovitskiy et al. 2020) to extract high-quality representations for image clustering (Ji, Vedaldi, and Henriques 2019; Li et al. 2021; Zhong et al. 2021; Wu et al. 2019; Van Gansbeke et al. 2020; Dang et al. 2021). Unsupervised classification models are then trained with various indirect loss functions, e.g., sample relations (Chang et al. 2017a), invariant information (Ji, Vedaldi, and Henriques 2019; Li et al. 2021), mutual information (Wu et al. 2019), and entropy (Huang, Gong, and Zhu 2020; Van Gansbeke et al. 2020; Li et al. 2021). However, as pointed out in (Cai et al. 2023), these techniques have difficulty handling examples that are semantically different but visually similar, because they rely on images alone.

Recently, many vision-language pre-training (VLP) models have been developed to align images and texts in a unified semantic space (Li et al. 2019; Chen et al. 2020; Ramesh et al. 2021; Li et al. 2020; Radford et al. 2021; Jia et al. 2021).

Figure 1: The nearest noun in WordNet for images in STL10, where the image and text embeddings are obtained via CLIP. Green words correspond to correct alignments, while red words indicate incorrect alignments.

To utilize VLP models for image clustering, Cai et al. (Cai et al. 2023) proposed to use CLIP (Radford et al. 2021) to produce meaningful pseudo-labels and achieved significant improvements on a wide range of datasets in comparison with conventional image clustering methods. Li et al. (Li, Savarese, and Hoi 2022) also used CLIP for zero-shot image classification. The success of these methods suggests a promising direction for image clustering. However, as depicted in Figure 1, some alignments between images and texts in CLIP are incorrect for downstream tasks, resulting in subpar pseudo-labels and poor clustering performance.
SIC (Cai et al. 2023) simply uses CLIP to obtain the embeddings of images and texts and cannot deal with incorrect alignments. MUST (Li, Savarese, and Hoi 2022) strives to optimize the image encoder in CLIP, but it updates the image encoder with the pretraining task, which makes the process slow.

To address this problem, we propose a novel method, namely Multi-Level Cross-Modal Alignment (MCA), an efficient way to improve the alignments between images and texts in CLIP for clustering tasks. Our main contributions are as follows:

- We propose to use the hierarchical structure in WordNet (see Figure 1) to filter irrelevant words and construct a smaller but better semantic space, thus reducing the influence of unrelated nouns on clustering. Our experimental findings demonstrate that this filtering reduces the number of words by up to 60% and significantly enhances clustering performance compared to SIC.
- We propose to optimize both image and text embeddings for downstream tasks by aligning images and texts at three levels, i.e., instance level, prototype level, and semantic level. Our proposed method fixes the incorrect alignments in CLIP for downstream tasks better than SIC and MUST.
- Theoretical findings demonstrate that our proposed method converges at a sublinear rate and offer effective strategies for lowering its expected clustering risk. These findings provide valuable guidance for the design of new image clustering methods.
- Experimental results on five benchmark datasets clearly show the superiority of our new method, especially when dealing with complex clusters.

Related Work

Early deep clustering methods simply combine representation learning and shallow clustering (Xie, Girshick, and Farhadi 2016; Yang et al. 2017; Tian, Zhou, and Guan 2017; Shaham and Stanton 2018). With the rapid development of the pre-training paradigm, many methods employ large-scale pre-training models such as ResNet (He et al. 2016) or ViT (Dosovitskiy et al. 2020) to extract high-quality representations and train a classification model, by maximizing the consistency between each image and its augmentations/neighbors (Ji, Vedaldi, and Henriques 2019; Li et al. 2021; Zhong et al. 2021; Wu et al. 2019; Van Gansbeke et al. 2020; Dang et al. 2021) or by generating pseudo-labels (Wu et al. 2019; Van Gansbeke et al. 2020). However, as pointed out in (Cai et al. 2023), it is challenging for these techniques to handle examples that are semantically different but visually similar when only the visual information in images is accessible.

Cross-modal clustering has made significant progress in recent years. It usually learns a shared subspace that maximizes the mutual agreement between multiple modalities, e.g., via Canonical Correlation Analysis (CCA) (Gao et al. 2020) or mutual information optimization (Mao et al. 2021). However, these methods require image-text pairs as input, which may be cost-intensive to collect in real applications. Recently, vision-language pre-training (VLP) models that align multi-modal data in a common feature space via different pre-training tasks have been proposed. For example, VisualBERT (Li et al. 2019), UNITER (Chen et al. 2020), and DALL-E (Ramesh et al. 2021) use language-based training strategies, including masked language modeling (e.g., Masked Language/Region Modeling) and autoregressive language modeling (e.g., image captioning and text-grounded image generation). CLIP (Radford et al. 2021) and ALIGN (Jia et al. 2021) utilize cross-modal contrastive learning to align visual and textual information in a unified semantic space. Since a VLP model captures the relationships among images and texts (low-level semantics), it is natural to utilize VLP models to compensate for the missing semantic information and improve image clustering. Cai et al. (Cai et al. 2023) proposed to use CLIP (Radford et al. 2021) to generate meaningful pseudo-labels for image clustering. Li et al. (Li, Savarese, and Hoi 2022) also proposed to use CLIP for zero-shot image classification.
Notation and Problem Definition

Suppose we have an image dataset $X = \{x_1, x_2, \ldots, x_n\}$ with $n$ instances sampled i.i.d. from an input space $\mathcal{D}$. We obtain the embeddings of these images as $U = \{u_1, u_2, \ldots, u_n\}$, where $u_i = e_I(x_i) \in \mathbb{R}^{d \times 1}$ is produced by the image encoder $e_I(\cdot)$ of CLIP and $d$ is the embedding dimension. To capture the semantic meaning of these images, we first introduce a noun vocabulary $T = \{t_1, t_2, \ldots, t_m\}$ that includes $m$ noun phrases sampled from WordNet (Miller 1995). We then obtain the embeddings of these $m$ words as $V = \{v_1, v_2, \ldots, v_m\}$, where $v_i = e_T(s_i) \in \mathbb{R}^{d \times 1}$, $s_i$ is a sentence like "A photo of a {t_i}", and $e_T(\cdot)$ is the text encoder of CLIP.

Let $c$ be the number of categories. Our goal is to group the images in $X$ into $c$ clusters with the help of CLIP. Let $f_I(e_I(X); \phi)$ denote the image classification network with parameters $\phi$ that maps an image $x_i$ into a soft cluster assignment probability vector $q_i \in \mathbb{R}^{c \times 1}$, and let $f_S(e_T(T); \theta)$ denote the text classification network with parameters $\theta$ that maps a word $t_i$ into a soft cluster assignment probability vector $p_i \in \mathbb{R}^{c \times 1}$. Notably, $e_I$ and $e_T$ in CLIP are kept frozen during the training process.
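To make the notation concrete, the following minimal sketch shows one way the frozen embeddings $U$ and $V$ could be obtained with the Hugging Face CLIP implementation. The checkpoint name, prompt template, and batching are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal sketch (not the authors' code): frozen CLIP embeddings for images and noun prompts.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()  # assumed checkpoint
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed_images(paths):
    """e_I(.): map a list of image paths to L2-normalized embeddings U (n x d)."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt").to(device)
    feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

@torch.no_grad()
def embed_nouns(nouns):
    """e_T(.): map nouns t_i to embeddings V (m x d) via the prompt 'A photo of a {t_i}'."""
    prompts = [f"A photo of a {t}" for t in nouns]
    inputs = processor(text=prompts, padding=True, return_tensors="pt").to(device)
    feats = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)
```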
Method

In this paper, we propose the method shown in Figure 2. It consists of three components: 1) semantic space construction, which builds a proper semantic space $T$; 2) image consistency learning, which performs consistency learning in the image space; and 3) multi-level cross-modal alignment, which aligns images and texts at three distinct levels: instance level, prototype level, and semantic level.

Figure 2: The framework of MCA consists of three parts: (1) semantic space construction, (2) image consistency learning, and (3) multi-level cross-modal alignment. The thickness of the lines in adaptive instance-level alignment reflects the magnitude of the attention scores.

Semantic Space Construction

Constructing a proper semantic space $T$ from WordNet (Miller 1995), such that the images can be well represented by the words in $T$, is very important for image clustering: a too-small $T$ may lose important relevant words, while a too-large $T$ may contain too many irrelevant, noisy words. We first build a candidate semantic space $W$ from the roughly 82,000 nouns in WordNet (Miller 1995). Considering that an image dataset usually covers only a small part of these semantics, we propose a two-step filtering strategy to construct a proper semantic space for an image dataset:

1) Uniqueness-based filtering keeps the most unique nouns, i.e., those whose uniqueness scores (Cai et al. 2023) are greater than a given hyperparameter $\rho_u$, and then selects the $\gamma_r$ nearest words for each of the $c$ image cluster centers obtained by k-means, forming a candidate set $W_c$.

2) Hierarchy-based filtering employs the hierarchical structure in WordNet to further filter $W_c$ and form the final semantic space $T$. Let $T = \emptyset$. Given an image $x_i$, we find its nearest noun $w_i \in W_c$ and search its hierarchical structure in WordNet (Miller 1995) to form a hierarchical semantic tree. In general, the words in the lower layers provide more fine-grained information for distinguishing the images, while the words in the higher layers may be useless for clustering. For example, as Figure 1 shows, "mammal" is a common parent of "dog" and "cat" and cannot distinguish images of dogs from images of cats. Therefore, our hierarchy-based filtering strategy discards the top $\gamma_h$ levels (excluding the root node) and adds each of the remaining words to $T$ if it is also in $W_c$. A simplified sketch of this step is given below.
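The hierarchy-based filtering step can be illustrated with NLTK's WordNet interface. This is a simplified sketch under stated assumptions: it treats a noun's minimum hypernym depth as its level in the hierarchy, and `gamma_h` and `candidate_nouns` (standing in for $W_c$) are placeholders; the authors' per-image tree construction may differ.

```python
# Simplified sketch of hierarchy-based filtering (assumption: a noun's level is the
# minimum depth of its noun synsets in the WordNet hypernym hierarchy).
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet") beforehand

def noun_depth(noun):
    """Approximate level of a noun: minimum depth over its noun synsets (root synsets have depth 0)."""
    synsets = wn.synsets(noun, pos=wn.NOUN)
    return min(s.min_depth() for s in synsets) if synsets else 0

def hierarchy_filter(candidate_nouns, gamma_h):
    """Drop nouns that live in the top gamma_h levels of the hierarchy (too generic,
    e.g. 'mammal'), keeping the more fine-grained, discriminative nouns."""
    return [t for t in candidate_nouns if noun_depth(t) > gamma_h]

# Illustrative usage with hypothetical values: generic nouns such as 'entity' or 'mammal'
# sit higher in the tree than fine-grained ones such as 'poodle' and are filtered out.
T = hierarchy_filter(["entity", "mammal", "dog", "poodle"], gamma_h=5)
```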
Image Consistency Learning

Intuitively, an image and its nearest images should have similar soft cluster assignments. Therefore, we propose the following loss function for image consistency learning:

$$\mathcal{L}_I(f_I(e_I(X); \phi)) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j = rn(\mathcal{N}^I_{k_I}(x_i))} \log q_i^{\top} q_j + \eta \sum_{l=1}^{c} \bar{q}_l \log \bar{q}_l \qquad (1)$$

where $\mathcal{N}^I_{k_I}(x_i)$ contains the $k_I$ nearest images of $x_i$ and $rn(\mathcal{N}^I_{k_I}(x_i))$ randomly selects one sample from $\mathcal{N}^I_{k_I}(x_i)$. The second term is the popular negative entropy loss, which prevents the trivial solution in which most samples are assigned to a small proportion of clusters; here $\bar{q}_l = \frac{1}{n}\sum_{i=1}^{n} q_{il}$ is the average cluster assignment and $\eta$ is a trade-off parameter.
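A minimal PyTorch sketch of this loss follows, assuming `q` is the $n \times c$ matrix of soft assignments and `neighbor_idx[i]` lists the $k_I$ nearest-image indices of $x_i$ (both names are placeholders, not the authors' code).

```python
import torch

def image_consistency_loss(q, neighbor_idx, eta=1.0, eps=1e-8):
    """Sketch of Eq. (1): dot-product consistency with one randomly drawn neighbor
    plus a negative-entropy term on the average assignment (prevents collapse)."""
    n, _ = q.shape
    # randomly pick one neighbor per image from its k_I-nearest-neighbor list
    pick = torch.randint(0, neighbor_idx.shape[1], (n,), device=q.device)
    j = neighbor_idx[torch.arange(n, device=q.device), pick]
    consistency = -torch.log((q * q[j]).sum(dim=1) + eps).mean()
    q_bar = q.mean(dim=0)                              # average cluster assignment
    neg_entropy = (q_bar * torch.log(q_bar + eps)).sum()
    return consistency + eta * neg_entropy
```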
Multi-Level Cross-Modal Alignment

When using a cross-modal pretraining model for image clustering, the main challenge is to rectify incorrect alignments between images and words. In this paper, we propose a novel multi-level cross-modal alignment method for this task, shown in Figure 2. Our method aligns the two modalities at three levels. First, instance-level alignment aligns each image with its neighboring texts. Second, prototype-level alignment aligns each image prototype with its nearest text prototype. Finally, semantic-level alignment aligns each image with its neighboring texts in the semantic space. The three alignment processes are described below.

Instance-level alignment: Given an image $x_i$ and its neighboring texts $\mathcal{N}^S_{k_S}(x_i)$, we propose the following contrastive loss function to facilitate the alignment:

$$\mathcal{L}_{ia} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j = rn(\mathcal{N}^S_{k_S}(x_i))} \log \frac{\exp(q_i^{\top} p_j / \tau_{ia})}{\sum_{l=1, l \neq j}^{m} \exp(q_i^{\top} p_l / \tau_{ia})} \qquad (2)$$

where $\tau_{ia}$ is a temperature parameter.

Prototype-level alignment: Instance-level alignment may be affected by noisy neighborhood relationships, so we further align images and texts at the prototype level, which is more robust to noisy texts. We first compute an image prototype set $H^I$, where each $h^I_l \in H^I$ is computed as $h^I_l = \frac{1}{\sum_{i=1}^{n} q_{il}} \sum_{i=1}^{n} q_{il} u_i$. Then, for each image prototype $h^I_l \in H^I$, we identify the word in $T$ that is closest to $h^I_l$ and thereby construct a text prototype set $H^S$. To further improve the prototypes in $H^S$, we find the $k_p$ nearest neighbors of each $h^S_l \in H^S$, compute the prototype of these neighbors, and use it to replace $h^S_l$ in $H^S$. Finally, to align images and texts at the prototype level, we propose the following loss function:

$$\mathcal{L}_{pa} = -\sum_{j=1}^{c} \log \frac{\exp(f_I(h^I_j; \phi)^{\top} f_S(h^S_j; \theta) / \tau_{pa})}{\sum_{l=1, l \neq j}^{c} \exp(f_I(h^I_j; \phi)^{\top} f_S(h^S_l; \theta) / \tau_{pa})} \qquad (3)$$

where $\tau_{pa}$ is a temperature parameter.
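Both alignment losses are temperature-scaled contrastive objectives over assignment vectors. The compact PyTorch sketch below assumes `q` ($n \times c$), `p` ($m \times c$), `text_neighbor_idx`, the prototype construction, and the temperature values as placeholders for the quantities defined above; it is an illustration rather than the authors' implementation.

```python
import torch

def instance_level_loss(q, p, text_neighbor_idx, tau_ia=0.5):
    """Sketch of Eq. (2): contrast each image assignment q_i against one randomly
    drawn neighboring text versus all other texts."""
    n = q.shape[0]
    rows = torch.arange(n, device=q.device)
    pick = torch.randint(0, text_neighbor_idx.shape[1], (n,), device=q.device)
    j = text_neighbor_idx[rows, pick]                    # one neighboring text per image
    logits = q @ p.t() / tau_ia                          # n x m assignment similarities
    pos = logits[rows, j]
    keep = torch.ones_like(logits, dtype=torch.bool)
    keep[rows, j] = False                                # denominator runs over l != j
    denom = torch.logsumexp(logits.masked_fill(~keep, float("-inf")), dim=1)
    return -(pos - denom).mean()

def image_prototypes(q, u, eps=1e-8):
    """h^I_l of Eq. (3): soft-assignment-weighted mean of image embeddings per cluster (c x d)."""
    return (q.t() @ u) / (q.sum(dim=0, keepdim=True).t() + eps)

def prototype_level_loss(zi, zs, tau_pa=0.5):
    """Sketch of Eq. (3): contrast matched image/text prototype assignments,
    zi = f_I(H^I), zs = f_S(H^S), both c x c."""
    logits = zi @ zs.t() / tau_pa
    pos = logits.diag()
    keep = ~torch.eye(logits.shape[0], dtype=torch.bool, device=logits.device)
    denom = torch.logsumexp(logits.masked_fill(~keep, float("-inf")), dim=1)
    return -(pos - denom).sum()
```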
Semantic-level alignment: Given an image $x_i$ and its neighboring texts $\mathcal{N}^S_{k_S}(x_i)$, let $v_j = e_T(s_j)$ be the embedding of $t_j \in \mathcal{N}^S_{k_S}(x_i)$, where $s_j$ is a sentence like "A photo of a {t_j}", and let $p_j = f_S(v_j; \theta)$. Intuitively, the neighboring texts of an image can help determine the cluster assignment of that image. Since the alignment relationships between images and texts may vary across downstream tasks, we use the attention mechanism (Vaswani et al. 2017) to quantify the correlations between an image and its neighboring texts. Specifically, we compute $p'_i$ as the weighted combination of the neighboring texts' assignments:

$$p'_i = f_A(u_i, V_S, P_S; W_I, W_S) = \sum_{j \in \mathcal{N}^S_{k_S}(x_i)} \mathrm{softmax}\big((W_I u_i)^{\top} W_S v_j\big)\, p_j \qquad (4)$$

where $f_A(\cdot; W_I, W_S)$ is the attention network that quantifies the correlations between an image and its neighboring texts, and $W_I, W_S \in \mathbb{R}^{d \times d}$ are two parameter matrices. We then use the argmax operation to generate a one-hot pseudo-label for $x_i$:

$$q'_i = \text{one-hot}(c, \arg\max_l p'_{il}) \qquad (5)$$

where one-hot$(c, l)$ generates a $c$-bit one-hot vector with the $l$-th element set to 1. Here, $q'_i$ can be regarded as the semantic cluster assignment of $x_i$. We therefore align the semantic cluster assignment $q'_i$ with the image cluster assignment $q_i$ for each image $x_i$ via the following loss function:

$$\mathcal{L}_{sa} = \frac{1}{n} \sum_{i=1}^{n} CE(q_i, q'_i) \qquad (6)$$

where $CE(\cdot)$ is the cross-entropy function. The overall alignment loss function is:

$$\mathcal{L}_A(f_I(e_I(X); \phi), f_S(e_T(T); \theta), f_A(\cdot; W_I, W_S)) = \mathcal{L}_{ia} + \lambda_{pa} \mathcal{L}_{pa} + \lambda_{sa} \mathcal{L}_{sa} \qquad (7)$$

where $\lambda_{pa}$ and $\lambda_{sa}$ are two trade-off parameters.

The Overall Objective

For simplicity, we write $S = (X, T)$ and $\varphi = (\phi, \theta, W_I, W_S)$, and let $g$ denote the combination of $f_I$, $f_S$, and $f_A$, so that the losses above can be written as functions of $g(S; \varphi)$. The overall objective is then

$$\mathcal{L}(g(S; \varphi)) = \mathcal{L}_I(g(S; \varphi)) + \lambda_a \mathcal{L}_A(g(S; \varphi)) \qquad (8)$$

where $\lambda_a$ is a trade-off parameter.
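The attention-based pseudo-labeling of Eqs. (4)-(6) and the combination in Eq. (8) can be sketched as follows. The tensor names (`u`, `v_neighbors`, `p_neighbors`, `q`) and the use of per-image neighbor tensors are illustrative assumptions about how the inputs are batched.

```python
import torch
import torch.nn.functional as F

class SemanticLevelAlignment(torch.nn.Module):
    """Sketch of Eqs. (4)-(6): attention over neighboring texts -> one-hot pseudo-label -> CE."""
    def __init__(self, d):
        super().__init__()
        self.W_I = torch.nn.Linear(d, d, bias=False)   # W_I in Eq. (4)
        self.W_S = torch.nn.Linear(d, d, bias=False)   # W_S in Eq. (4)

    def forward(self, u, v_neighbors, p_neighbors, q):
        # u: n x d image embeddings; v_neighbors: n x k_S x d neighboring text embeddings
        # p_neighbors: n x k_S x c neighboring text assignments; q: n x c image assignments
        scores = torch.einsum("nd,nkd->nk", self.W_I(u), self.W_S(v_neighbors))
        attn = F.softmax(scores, dim=1)                           # image-to-neighbor correlations
        p_prime = torch.einsum("nk,nkc->nc", attn, p_neighbors)   # Eq. (4)
        pseudo = p_prime.argmax(dim=1)                            # Eq. (5), one-hot via class index
        return F.nll_loss(torch.log(q + 1e-8), pseudo)            # Eq. (6), CE against q

# Overall objective, Eq. (8) (lambda_* are the trade-off parameters above):
# total = L_I + lambda_a * (L_ia + lambda_pa * L_pa + lambda_sa * L_sa)
```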
Theoretical Analysis

In this part, we first analyze the convergence of our proposed method and then analyze its expected clustering risk. We introduce the following assumptions:

- Image Neighborhood Consistency Bound: $\forall x_i \in X$ and $\forall x_j \in \mathcal{N}^I_{k_I}(x_i)$, $q_i^{\top} q_j \in [\mu_I, 1]$.
- Cross-modal Neighborhood Consistency Bound: $\forall x_i \in X$ and $\forall t_j \in \mathcal{N}^S_{k_S}(x_i)$, $q_i^{\top} p_j \in [\mu_C, 1]$.
- Image Prediction Confidence Bound: $\forall x_i \in X$, $\max_l q_{il} \geq \mu_p$.
- Image Neighborhood Imbalance Bound: $\forall x_i \in X$, $x_i$ appears in the nearest neighborhoods of at most $k_I$ samples in $X$.

We first give the following theorem, which shows that the optimization algorithm theoretically converges to a local optimum at a sublinear rate.

Theorem 1. Suppose that $g(S; \varphi)$ is twice differentiable with bounded gradients and Hessians, and that $\mathcal{L}(g(S; \varphi))$ has an $L$-Lipschitz continuous gradient. Suppose that the learning rate $\eta_{\varphi}$ satisfies $\eta_{\varphi} = \min\{1, C/T\}$ for some $C > 0$ such that $T \geq L$. Then our proposed method achieves $\min_{0 \le t \le T} \mathbb{E}\big[\|\nabla \mathcal{L}(g(S; \varphi^{(t)}))\|_2^2\big] \le \epsilon$ in $O(1/\epsilon^2)$ steps, where $\epsilon$ is a small positive real number.

Next, we analyze the ability of our method to achieve good clustering performance on unseen data. Let $\hat{\mathcal{L}}_n(g)$ denote the empirical clustering risk of MCA and $\mathcal{L}(g)$ its expectation, and let $\mathcal{G}$ denote the family of functions $g$. By analyzing the generalization bound of our proposed method, we obtain the following theorem.

Theorem 2. Suppose $f_I(\cdot; \phi)$ is Lipschitz smooth with constant $L_I$ and $\|u\| \le M_u$ for all $u \in U$. Suppose $\beta(\frac{1}{n}\sum_{i=1}^{n} q_{il} u_i) = f_I(h^I; \phi)^{\top} f_S(h^S; \theta)$ is $L_{IS}$-Lipschitz continuous, where $h^I_l$ and $h^S_l$ are computed as described in the Method section. For any $0 < \delta < 1$, with probability at least $1 - \delta$, the following inequality holds for any $g \in \mathcal{G}$:

$$\mathcal{L}(g) \le \hat{\mathcal{L}}_n(g) + \frac{c_1}{\sqrt{n}} + c_2 \sqrt{\frac{1}{2n} \log \frac{1}{\delta}} + \frac{2 d L_{IS} M_u}{\sqrt{n}}$$

where $c_1 = 2\mu_I^{-1} + 2\eta C + 2\lambda_a m / \tau_{ia} + 2\lambda_a \lambda_{pa} d L_{IS} M_u / \tau_{pa} + 2\lambda_a \lambda_{sa} c \log \mu_p^{-1}$ and $c_2 = (2 + 2k_I) \log \mu_I^{-1} + \eta C + 2\lambda_a (1 - \mu_C) + \lambda_a \lambda_{pa} \frac{d c L_I M_u^2}{\tau_{pa}} + 2\lambda_a \lambda_{sa} c \log \mu_p^{-1}$ are constants depending on $\{n, m, \mu_I, \mu_C, \mu_p, k_I, c, L_{IS}, L_I, M_u, d, C\}$, and $C$ is a constant.

Theorem 2 shows that, with high probability $1 - \delta$, our proposed method has a bounded expected clustering risk on unseen data; in other words, the expected clustering risk of MCA is theoretically guaranteed in clustering tasks. Note that the margin $\mathcal{L}(g) - \hat{\mathcal{L}}_n(g)$ is inversely related to $\mu_I$ and $\mu_C$, which reflect the neighborhood consistency in the image domain and across modalities, and to $\mu_p$, which reflects the prediction confidence; improving neighborhood consistency in both domains and increasing prediction confidence therefore reduce the expected risk of MCA. Meanwhile, the margin $\mathcal{L}(g) - \hat{\mathcal{L}}_n(g)$ is proportional to $k_I$, which reflects the neighborhood overlap in the image domain, indicating that reducing the neighborhood imbalance (e.g., by setting a smaller number of neighbors $k_I$ or by filtering neighborhoods) also reduces the expected risk of MCA.

Experiments and Analysis

In this section, experiments are conducted on five image benchmark datasets to validate the effectiveness of our proposed method.

Experimental Setup

Benchmarks and implementation details. We used the following five benchmark datasets in our experiments: STL10 (Coates, Ng, and Lee 2011), Cifar10 (Krizhevsky 2009), Cifar100-20 (Krizhevsky 2009), ImageNet-Dogs (Chang et al. 2017b), and Tiny-ImageNet (Le and Yang 2015).

Evaluation Metrics. We used three metrics to evaluate clustering results: clustering Accuracy (ACC), Normalized Mutual Information (NMI) (McDaid, Greene, and Hurley 2011), and Adjusted Rand Index (ARI) (Hubert and Arabie 1985). For all three metrics, a higher value means better performance.
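For reference, clustering ACC is usually computed by finding the best one-to-one mapping between predicted clusters and ground-truth classes with the Hungarian algorithm, while NMI and ARI are available in scikit-learn. The sketch below is a standard implementation of these metrics and is not tied to the authors' evaluation code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """ACC: accuracy under the best cluster-to-class mapping (Hungarian algorithm)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = int(max(y_true.max(), y_pred.max())) + 1
    cost = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                       # co-occurrence counts of (cluster, class)
    row, col = linear_sum_assignment(-cost)   # maximize matched counts
    return cost[row, col].sum() / len(y_true)

def evaluate(y_true, y_pred):
    """Return the three metrics used in the paper."""
    return {
        "ACC": clustering_accuracy(y_true, y_pred),
        "NMI": normalized_mutual_info_score(y_true, y_pred),
        "ARI": adjusted_rand_score(y_true, y_pred),
    }
```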
Comparisons with State-of-the-Art Methods

Setup. We took the entire list of nouns in WordNet (Miller 1995), which contains more than 82,000 nouns, as the initial semantic set for filtering. To evaluate the effectiveness of our proposed method, we compare it with 22 state-of-the-art clustering methods on the five datasets: kmeans (MacQueen 1967), SC (Zelnik-Manor 2005), NMF (Cai et al. 2009), JULE (Yang, Parikh, and Batra 2016), SAE (Ng et al. 2011), DAE (Vincent et al. 2010), AE (Bengio et al. 2006), VAE (Kingma and Welling 2013), DEC (Xie, Girshick, and Farhadi 2016), ADC (Haeusser et al. 2018), DeepCluster (DC) (Caron et al. 2018), DAC (Chang et al. 2017a), DDC (Chang et al. 2019), DCCM (Wu et al. 2019), IIC (Ji, Vedaldi, and Henriques 2019), PICA (Huang, Gong, and Zhu 2020), GCC (Zhong et al. 2021), CC (Li et al. 2021), TCL (Li et al. 2022), SCAN (Van Gansbeke et al. 2020), NNM (Dang et al. 2021), and SIC (Cai et al. 2023). We repeated the training five times independently on each dataset and report the mean and standard deviation.

Table 1: Clustering results on five benchmark datasets. The best results are highlighted in bold. A dash denotes a result not reported for that dataset.

| Method | STL10 ACC | STL10 NMI | STL10 ARI | Cifar10 ACC | Cifar10 NMI | Cifar10 ARI | Cifar100-20 ACC | Cifar100-20 NMI | Cifar100-20 ARI | ImageNet-Dogs ACC | ImageNet-Dogs NMI | ImageNet-Dogs ARI | Tiny-ImageNet ACC | Tiny-ImageNet NMI | Tiny-ImageNet ARI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| kmeans | 19.2 | 12.5 | 6.1 | 22.9 | 8.7 | 4.9 | 13.0 | 8.4 | 2.8 | 10.5 | 5.5 | 2.0 | 2.5 | 6.5 | 0.5 |
| SC | 15.9 | 9.8 | 4.8 | 24.7 | 10.3 | 8.5 | 13.6 | 9.0 | 2.2 | 11.1 | 3.8 | 1.3 | 2.2 | 6.3 | 0.4 |
| NMF | 18.0 | 9.6 | 4.6 | 19.0 | 8.1 | 3.4 | 11.8 | 7.9 | 2.6 | 11.8 | 4.4 | 1.6 | 2.9 | 7.2 | 0.5 |
| JULE | 27.7 | 18.2 | 16.4 | 27.2 | 19.2 | 13.8 | 13.7 | 10.3 | 3.3 | 13.8 | 5.4 | 2.8 | 3.3 | 10.2 | 0.6 |
| SAE | 32.0 | 25.2 | 16.1 | 29.7 | 24.7 | 15.6 | 15.7 | 10.9 | 4.4 | - | - | - | - | - | - |
| DAE | 30.2 | 22.4 | 15.2 | 29.7 | 25.1 | 16.3 | 15.1 | 11.1 | 4.6 | 19.0 | 10.4 | 7.8 | 3.9 | 12.7 | 0.7 |
| AE | 30.3 | 25.0 | 16.1 | 31.4 | 23.4 | 16.9 | 16.5 | 10.0 | 4.7 | 18.5 | 10.4 | 7.3 | 4.1 | 13.1 | 0.7 |
| VAE | 28.2 | 20.0 | 14.6 | 29.1 | 24.5 | 16.7 | 15.2 | 10.8 | 4.0 | 17.9 | 10.7 | 7.9 | 3.6 | 11.3 | 0.6 |
| DEC | 35.9 | 27.6 | 18.6 | 30.1 | 25.7 | 16.1 | 18.5 | 13.6 | 5.0 | 19.5 | 12.2 | 7.9 | 3.7 | 11.5 | 0.7 |
| ADC | 53.0 | - | - | 32.5 | - | - | 16.0 | - | - | - | - | - | - | - | - |
| DC | 33.4 | - | - | 37.4 | - | - | 18.9 | - | - | - | - | - | - | - | - |
| DAC | 47.0 | 36.6 | 25.6 | 52.2 | 40.0 | 30.1 | 23.8 | 18.5 | 8.8 | 27.5 | 21.9 | 11.1 | 6.6 | 19.0 | 1.7 |
| DDC | 48.9 | 37.1 | 26.7 | 52.4 | 42.4 | 32.9 | - | - | - | - | - | - | - | - | - |
| DCCM | 48.2 | 37.6 | 26.2 | 62.3 | 49.6 | 40.8 | 32.7 | 28.5 | 17.3 | 38.3 | 32.1 | 18.2 | 10.8 | 22.4 | 3.8 |
| IIC | 59.6 | 49.6 | 39.7 | 61.7 | 51.1 | 41.1 | 25.7 | 22.5 | 11.7 | - | - | - | - | - | - |
| PICA | 71.3 | 61.1 | 53.1 | 69.6 | 59.1 | 51.2 | 33.7 | 31.0 | 17.1 | 35.2 | 35.2 | 20.1 | 9.8 | 27.7 | 4.0 |
| GCC | 78.8 | 68.4 | 63.1 | 85.6 | 76.4 | 72.8 | 47.2 | 47.2 | 30.5 | 52.6 | 49.0 | 36.2 | 13.8 | 34.7 | 7.5 |
| CC | 85.0 | 76.4 | 72.6 | 79.0 | 70.5 | 63.7 | 42.9 | 43.1 | 26.6 | 42.9 | 44.5 | 27.4 | 14.0 | 34.0 | 7.1 |
| TCL | 86.8 | 79.9 | 75.7 | 88.7 | 81.9 | 78.0 | 53.1 | 52.9 | 35.7 | 64.4 | 62.3 | 51.6 | - | - | - |
| SCAN | 75.5±2.0 | 65.4±1.2 | 59.0±1.6 | 81.8±0.3 | 71.2±0.4 | 66.5±0.4 | 42.2±3.0 | 44.1±1.0 | 26.7±1.3 | 55.6±1.5 | 58.7±1.3 | 42.8±1.3 | 41.1±0.5 | 69.4±0.3 | 32.7±0.4 |
| SCAN | 76.7±1.9 | 68.0±1.2 | 61.6±1.8 | 87.6±0.4 | 78.7±0.5 | 75.8±0.7 | 45.9±2.7 | 46.8±1.3 | 30.1±2.1 | 59.2±0.2 | 60.8±0.4 | 45.3±0.4 | - | - | - |
| SCAN | 80.9 | 69.8 | 64.6 | 88.3 | 79.7 | 77.2 | 50.7 | 48.6 | 33.3 | 59.3 | 61.2 | 45.7 | 42.0 | 69.8 | 33.2 |
| NNM | 76.8±1.2 | 66.3±1.3 | 59.6±1.5 | 83.7±0.3 | 73.7±0.5 | 69.4±0.6 | 45.9±0.2 | 48.0±0.4 | 30.2±0.4 | 58.6±1.5 | 60.4±0.5 | 44.9±0.2 | 37.8±0.1 | 66.3±0.1 | 27.1±0.1 |
| SIC1 | 95.5±0.1 | 92.7±0.2 | 91.1±0.2 | 78.3±0.1 | 74.3±0.1 | 66.9±0.1 | 51.3±0.1 | 53.9±0.1 | 36.8±0.1 | 59.0±0.2 | 57.7±1.8 | 41.1±3.2 | 55.7±0.8 | 77.4±0.1 | 44.9±0.6 |
| SIC2 | 96.7±0.1 | 93.7±0.1 | 93.2±0.1 | 91.8±0.1 | 83.4±0.1 | 83.1±0.1 | 54.0±0.1 | 54.4±0.4 | 38.6±0.4 | 61.8±1.1 | 63.9±1.9 | 49.8±1.4 | 61.0±0.2 | 80.4±0.1 | 51.2±0.2 |
| SIC3 | 98.1±0.1 | 95.3±0.1 | 95.9±0.1 | 92.6±0.1 | 84.7±0.1 | 84.4±0.1 | 58.3±0.1 | 59.3±0.1 | 43.9±0.1 | 69.7±1.1 | 69.0±1.6 | 55.8±1.5 | 60.2±0.3 | 79.4±0.1 | 49.4±0.2 |
| SIC | 98.1 | 95.4 | 95.9 | 92.67 | 84.8 | 84.6 | 58.4 | 59.3 | 44.0 | 71.3 | 71.8 | 58.6 | 61.2 | 80.5 | 51.4 |
| MCA | 98.1±0.1 | 95.5±0.1 | 96.0±0.1 | 92.7±0.2 | 84.9±0.2 | 84.6±0.2 | 59.7±0.9 | 59.8±0.5 | 44.0±0.9 | 74.9±2.5 | 73.3±1.5 | 61.6±2.5 | 61.2±0.5 | 79.7±0.7 | 51.9±0.8 |
| MCA | **98.2** | **95.5** | **96.0** | **92.8** | **85.0** | **84.9** | **61.2** | **60.6** | **45.5** | **77.9** | **75.1** | **64.3** | **61.9** | **81.1** | **52.3** |

Results. The comparison with state-of-the-art methods in terms of ACC, NMI, and ARI is presented in Table 1. Our method outperforms all other methods on the five benchmark datasets. In particular, MCA improves ACC, NMI, and ARI by 2.8%, 1.3%, and 1.5% on Cifar100-20 and by 6.6%, 3.3%, and 5.7% on ImageNet-Dogs, demonstrating that MCA better corrects the incorrect alignments and thus achieves significant performance improvements. The three alignment strategies provide useful clustering support, especially for fine-grained datasets such as ImageNet-Dogs, whose categories are difficult to distinguish.

Ablation Studies

Semantic space construction. We first conduct an experiment on Cifar10 and ImageNet-Dogs to verify the effectiveness of our semantic space construction method; the results are shown in Table 2. Hierarchy-based filtering significantly enhances the semantic space compared to uniqueness-based filtering alone, especially for complex clusters that are challenging to distinguish, such as those in ImageNet-Dogs.

Table 2: Ablation studies of semantic space construction (UF: uniqueness-based filtering, HF: hierarchy-based filtering).

| Steps | Cifar10 | ImageNet-Dogs |
|-------|---------|---------------|
| UF | 91.4 | 65.3 |
| UF+HF | 91.9 | 71.8 |

Loss component effectiveness. We perform an ablation analysis on ImageNet-Dogs to measure the importance of the four loss components in our model, i.e., the image consistency loss LI, the instance-level alignment loss Lia, the prototype-level alignment loss Lpa, and the semantic-level alignment loss Lsa. The results are shown in Table 3; each of the four components plays an important role. Integrating the three cross-modal alignment strategies significantly enhances clustering performance, yielding gains of 28.1%, 23.8%, and 29.3% in ACC, NMI, and ARI when all three strategies are used together. Among the three cross-modal alignment strategies, introducing semantic-level alignment yields the most significant improvement, indicating that operating at the semantic level can effectively address incorrect alignments. These results confirm the effectiveness of our proposed cross-modal alignment methods.

Table 3: Ablation studies on ImageNet-Dogs over combinations of LI, Lia, Lpa, and Lsa (ACC / NMI / ARI): 46.7±0.5 / 48.9±0.7 / 32.6±0.5 (LI only), 66.0±3.4 / 67.7±2.5 / 50.7±4.1, 63.5±2.5 / 69.7±2.1 / 52.7±3.5, 53.1±3.8 / 55.4±2.8 / 37.4±2.4, and 74.8±2.7 / 72.7±2.0 / 61.9±2.5 (all four losses).

Comparison of three pseudo-label generation methods. In our method, semantic-level cross-modal alignment can be considered self-training with cross-modal pseudo-labels. To verify its effectiveness, we compared three pseudo-label generation methods implemented in our framework: 1) Single-Modal Pseudo-labeling (SMP), which directly generates one-hot pseudo-labels via the argmax operation on the soft cluster assignments of the images only; 2) Prototype-Mapping-based Cross-modal Pseudo-labeling (PMCP, the adjusted center-based method in (Cai et al. 2023)), which generates pseudo-labels from prototype-level alignments in the original CLIP space; and 3) ours, which generates pseudo-labels by simultaneously learning the relationships among images and neighboring texts while also updating the image and text embeddings. The comparison results on Cifar100-20 and ImageNet-Dogs are shown in Figure 3; our method significantly outperforms the other two methods on both datasets, indicating that learning adaptive relationships at the semantic level substantially improves the quality of the pseudo-labels.

Figure 3: The average accuracy (over 10 runs) of the pseudo-labels as training proceeds on (a) Cifar100-20 and (b) ImageNet-Dogs, comparing SMP, PMCP, and our method.

Figure 4: Example of pseudo-label generation in MCA. The words below (on the right side of) the images are ground-truth/neighboring labels, and red indicates irrelevant texts. The blue block in the semantic probability indicates the class to which the left word is assigned (with the largest probability).

Sensitivity Analysis

Figure 4 shows an example of our proposed pseudo-label generation method. We select an example from each of three classes and three neighboring words for each image. Although there exist irrelevant neighboring words for an image, our method can identify them and eliminate the effect of incorrect alignments in CLIP.

Sensitivity to the neighborhood parameters k_S and k_p in cross-modal alignment. In our alignment method, k_S controls the number of neighboring texts in the instance-level and semantic-level alignments, and k_p controls the number of neighboring texts used to recompute the semantic prototypes in the prototype-level alignment. Figures 5a and 5b show that a too-large k_S causes performance degradation due to the introduction of irrelevant texts. Figures 5c and 5d show that k_p does not change the performance much.

Figure 5: Sensitivity analysis of k_S and k_p: (a) k_S on Cifar10, (b) k_S on ImageNet-Dogs, (c) k_p on Cifar10, (d) k_p on ImageNet-Dogs.

Sensitivity to the trade-off parameters η, λa, λpa, and λsa. Figure 6 shows the sensitivity analysis of the trade-off parameters η, λa, λpa, and λsa. The performance of our method improves with increasing values of η, λpa, and λsa. Notably, our method appears to be more sensitive to changes in λsa than in λpa on ImageNet-Dogs.

Figure 6: Sensitivity analysis of the trade-off parameters on ImageNet-Dogs: (a) (η, λa), (b) (η, λpa), (c) (η, λsa), (d) (λpa, λsa).

Comparison with Zero-shot Learning

We compared MCA with two zero-shot learning methods, CLIP and MUST. MUST improves over CLIP by 3.3% on Caltech101 and 18% on UCF101, while our method underperforms MUST by 8.2% on Caltech101 and 8.5% on UCF101. Although these results show that our method performs worse than CLIP and MUST, our method has the advantage of not requiring class names as input, which is necessary for zero-shot learning. This makes our approach more versatile and applicable to a broader range of real-world scenarios.

Conclusion

We have proposed a novel method to address the incorrect alignments in CLIP for image clustering. Our method constructs a proper semantic space and aligns images and texts for downstream tasks at three levels. Theoretical results have shown interesting insights, and experimental results have demonstrated the superiority of our method. However, we acknowledge that our method may not be as cost-effective as SIC, as it involves three types of alignment, and the proper setting of the hierarchy levels remains a challenge that needs further investigation. Additionally, our method exhibits lower performance than MUST, primarily due to the absence of class names. For future work, we will focus on enhancing the performance of our method and exploring new avenues for improvement; leveraging the theoretical results to augment our method holds promise and will be a key research direction.
Acknowledgments

This work is jointly supported by the Major Project of the New Generation of Artificial Intelligence under Grant No. 2018AAA0102900; in part by NSFC under Grants No. 92270122 and No. 62206179; in part by the Guangdong Provincial Natural Science Foundation under Grants No. 2023A1515012584 and No. 2022A1515010129; in part by the Shenzhen Research Foundation for Basic Research, China, under Grant JCYJ20210324093000002; and in part by the University Stability Support Program of Shenzhen under Grant No. 20220811121315001.

References

Bengio, Y.; Lamblin, P.; Popovici, D.; and Larochelle, H. 2006. Greedy layer-wise training of deep networks. In Proceedings of Neural Information Processing Systems, 153–160.
Cai, D.; He, X.; Wang, X.; Bao, H.; and Han, J. 2009. Locality Preserving Nonnegative Matrix Factorization. In Proceedings of IJCAI 2009, 1010–1015.
Cai, S.; Qiu, L.; Chen, X.; Zhang, Q.; and Chen, L. 2023. Semantic-Enhanced Image Clustering. In Proceedings of the AAAI Conference on Artificial Intelligence, 6869–6878.
Caron, M.; Bojanowski, P.; Joulin, A.; and Douze, M. 2018. Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), 132–149.
Chang, J.; Guo, Y.; Wang, L.; Meng, G.; Xiang, S.; and Pan, C. 2019. Deep Discriminative Clustering Analysis. arXiv preprint arXiv:1905.01681.
Chang, J.; Wang, L.; Meng, G.; Xiang, S.; and Pan, C. 2017a. Deep Adaptive Image Clustering. In Proceedings of the IEEE International Conference on Computer Vision, 5880–5888.
Chang, J.; Wang, L.; Meng, G.; Xiang, S.; and Pan, C. 2017b. Deep Adaptive Image Clustering. In Proceedings of ICCV 2017, 5880–5888.
Chen, Y.-C.; Li, L.; Yu, L.; El Kholy, A.; Ahmed, F.; Gan, Z.; Cheng, Y.; and Liu, J. 2020. UNITER: Universal image-text representation learning. In European Conference on Computer Vision, 104–120. Springer.
Coates, A.; Ng, A.; and Lee, H. 2011. An Analysis of Single-Layer Networks in Unsupervised Feature Learning. In Proceedings of AISTATS 2011, volume 15, 215–223.
Dang, Z.; Deng, C.; Yang, X.; Wei, K.; and Huang, H. 2021. Nearest Neighbor Matching for Deep Clustering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13693–13702.
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
Gao, Q.; Lian, H.; Wang, Q.; and Sun, G. 2020. Cross-modal subspace clustering via deep canonical correlation analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, 3938–3945.
Haeusser, P.; Plapp, J.; Golkov, V.; Aljalbout, E.; and Cremers, D. 2018. Associative deep clustering: Training a classification network with no labels. In Proceedings of GCPR 2018, 18–32.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
Huang, J.; Gong, S.; and Zhu, X. 2020. Deep Semantic Clustering by Partition Confidence Maximisation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8846–8855.
Hubert, L.; and Arabie, P. 1985. Comparing partitions. Journal of Classification, 2(1): 193–218.
Ji, X.; Vedaldi, A.; and Henriques, J. F. 2019. Invariant Information Clustering for Unsupervised Image Classification and Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 9864–9873.
Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.-T.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.-H.; Li, Z.; and Duerig, T. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, 4904–4916. PMLR.
Kingma, D. P.; and Welling, M. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
Krizhevsky, A. 2009. Learning Multiple Layers of Features from Tiny Images. Master's thesis, University of Toronto.
Le, Y.; and Yang, X. 2015. Tiny ImageNet visual recognition challenge. CS 231N, 7(7): 3.
Li, J.; Savarese, S.; and Hoi, S. C. 2022. Masked unsupervised self-training for zero-shot image classification. arXiv preprint arXiv:2206.02967.
Li, L. H.; Yatskar, M.; Yin, D.; Hsieh, C.-J.; and Chang, K.-W. 2019. VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557.
Li, W.; Gao, C.; Niu, G.; Xiao, X.; Liu, H.; Liu, J.; Wu, H.; and Wang, H. 2020. UNIMO: Towards unified-modal understanding and generation via cross-modal contrastive learning. arXiv preprint arXiv:2012.15409.
Li, Y.; Hu, P.; Liu, Z.; Peng, D.; Zhou, J. T.; and Peng, X. 2021. Contrastive clustering. In Proceedings of the AAAI Conference on Artificial Intelligence, 8547–8555.
Li, Y.; Yang, M.; Peng, D.; Li, T.; Huang, J.; and Peng, X. 2022. Twin contrastive learning for online clustering. International Journal of Computer Vision, 130(9): 2205–2221.
MacQueen, J. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 281–297.
Mao, Y.; Yan, X.; Guo, Q.; and Ye, Y. 2021. Deep mutual information maximin for cross-modal clustering. In Proceedings of the AAAI Conference on Artificial Intelligence, 8893–8901.
McDaid, A. F.; Greene, D.; and Hurley, N. 2011. Normalized mutual information to evaluate overlapping community finding algorithms. arXiv preprint arXiv:1110.2515.
Miller, G. A. 1995. WordNet: a lexical database for English. Communications of the ACM, 38(11): 39–41.
Ng, A.; et al. 2011. Sparse autoencoder. CS294A Lecture Notes, 72(2011): 1–19.
Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 8748–8763. PMLR.
Ramesh, A.; Pavlov, M.; Goh, G.; Gray, S.; Voss, C.; Radford, A.; Chen, M.; and Sutskever, I. 2021. Zero-shot text-to-image generation. In International Conference on Machine Learning, 8821–8831. PMLR.
Shaham, U.; and Stanton, K. 2018. SpectralNet: Spectral Clustering using Deep Neural Networks. In Proceedings of the International Conference on Learning Representations, 1–20.
Tian, K.; Zhou, S.; and Guan, J. 2017. DeepCluster: A General Clustering Framework Based on Deep Learning. In Proceedings of ECML PKDD 2017, 809–825.
Van Gansbeke, W.; Vandenhende, S.; Georgoulis, S.; Proesmans, M.; and Van Gool, L. 2020. SCAN: Learning to Classify Images Without Labels. In Proceedings of the European Conference on Computer Vision, 268–285.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.
Vincent, P.; Larochelle, H.; Lajoie, I.; Bengio, Y.; and Manzagol, P.-A. 2010. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. JMLR, 11(12).
Wu, J.; Long, K.; Wang, F.; Qian, C.; Li, C.; Lin, Z.; and Zha, H. 2019. Deep Comprehensive Correlation Mining for Image Clustering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 8149–8158.
Xie, J.; Girshick, R.; and Farhadi, A. 2016. Unsupervised Deep Embedding for Clustering Analysis. In Proceedings of the International Conference on Machine Learning, 478–487.
Yang, B.; Fu, X.; Sidiropoulos, N. D.; and Hong, M. 2017. Towards K-means-friendly Spaces: Simultaneous Deep Learning and Clustering. In Proceedings of the International Conference on Machine Learning, volume 70, 3861–3870.
Yang, J.; Parikh, D.; and Batra, D. 2016. Joint Unsupervised Learning of Deep Representations and Image Clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5147–5156.
Zelnik-Manor, L. 2005. Self-tuning spectral clustering. In Proceedings of NIPS 2005, volume 17, 1601–1608.
Zhong, H.; Wu, J.; Chen, C.; Huang, J.; Deng, M.; Nie, L.; Lin, Z.; and Hua, X.-S. 2021. Graph contrastive clustering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 9224–9233.