# Personalized Clustering via Targeted Representation Learning

Xiwen Geng1,2, Suyun Zhao1,2*, Yixin Yu3, Borui Peng3, Pan Du1,2, Hong Chen1,2, Cuiping Li1,2, Mengdie Wang1,2

1Key Lab of Data Engineering and Knowledge Engineering of MOE, Renmin University of China
2School of Information, Renmin University of China
3School of Statistics, Renmin University of China
{xiwen, zhaosuyun, yuyixin3642, pengborui, chong, licuiping}@ruc.edu.cn, {du pan, wangmd ruc}@163.com

Clustering traditionally aims to reveal a natural grouping structure within unlabeled data. However, this structure may not always align with users' preferences. In this paper, we propose a personalized clustering method that explicitly performs targeted representation learning by interacting with users via a modicum of task information (e.g., must-link or cannot-link pairs) to guide the clustering direction. We query users with the most informative pairs, i.e., those hardest to cluster and those easiest to mis-cluster, to facilitate representation learning in terms of the clustering preference. Moreover, by exploiting an attention mechanism, the targeted representation is learned and augmented. By leveraging the targeted representation together with a constrained contrastive loss, personalized clustering is obtained. Theoretically, we verify that the risk of personalized clustering is tightly bounded, guaranteeing that active queries to users do mitigate the clustering risk. Experimentally, extensive results show that our method performs well across different clustering tasks and datasets, even when only a limited number of queries is available.

1 Introduction

Deep clustering refers to unsupervised machine learning techniques that leverage deep learning to reveal the grouping structures hidden in data samples.
Unlike traditional clustering methods, deep clustering aims to learn latent representations of the data that facilitate clustering in terms of its underlying patterns (Xie, Girshick, and Farhadi 2016; Caron et al. 2018; Li et al. 2021; Zhong et al. 2021; Liu et al. 2022; Li et al. 2022, 2023). Most existing methods conduct clustering with the sole goal of maximizing clustering performance, ignoring the personalized demands of clustering tasks. In real scenarios, however, users may wish to cluster unlabeled data according to their own preferences, such as distinct objectives (animals, architectures, characters, etc.), giving rise to personalized clustering demands. For example, Fig. 1 depicts the diverse criteria of clustering. There are multiple objects in each image, such as animals and architectures. Some users require clustering in terms of animals, so the two images with dogs should be grouped into the same cluster, classified as dogs. In contrast, others may prefer to group images according to architectures; thus, the two images with the Eiffel Tower should be clustered together. Under such personalized demands, many existing clustering techniques may degrade or even become unworkable without user guidance. Therefore, clustering along a desired orientation remains a challenging problem. A candidate solution to this issue is constrained clustering, which incorporates prior knowledge through constraints to guide the neural network toward a desired cluster orientation (Wagstaff et al. 2001).

*Corresponding authors.
Copyright © 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Figure 1: The diversity of cluster orientation. Different tasks have different orientations for feature learning and image clustering.
The Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25)

The most commonly used constraint is the pairwise constraint, which indicates whether a pair of samples belongs to the same cluster (must-link) or to distinct clusters (cannot-link). However, these methods use constraints to push models toward superior clustering performance rather than toward a desired configuration. Randomly selected constraint pairs (Manduchi et al. 2021; Sun et al. 2022) may even lead to inferior clustering results in personalized scenarios, as demonstrated in the experiments in Section 4. Accordingly, actively querying informative sample pairs may facilitate personalized clustering. Unfortunately, the active query strategies proposed so far (Sun et al. 2022; Pei, Liu, and Fern 2015; Biswas and Jacobs 2014) tend to pick samples beneficial for the default cluster orientation (Sun et al. 2022). Typically, their learned representations do not align with the desired cluster orientation, resulting in poor performance on personalized clustering (Angell et al. 2022). To narrow this gap, we propose a personalized clustering model via targeted representation learning (PCL), which selects the most valuable sample pairs to learn targeted representations, thereby achieving clustering in the desired orientation. Specifically, we propose an active query strategy that picks the pairs hardest to cluster and those easiest to mis-cluster, to facilitate representation learning in terms of the clustering task. Simultaneously, a constraint loss is designed to control targeted representation learning in terms of the cluster orientation. By exploiting an attention mechanism, the targeted representation is augmented. Extensive experiments demonstrate that our method is effective in performing personalized clustering tasks. Furthermore, by strict mathematical reasoning, we verify the effectiveness of the proposed PCL method.
To summarize, our main contributions are as follows:

- We propose a novel model, called PCL, for the personalized clustering task, which leverages active querying to control targeted representation learning.
- A theoretical analysis of clustering is conducted to verify the effectiveness of active querying in constrained clustering. A tight upper bound on the generalization risk of PCL is also given.
- Extensive experiments show that our model outperforms five unsupervised and four semi-supervised clustering approaches on three image datasets. Our active query strategy also performs well compared to other strategies.

The remainder of this paper is organized as follows. Related work is reviewed in Section 2. We propose our method in Section 3. Section 4 evaluates the proposed approach experimentally. We conclude in Section 5.

2 Related Work

2.1 Deep Clustering

Deep neural networks have been explored to enhance clustering performance due to their powerful representation learning ability (Vincent et al. 2010; Kingma and Welling 2014). A notable method is DEC (Xie, Girshick, and Farhadi 2016), which optimizes cluster centers and features simultaneously by minimizing the KL-divergence in the latent subspace. Similarly, IDFD (Tao, Takagi, and Nakata 2021) improves data similarity learning through sample discrimination and then reduces redundant correlations among features via feature decorrelation. Recently, more and more methods obtain clustering representations by leveraging contrastive learning techniques, thereby facilitating deep clustering. For instance, DCCM (Wu et al. 2019) mines various kinds of sample relations with contrastive learning and then trains its network using the highly confident information. PICA (Huang, Gong, and Zhu 2020) minimizes the cosine similarity between cluster-wise assignment vectors to achieve the most semantically plausible clustering solution. Moreover, CC (Li et al.
2021) proposes a dual contrastive learning framework aimed at achieving deep clustering representations. GCC (Zhong et al. 2021) selects positive and negative pairs using a KNN graph constructed on sample representations. Meanwhile, SPICE (Niu, Shan, and Wang 2022) divides the clustering network into a feature model and a clustering head, splitting the training process into three stages. Though these deep clustering methods perform well in the default orientation, they work poorly on personalized tasks.

2.2 Constrained Clustering

Constrained clustering refers to clustering that incorporates additional information into the clustering process, e.g., pairwise (must-link and cannot-link) constraints (Chien, Zhou, and Li 2019). Cop-Kmeans (Wagstaff et al. 2001) was the first to introduce pairwise constraints into the traditional K-means algorithm to enhance clustering performance, and many subsequent works followed (Basu, Banerjee, and Mooney 2002; Bilenko, Basu, and Mooney 2004; Zheng and Li 2011; Yang et al. 2012, 2013). In recent years, constrained clustering has been combined with deep neural networks. For instance, by introducing pairwise constraints, SDEC (Ren et al. 2019) extends DEC (Xie, Girshick, and Farhadi 2016), and DCC (Zhang, Basu, and Davidson 2019) extends IDEC (Guo et al. 2017). Bai et al. explored ways to improve clustering quality when constraints come from different origins (Bai, Liang, and Cao 2020). DC-GMM (Manduchi et al. 2021) explicitly integrates domain knowledge in probabilistic form to guide the clustering algorithm toward a data distribution that aligns with prior clustering preferences. CDAC+ (Lin, Xu, and Zhang 2019) constructs pairwise constraints to transfer relations among data and discover new intents (clusters) in dialogue systems. Pairwise constraints are thus widely used to direct clustering models.
Some constrained clustering methods also integrate active learning (Ren et al. 2022), which aims to query the most informative samples to reduce labeling costs while maintaining performance (Ashtiani, Kushagra, and Ben-David 2016). COBRA (van Craenendonck, Dumancic, and Blockeel 2018) merges small clusters produced by K-means into larger ones based on pairwise constraints, maximally exploiting constraint transitivity and entailment. ADC (Sun et al. 2022) builds a contrastive loss from pairs of constraints and integrates deep representation learning, clustering, and data selection into a unified framework. Although these methods cluster according to prior preferences, they do not consider the diverse criteria of clustering but focus on improving performance or reducing the query budget. Dasgupta and Ng proposed an active spectral clustering algorithm that specifies the dimension along the sentiment embedded in the data points, which is similar to our problem of diverse image clustering (Dasgupta and Ng 2010). However, their method relies on label constraints and is designed for natural language processing.

Figure 2: The framework of PCL. A deep neural network creates representations from two random augmentations of the data. By assessing the position of images in the feature space, PCL selects the most informative sample pairs to guide the training of the model. The model is then retrained by querying whether these pairs are must-link or cannot-link. This approach allows the final model to concentrate on features relevant to the desired cluster orientation, resulting in accurate clustering outcomes.

3.1 Personalized Clustering

We propose a novel Personalized Clustering model (PCL) that leverages active learning to guide the model's clustering in a specified orientation. Using a designed pairwise scoring function within a query strategy, certain image pairs are selected for annotation.
Oracles then judge whether these image pairs belong to the same class, responding with YES or NO. The responses form must-link or cannot-link constraints, which are incorporated into a constrained contrastive loss that adjusts the network parameters through backpropagation. By integrating the cross-instance attention module in this way, the model can capture features that align more closely with the desired cluster orientation (Gan et al. 2023). The framework of PCL is illustrated in Figure 2.

Given a mini-batch of images $X = \{x_1, \ldots, x_N\}$, where $N$ is the batch size, we apply two stochastic data transformations, $T^a$ and $T^b$, from the same family of augmentations $\mathcal{T}$, resulting in augmented data $\{x_1, \ldots, x_N, x_{N+1}, \ldots, x_{2N}\}$. This process constructs the initial positive pairs $(x_i, x_{i+N})_{i=1}^{N}$. The network then extracts preliminary features from the augmented data and refines them using the cross-instance attention module. It projects these representations into the feature space, yielding target features $\{z_1, \ldots, z_{2N}\}$, and into the cluster space, yielding assignment probabilities $\{p_1, \ldots, p_{2N}\}$. These are collected in matrices $Z \in \mathbb{R}^{2N \times M}$ and $P \in \mathbb{R}^{2N \times K}$, where $M$ is the feature dimension and $K$ is the number of clusters. Next, we select the most informative pairs based on the scoring function, submit queries to the oracle, and record the feedback in the indicator matrices $\mathbf{1}^+$ and $\mathbf{1}^-$, where $\mathbf{1}^+_{ij} = 1$ if $y_i = y_j$ (must-link) and $\mathbf{1}^-_{ij} = 1$ if $y_i \neq y_j$ (cannot-link). Meanwhile, for the initial positive pairs, we use the indicator matrix $\mathbf{1}$ to mark pairs constructed by data augmentation: $\mathbf{1}_{ij} = 1$ if $(x_i, x_j)$ is constructed by augmentation. These constraints guide the clustering process toward the desired orientation via the constrained contrastive loss.
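The indicator bookkeeping described above can be sketched as follows. This is a minimal illustration under our own assumptions: the function name, the `oracle` callback, and the list of queried pairs are illustrative conventions, not the paper's interface.

```python
import numpy as np

def build_indicator_matrices(n, queried_pairs, oracle):
    """Build the augmentation / must-link / cannot-link indicators.

    n             : batch size N (the batch holds 2N augmented views)
    queried_pairs : list of (i, j) index pairs submitted to the oracle
    oracle        : callable returning True for must-link, False for cannot-link
    """
    m = 2 * n
    aug = np.zeros((m, m), dtype=int)     # 1_ij: positive pairs from augmentation
    must = np.zeros((m, m), dtype=int)    # 1+_ij: oracle answered "same cluster"
    cannot = np.zeros((m, m), dtype=int)  # 1-_ij: oracle answered "different clusters"

    # Each sample x_i and its second view x_{i+N} form an initial positive pair.
    for i in range(n):
        aug[i, i + n] = aug[i + n, i] = 1

    # Oracle feedback fills the symmetric constraint matrices.
    for i, j in queried_pairs:
        if oracle(i, j):
            must[i, j] = must[j, i] = 1
        else:
            cannot[i, j] = cannot[j, i] = 1
    return aug, must, cannot
```

All three matrices are symmetric over the $2N$ augmented samples, matching the pairwise form of the constrained contrastive loss below.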
In the following subsections, we first introduce the constrained contrastive loss (Subsection 3.2), followed by the pairwise query strategy (Subsection 3.3).

3.2 Constrained Contrastive Loss

To achieve effective and targeted clustering, we employ a constrained contrastive loss similar to InfoNCE (Oord, Li, and Vinyals 2018) to evaluate the loss of each sample. Different weights are assigned to positive pairs based on how they were constructed. We account for clustering effects in both the feature and cluster spaces, while also keeping category representations distinct in the cluster space. In the feature space, the loss function for a positive pair of examples $(i, j)$ is defined as

$$\ell(z_i, z_j, Z) = -\log \frac{\exp(s(z_i, z_j)/\tau)}{\mu_i \sum_{k=1}^{2N} \left(1 + \mathbf{1}^-_{ik}\right) \exp(s(z_i, z_k)/\tau)}, \qquad (1)$$

where $s(a, b) = ab^{\top}/(\|a\|\,\|b\|)$ measures pairwise similarity by cosine distance, $\mu_i = N/(N + \sum_{j=1}^{2N} \mathbf{1}^-_{ij})$ normalizes the range of the denominator, and $\tau$ is the temperature parameter. Positive pairs from data augmentation and from queries are weighted differently in the loss term. A symmetric matrix $W \in \mathbb{R}^{2N \times 2N}$ is constructed with elements

$$w_{ij} = \begin{cases} N^+, & \text{if } \mathbf{1}_{ij} = 1 \\ \lambda N, & \text{if } \mathbf{1}^+_{ij} = 1 \\ 0, & \text{otherwise,} \end{cases} \qquad (2)$$

where $\lambda$ is a hyperparameter and $N^+$ is the total number of positive pairs constructed by querying. The loss of sample $i$ is defined as

$$\ell_i = \frac{\sum_{j=1}^{2N} w_{ij}\left(\ell(z_i, z_j, Z) + \ell(p_i, p_j, P)\right)}{\sum_{k=1}^{2N} w_{ik}}. \qquad (3)$$

In the cluster space, it is crucial to ensure that each cluster remains distinct. Letting $C = P^{\top} \in \mathbb{R}^{K \times 2N}$, we obtain a set $\{c_1, \ldots, c_K, c_{K+1}, \ldots, c_{2K}\}$, where $c_i$ ($i \le K$) and $c_{i+K}$ represent the $i$-th cluster under augmentations $T^a(X)$ and $T^b(X)$, respectively.

Figure 3: (Left) A sample pair with uncertain similarity, highlighted as red nodes, is prioritized for querying. (Right) Some samples, due to incorrect feature extraction, are mistakenly grouped together despite belonging to different clusters, as shown by the red and yellow nodes. These can be identified by assessing the similarity between their positive samples.
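A minimal NumPy sketch may make the weighting in Eqs. (1)–(3) concrete. This is our own illustrative reading of the loss, not the authors' implementation; it assumes indicator matrices `aug` ($\mathbf{1}$), `must` ($\mathbf{1}^+$), and `neg` ($\mathbf{1}^-$) as described above, and that at least one must-link has been queried so that the weight matrix $W$ has nonzero rows.

```python
import numpy as np

def pairwise_loss(Z, neg, n, tau=0.5):
    """Eq. (1): l(z_i, z_j, Z) for every pair, with cannot-links (1-)
    double-counted in the denominator and mu_i normalizing its range."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    sim = np.exp(Zn @ Zn.T / tau)                 # exp(s(z_i, z_k)/tau)
    mu = n / (n + neg.sum(axis=1))                # mu_i
    denom = mu * ((1 + neg) * sim).sum(axis=1)    # mu_i * sum_k (1 + 1-_ik) exp(...)
    return -np.log(sim / denom[:, None])

def sample_losses(Z, P, aug, must, neg, n, lam=1.0, tau=0.5):
    """Eqs. (2)-(3): weight the positive-pair losses and average per sample."""
    n_plus = must.sum() // 2                      # N+: queried must-link pairs
    W = n_plus * aug + lam * n * must             # Eq. (2); zero elsewhere
    L = pairwise_loss(Z, neg, n, tau) + pairwise_loss(P, neg, n, tau)
    return (W * L).sum(axis=1) / W.sum(axis=1)    # Eq. (3): loss of each sample i
```

Only entries of $L$ where $W$ is nonzero (augmentation pairs and queried must-links) contribute, so cannot-links shape the loss solely through the denominator of Eq. (1).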
A loss function is defined to differentiate the $i$-th cluster from the others:

$$\hat{\ell}_i = \ell(c_i, c_{i+K}, C) + \ell(c_{i+K}, c_i, C). \qquad (4)$$

Our overall loss is defined as

$$L_{PC} = \frac{1}{2N} \sum_{i=1}^{2N} \ell_i + \frac{1}{2K} \sum_{i=1}^{K} \hat{\ell}_i + H(\hat{Y}), \qquad (5)$$

where $H(\hat{Y}) = \sum_{i=1}^{K} P(\hat{Y}_i) \log P(\hat{Y}_i)$ serves as a regularization term.

3.3 Query Strategy

Our model learns features conducive to clustering via the constrained contrastive loss, which makes the selection of informative sample pairs crucial for guiding cluster orientation. We propose a query strategy that considers both the uncertainty of sample pairs and hard negatives, as illustrated in Fig. 3. This strategy identifies critical sample pairs to refine cluster orientation and enhance performance. Since features extracted by untrained models are initially unreliable, random constraints are generated before training, and the model is pre-trained before the query strategy is applied. The initial aim is to establish a general cluster orientation; later, it becomes essential to rectify samples misclassified by the model and make minor adjustments to the cluster orientation. Therefore, at the outset of the task, the uncertainty of sample pairs takes precedence; as the task progresses, focusing on hard negatives becomes crucial.

Uncertainty of pairs. To establish the cluster orientation, we prioritize the most uncertain sample pairs. The cosine similarity of a sample pair ranges from $-1$ to $1$. Pairs with similarity near $1$ likely belong to the same cluster, whereas pairs with similarity near $-1$ are likely in different clusters. Pairs with similarity close to $0$ are uncertain for the model, offering both similar and distinct features that provide valuable insight into cluster orientation. The uncertainty score for a sample pair $(x_i, x_j)$ is defined as

$$S_{up}(x_i, x_j) = \sigma\!\left[-\left|s(z_i, z_j) - \epsilon\right|\right], \qquad (6)$$

where $z_i$ and $z_j$ are the features of samples $x_i$ and $x_j$ in the feature space, and the similarity is averaged over the two data augmentations.
$\sigma[\cdot]$ denotes the Min-Max normalization operator (Jain, Nandakumar, and Ross 2005), and $\epsilon$ is a small positive number, since must-link constraints are more informative than cannot-links (Sun et al. 2022).

Hard negatives. Hard negatives are samples with a significant discrepancy between the actual clustering and the model's current results. Although the actual clustering cannot be directly observed, it can be inferred from known positive and negative examples (Zhuang and Moulin 2023). For two samples $x_i$ and $x_j$, $P(x_i, x_j) = \sigma[s(x_i, x_j)]$ denotes the probability that the two samples lie in the same cluster, and

$$Q(x_i, x_j) = \sigma\!\left[\sum_{k=1}^{N} \left(\mathbf{1}^+_{jk}\, s(x_i, x_k) - \mathbf{1}^-_{jk}\, s(x_i, x_k)\right)\right]$$

denotes the probability calculated from the known pairs. Relative entropy is used to assess the difference between $P(x_i, x_j)$ and $Q(x_i, x_j)$, identifying samples that are challenging to classify:

$$S_{hp}(x_i, x_j) = \sigma\!\left[P(x_i, x_j) \log \frac{P(x_i, x_j)}{Q(x_i, x_j)}\right]. \qquad (7)$$

Query strategy. The strategy for querying the most informative sample pairs combines the scores defined above. The joint query score is given by

$$S(x_i, x_j) = r\, S_{up}(x_i, x_j) + (1 - r)\, S_{hp}(x_i, x_j), \qquad (8)$$

where $r$ is the ratio of the remaining query budget to the total budget, so the weight on hard-to-learn pairs increases as more queries are spent. As queries accumulate, the cluster orientation is refined and the strategy focuses on maximizing performance. The full querying and clustering algorithm of the model is detailed in the Appendix.

Code is available at https://github.com/hhdxwen/PCL.
Appendix is available at https://arxiv.org/abs/2412.13690.

3.4 Theoretical Analysis

We focus on the feature extractor $f \in \mathcal{F}: X \to Z$, which determines the final clustering results (Saunshi et al. 2019). The unsupervised loss of sample $i$ is defined as

$$\ell_i(f) = \frac{\sum_{j=1}^{2N} w_{ij}\, \ell(f(x_i), f(x_j), Z)}{\sum_{k=1}^{2N} w_{ik}}. \qquad (9)$$

Definition 3.1 (Unsupervised Loss). If we can query $q$ times, the overall unsupervised loss is defined as $\hat{L}_q(f) := \frac{1}{2N} \sum_{i=1}^{2N} \ell_i(f)$, q := X 1 i
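Looking back at the query strategy, the scoring of Eqs. (6)–(8) can be sketched in matrix form over a batch of pairwise similarities. The vectorized computation of $Q$, the small stabilizing constants, and all helper names are our assumptions, not the paper's code.

```python
import numpy as np

def minmax(x):
    """sigma[.]: Min-Max normalization to [0, 1]."""
    return (x - x.min()) / (x.max() - x.min() + 1e-12)

def query_scores(sim, must, cannot, r, eps=0.05):
    """Joint query score S(x_i, x_j) of Eq. (8) for all pairs in a batch.

    sim    : pairwise cosine similarities s(z_i, z_j) (n x n),
             averaged over the two augmentations
    must   : must-link indicator 1+, cannot : cannot-link indicator 1-
    r      : remaining query budget / total budget
    eps    : small positive offset (must-links are more informative)
    """
    # Eq. (6): uncertainty peaks where similarity is closest to eps.
    s_up = minmax(-np.abs(sim - eps))

    # Eq. (7): hard negatives via relative entropy between the model's
    # pair probability P and the constraint-implied probability Q.
    p = minmax(sim)
    q = minmax(sim @ (must - cannot).T)   # Q(x_i, x_j) from known pairs
    s_hp = minmax(p * np.log((p + 1e-12) / (q + 1e-12)))

    # Eq. (8): budget-dependent mixture of the two scores.
    return r * s_up + (1 - r) * s_hp
```

Early on, $r$ is close to 1 and the uncertainty term dominates; as the budget is spent, $r$ shrinks and the hard-negative term takes over, matching the schedule described in the text.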