# Distribution-Consistency-Guided Multi-modal Hashing

Jin-Yu Liu, Xian-Ling Mao*, Tian-Yi Che, Rong-Cheng Tu
School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
{liujinyu1229, turongcheng}@gmail.com, {maoxl, ccty}@bit.edu.cn

Multi-modal hashing methods have gained popularity due to their fast speed and low storage requirements. Among them, supervised methods demonstrate better performance than unsupervised ones by utilizing labels as supervisory signals. Currently, almost all supervised multi-modal hashing methods rest on a hidden assumption that training sets contain no noisy labels. However, in real-world scenarios, labels are often annotated incorrectly due to manual labeling, which greatly harms retrieval performance. To address this issue, we first discover a significant distribution consistency pattern through experiments, i.e., the 1-0 distribution of the presence or absence of each category in the label is consistent with the high-low distribution of similarity scores of the hash codes relative to the category centers. Then, inspired by this pattern, we propose a novel Distribution-Consistency-Guided Multi-modal Hashing (DCGMH) method, which aims to filter and reconstruct noisy labels to enhance retrieval performance.
Specifically, the proposed method first randomly initializes several category centers, each representing the centroid of its respective category's region, which are used to compute the high-low distribution of similarity scores; noisy and clean labels are then separately filtered out via the discovered distribution consistency pattern to mitigate the impact of noisy labels; subsequently, a correction strategy, designed indirectly via the distribution consistency pattern, is applied to the filtered noisy labels, correcting high-confidence ones while treating low-confidence ones as unlabeled for unsupervised learning, thereby further enhancing the model's performance. Extensive experiments on three widely used datasets demonstrate the superiority of the proposed method compared to state-of-the-art baselines in multi-modal retrieval tasks. Code: https://github.com/LiuJinyu1229/DCGMH

*Corresponding author. Copyright 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

## Introduction

With the rapid growth of multimedia data such as images, text, and videos, achieving effective retrieval from massive multi-modal data has become a significant challenge. To address this challenge, numerous information retrieval technologies have emerged (Jiang and Li 2017; Sung et al. 2018; Zhang, Lai, and Feng 2018; Chen et al. 2020b), with multi-modal hashing methods (Liu, He, and Lang 2014; Chen et al. 2020a; Zhu et al. 2021; Lu et al. 2021; Wu et al. 2022) gaining widespread attention for their fast retrieval speed and low storage requirements. Unlike uni-modal hashing (Tu et al. 2018, 2021c; Guo et al. 2022; Tu, Mao, and Wei 2020; Tu et al. 2021b) and cross-modal hashing (Zhang, Peng, and Yuan 2018; Tu et al. 2022, 2023a, 2021a), multi-modal hashing maps data points from different modalities into a unified Hamming space for fusion, resulting in binary hash codes that facilitate efficient multi-multi retrieval.
Compared to unsupervised methods (Song et al. 2013; Shen et al. 2015, 2018; Zheng et al. 2020b; Wu et al. 2021a), supervised multi-modal hashing methods (Yang, Shi, and Xu 2017; Xie et al. 2017; Yu et al. 2022; Zheng et al. 2024) generate more discriminative hash codes and achieve more accurate retrieval by utilizing labels as supervisory signals, under a hidden assumption that training sets have no noisy labels. However, in real-world scenarios, labels may be incorrectly annotated due to manual labeling, such as an image that should be labeled as "tiger" being mistakenly labeled as "cat", which limits the applicability of existing supervised hashing methods in noisy label scenarios. While some works (Sun et al. 2022; Yang et al. 2022) in image hashing and cross-modal hashing have demonstrated that the presence of noisy labels in the training set can lead to model overfitting, resulting in indistinguishable hash codes and inaccurate retrieval, no work has yet focused on and resolved this issue in the field of multi-modal hashing.

To effectively tackle this issue, it is crucial to filter out noisy labels from the dataset. Existing single- or cross-modal hashing methods have demonstrated that models initially learn effective hash mappings from clean labels but eventually overfit to noisy labels, resulting in degraded performance. This phenomenon suggests that even after a short training period, the generated hash codes are somewhat discriminative and tend to align with their corresponding category centers (Yuan et al. 2020; Tu et al. 2023b), as indicated by higher similarity scores. Thus, the 1-0 distribution of each category's presence or absence, i.e., the label vector, should be consistent with the high-low distribution of similarity scores between the hash codes and category centers.
Specifically, if an instance belongs to a category, the corresponding label bit should be 1, and the hash code should show a higher similarity score with the category center; otherwise, the bit should be 0, and the similarity score should be lower. Therefore, we hypothesize that this pattern of distribution consistency can effectively filter out noisy labels.

The Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25)

Figure 1: Box Plot comparison of average similarity scores for in-category and out-category across clean and noisy label datasets, where Out-Category(0) represents the box plot distribution of the average similarity scores of hash codes to all categories they do not belong to, while In-Category(1) represents the box plot distribution for the belonging categories, and the horizontal line within each box indicates the median of all average similarity scores.

To validate this hypothesis, we conduct a Box Plot statistical analysis comparing the average similarity scores between hash codes and their respective in-category and out-category centers. This analysis is performed on the MIR Flickr dataset, which contains a training set of 5,000 instances with a noisy label ratio of 40%. After training the model for 10 epochs, we separate the dataset into noisy and clean subsets and analyze the distribution of similarity scores. The results, shown in Figure 1, reveal that in the clean label dataset, the similarity scores of hash codes to their respective category centers are significantly higher compared to non-belonging centers. In contrast, in the noisy label dataset, this difference is less pronounced due to the misalignment between assigned categories and actual hash code semantics.
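The statistic behind Figure 1 is straightforward to reproduce: for each instance, average the cosine similarities of its (relaxed) hash code to the centers of the categories its label marks as present, and separately to the remaining centers. A minimal NumPy sketch (function and variable names are ours, for illustration; they are not from the paper's code):

```python
import numpy as np

def in_out_category_scores(B_hat, C, L):
    """Average cosine similarity of each (relaxed) hash code to the centers
    of its labeled categories (in-category) vs. the remaining centers
    (out-category).  B_hat: (n,k) codes, C: (m,k) centers, L: (n,m) 0/1 labels."""
    Bn = B_hat / np.linalg.norm(B_hat, axis=1, keepdims=True)
    Cn = C / np.linalg.norm(C, axis=1, keepdims=True)
    D = Bn @ Cn.T                                    # (n, m) similarity scores
    in_avg = (D * L).sum(axis=1) / np.maximum(L.sum(axis=1), 1)
    out_avg = (D * (1 - L)).sum(axis=1) / np.maximum((1 - L).sum(axis=1), 1)
    return in_avg, out_avg
```

On a clean subset, `in_avg` should sit well above `out_avg` per instance; on a noisy subset the gap narrows, which is exactly the separation the box plot visualizes.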
This observation confirms that there is a notable consistency between the 1-0 label distribution and the high-low similarity score distribution for clean labels, which is disrupted for noisy labels. Therefore, by exploiting these consistency differences, we can effectively filter out noisy labels, supporting our hypothesis. Consequently, inspired by the distribution consistency pattern and the hypothesis, we propose a novel Distribution-Consistency-Guided Multi-modal Hashing (DCGMH) method that uses the consistency between the 1-0 distribution of labels and the high-low distribution of similarity scores to filter and reutilize noisy labels, thereby preventing the model from overfitting to noisy labels and improving its retrieval performance in real-world scenarios. Specifically, the proposed method first randomly initializes several category centers to compute the similarity scores of the hash codes relative to each category center, with each center representing the central region of its corresponding category; then, based on the distribution consistency pattern, the noisy and clean labels are separately filtered out to mitigate the impact of noisy labels; subsequently, for the noisy label set, we design a reconstruction strategy via the distribution consistency pattern to correct high-confidence noisy labels, while low-confidence noisy label instances are treated as unlabeled instances to facilitate unsupervised learning, extracting semantic information and further enhancing the performance of the multi-modal hashing model.

In conclusion, the main contributions of the proposed method are as follows:

- We discover the consistency between the 1-0 distribution of the presence or absence of each category in the label and the high-low distribution of similarity scores of the hash codes relative to each category center, and validate it through Box Plot statistical analysis.
- We design a filter via the distribution consistency pattern to filter out noisy labels and improve the applicability of supervised multi-modal hashing methods in real-world scenarios. To the best of our knowledge, no similar work has been done.
- We design a corrector via the distribution consistency pattern to correct high-confidence noisy labels and utilize low-confidence ones for unsupervised learning.
- Extensive experiments on three benchmark datasets demonstrate that the proposed method outperforms state-of-the-art baselines in multi-modal retrieval tasks.

## Related Work

### Multi-modal Hashing

In multi-modal hashing, supervised methods utilize labels as supervisory information to help models learn richer semantic information, and can be categorized into shallow and deep approaches based on whether deep networks are utilized. Shallow supervised multi-modal hashing (Lu et al. 2019a,c; Zheng et al. 2019, 2022; An et al. 2022) often relies on linear mapping or matrix factorization to model the latent semantic associations between modalities. For instance, OMHDQ (Lu et al. 2019b) links hash code learning with both low-level data distribution and high-level semantic distribution based on paired semantic labels. SAPMH (Zheng et al. 2020a) employs paired semantic labels as parameter-free supervisory information to learn multi-dimensional latent representations. In contrast, deep supervised multi-modal hashing (Yan et al. 2020; Zhu et al. 2020; Shen et al. 2023; Lu et al. 2020; Tan et al. 2023) leverages deep networks to integrate feature extraction and hash code learning into a unified deep framework. For example, BSTH (Tan et al. 2022) introduces a bit-aware semantic transformation module to achieve fine-grained, concept-level alignment and fusion of multi-modal data, thereby generating high-quality hash codes. STBMH (Tu et al. 2024) addresses the issue of broken similarity transitivity in multi-label scenarios by designing an additional regularization term.
For almost all supervised multi-modal hashing methods, there is a hidden assumption that training sets have no noisy labels. However, in real-world scenarios, labels are often annotated incorrectly due to manual labeling, which greatly harms the model's performance. This issue necessitates effective strategies for noisy label learning to maintain model robustness and accuracy.

### Noisy Label Learning

Noisy label learning has been extensively studied in tasks such as image classification. Existing approaches to handling noisy labels can be broadly categorized into two types: noise-robust modeling (Song et al. 2020, 2022; Shu et al. 2019) and label-noise cleaning (Zheng et al. 2020c; Wu et al. 2021b; Kim et al. 2021; Wei et al. 2022). Noise-robust modeling involves directly training robust models on noisy labels using noise-specific loss functions or regularization terms; for instance, NCR (Iscen et al. 2022) designs a regularization loss term based on the consistency between an instance and its neighboring nodes in the feature space. In contrast, label-noise cleaning focuses on filtering or correcting noisy labels directly, exemplified by FCF (Jiang et al. 2024), which introduces a fusion cleaning framework that combines correction and filtering to address different types of noisy labels. In the context of hashing, there have been works addressing noisy labels in image hashing (Sun et al. 2022; Wang et al. 2023) and cross-modal hashing (Yang et al. 2022; Li et al. 2024) using loss-based methods. For instance, DIOR (Wang et al. 2023) uses the concept of behavior similarity between original and augmented views to filter noisy labels, while CMMQ (Yang et al. 2022) designs a proxy-based contrastive loss to mitigate the impact of noisy labels. However, no existing work addresses the issue of noisy labels in multi-modal hashing retrieval.
Based on the discovered distribution consistency pattern, we propose a novel approach called DCGMH to filter and reutilize noisy labels, improving the robustness of the hashing model. The specific details of this approach are discussed in the next section.

## The Proposed Method

In this section, we first describe the problem definition and then provide detailed explanations of our proposed method's architecture. Subsequently, we summarize the objective function and optimize the model's training process. Finally, we demonstrate the out-of-sample extension.

### Problem Definition

Similar to most existing multi-modal hashing methods, this work focuses on image-text datasets. Assume that there is a dataset $O$ containing $n$ instances, denoted as $O = \{o_i\}_{i=1}^n = \{x_i, y_i, l_i\}_{i=1}^n$, where $x_i$ and $y_i$ represent the text and image modal data points, respectively. Moreover, $l_i \in \{0, 1\}^m$ represents the label vector of the instance $o_i$, where $m$ is the total number of categories; when an instance $o_i$ belongs to category $j$, $l_{ij} = 1$; otherwise, $l_{ij} = 0$. Furthermore, the similarity matrix $S \in \{-1, 1\}^{n \times n}$ is employed to represent the similarity between instances, such that when instances $o_i$ and $o_j$ share at least one category, $s_{ij} = 1$, indicating they are similar; otherwise, $s_{ij} = -1$, indicating they are dissimilar. Additionally, the instances in dataset $O$ are ultimately mapped to hash codes $B \in \{-1, 1\}^{n \times k}$ in the Hamming space while preserving the original semantics, where $k$ represents the length of the hash codes.

### Architecture

**Multi-modal Hashing Network.** To generate high-quality hash codes, we use instance $o_i = \{x_i, y_i, l_i\}$ as the training data for the hashing network. Then, we employ a bag-of-words (BoW) model to extract the feature representation $f_i^x$ for the text modality and a VGG model without the final classification layer to extract the feature representation $f_i^y$ for the image modality.
Subsequently, modality-specific multi-layer perceptrons (MLPs) are utilized to map the feature representation $f_i^\ast$, $\ast \in \{x, y\}$, of each modality to a unified space, denoted as:

$$u_i^\ast = \mathrm{MLP}^\ast(f_i^\ast; \theta^\ast) \tag{1}$$

where $u_i^\ast$ is the resulting projected feature representation and $\theta^\ast$ denotes the learnable parameters of the MLP for the respective modality. Finally, the projected features from different modalities are fused by directly summing them, and this fused representation is then passed through a hashing function with a non-linear activation function ($\tanh(\cdot)$) to generate the fused hash code $\hat{b}_i$, denoted as:

$$\hat{b}_i = H(u_i^x + u_i^y; \theta) \tag{2}$$

where $H$ refers to the hash function and $\theta$ is a set of learnable parameters. Finally, the binary hash code $b_i$ for instance $o_i$ can be obtained as $b_i = \mathrm{sgn}(\hat{b}_i)$, where $\mathrm{sgn}(\cdot)$ is a function that maps positive values to 1 and negative values to -1. In summary, for instance $o_i$, its binary hash code can be formulated as $b_i = \mathrm{sgn}(F(x_i, y_i; P))$, where $P$ is a set of learnable parameters.

**Label Filter.** To filter noisy labels via our discovered distribution consistency pattern, we first randomly initialize several category centers $C \in \mathbb{R}^{m \times k}$, i.e., $C = \{c_j\}_{j=1}^m$, where $m$ is the number of categories and $k$ is the length of the hash code, ensuring that each category occupies a distinct and non-overlapping region. Then, for each instance's hash code $\hat{b}_i$, we calculate its similarity score $d_{ij}$ with each category center $c_j$, denoted as:

$$d_{ij} = \frac{\hat{b}_i^T c_j}{\|\hat{b}_i\| \|c_j\|} \tag{3}$$

By aggregating the similarity scores, we obtain the similarity score matrix $D \in \mathbb{R}^{n \times m}$, which captures the similarity of each hash code to each category center. In this matrix, when a hash code $\hat{b}_i$ belongs to a particular category $c_j$, the corresponding similarity score $d_{ij}$ tends to be high; otherwise, it is relatively low.
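The forward path just described (Eqs. 1–3) can be sketched compactly in NumPy. This is a simplified stand-in, assuming a single linear-plus-ReLU layer in place of the paper's modality-specific MLPs; all parameter and function names here are illustrative:

```python
import numpy as np

def project(f, W, b):
    """Stand-in for a modality-specific MLP (Eq. 1): one linear layer + ReLU."""
    return np.maximum(f @ W + b, 0.0)

def fused_hash(fx, fy, Wx, bx, Wy, by):
    """Eq. (2): sum the projected modality features and squash with tanh to
    get relaxed codes b_hat; np.sign(b_hat) would give the binary codes."""
    return np.tanh(project(fx, Wx, bx) + project(fy, Wy, by))

def center_scores(B_hat, C):
    """Eq. (3): cosine similarity of every relaxed code to every category
    center, giving the score matrix D of shape (n, m)."""
    Bn = B_hat / np.linalg.norm(B_hat, axis=1, keepdims=True)
    Cn = C / np.linalg.norm(C, axis=1, keepdims=True)
    return Bn @ Cn.T
```

The tanh keeps the relaxed codes in $(-1, 1)$ so that the subsequent sign function is a small quantization step rather than a large one.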
Next, based on the consistency pattern between the 1-0 distribution of labels and the high-low distribution of similarity scores, we calculate the consistency level $T = \{t_i\}_{i=1}^n$ between the label distribution $l_i$ and similarity distribution $d_i$. We then design a consistency-based criterion to filter the dataset $O$ into the clean label set $O_c$ and the noisy label set $O_n$, which can be formulated as:

$$t_i = \frac{\sum_{j=1}^m l_{ij} d_{ij}}{\sum_{j=1}^m l_{ij}} \tag{4}$$

$$O_c = \{(x_i, y_i, l_i) \mid t_i > \epsilon(\tau)\} \tag{5}$$

$$O_n = \{(x_i, y_i, l_i) \mid t_i \le \epsilon(\tau)\} \tag{6}$$

where $t_i$ is employed to measure the degree of consistency between the label distribution and the similarity distribution, with higher values indicating greater consistency, $\tau$ represents the noise ratio, and $\epsilon(\tau)$ denotes the filtering threshold, chosen so that $\tau n$ instances are identified as noisy label instances.

**Label Reconstructor.** To accurately learn the semantic knowledge and associations between labels and hash codes in the subsequent steps, we reconstruct the labels in $O_c$ and $O_n$. On the one hand, since the labels in $O_c$ are clean, we directly use the initial labels as the reconstructed labels and adopt a standard measure in multi-modal hashing to establish semantic associations between hash codes and labels to learn implicit knowledge. On the other hand, for instances in $O_n$, recognizing the significant advantage of labels as guiding information, we design a corrector via the distribution consistency pattern to correct high-confidence noisy labels. Specifically, we treat the clean label set $O_c$ as a knowledge base, and for a given instance $o_i = \{x_i, y_i, l_i\}$ in the noisy label set $O_n$, we identify the two instances $o_j = \{x_j, y_j, l_j\}$ and $o_k = \{x_k, y_k, l_k\}$ from $O_c$ whose distributions of similarity scores $d_j$ and $d_k$ most closely match that of $o_i$. If the labels $l_j$ and $l_k$ of $o_j$ and $o_k$ are consistent, we infer that $o_i$ has high confidence of sharing the same label; otherwise, we consider $o_i$ as having low confidence and treat it as unlabeled data.
This corrector can be represented as follows:

$$m_i = d_i (D^c)^T \tag{7}$$

$$\{o_j, o_k\} = \mathop{\arg\min}_{o_j, o_k \in O_c} \left( \mathrm{rank}(m_i, j) + \mathrm{rank}(m_i, k) \right) \tag{8}$$

$$l_i = \begin{cases} l_j, & l_j = l_k; \\ \text{None}, & \text{otherwise} \end{cases} \tag{9}$$

where $D^c$ is the similarity score matrix of the clean label set $O_c$, $m_i$ represents the level of consistency between the similarity distribution of the noisy label instance $o_i$ and each instance in the clean label set $O_c$, and $\mathrm{rank}(\cdot, \cdot)$ indicates the ranking of consistency levels. As for low-confidence noisy labels, we discard the original labels and treat the instances as unlabeled. Consequently, the noisy label set $O_n$ is further divided into a corrected label set $O_r$ and an unlabeled set $O_u$.

### Objective Function and Optimization

**Pointwise Learning for the Clean Label Set.** To ensure that the generated fused hash codes accurately reflect label semantics, inspired by STBMH (Tu et al. 2024), we design the following pointwise loss on the clean label set $O_c$ to make the hash codes as close as possible to their corresponding categories while keeping them distant from non-relevant categories:

$$L_o = -\frac{1}{n_c} \sum_{i=1}^{n_c} \sum_{j \in Y_i} \log \frac{\exp(\frac{1}{k} \hat{b}_i^T c_j)}{\exp(\frac{1}{k} \hat{b}_i^T c_j) + \sum_{h \in N_i} \exp(\frac{1}{k} \hat{b}_i^T c_h)} \tag{10}$$

where $k$ is the length of the hash code, $n_c$ is the number of instances in $O_c$, $Y_i$ contains the indices of all categories to which instance $o_i$ belongs, and $N_i$ contains the indices of categories to which it does not belong. By minimizing $L_o$, the value of $\exp(\frac{1}{k} \hat{b}_i^T c_j)$ becomes significantly higher than $\sum_{h \in N_i} \exp(\frac{1}{k} \hat{b}_i^T c_h)$, indicating that the similarity between the hash code and its corresponding category center is high while the similarity with non-relevant category centers is low, thereby achieving the desired objective.
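Putting the label pipeline together, the filter (Eqs. 4–6) and the corrector (Eqs. 7–9) reduce to a few lines of NumPy. The sketch below is an illustrative re-implementation from the formulas, not the authors' code; it assumes `D` is the score matrix from Eq. (3) and uses a top-2 argsort as the ranking in Eq. (8):

```python
import numpy as np

def filter_by_consistency(D, L, tau):
    """Label filter (Eqs. 4-6): the per-instance consistency t_i is the mean
    similarity score over the categories the label marks as present; the
    tau*n lowest-t_i instances form the noisy set O_n."""
    t = (D * L).sum(axis=1) / np.maximum(L.sum(axis=1), 1)   # Eq. (4)
    order = np.argsort(t)                                     # ascending t
    n_noisy = int(round(tau * len(t)))
    return order[n_noisy:], order[:n_noisy]                   # clean, noisy indices

def correct_label(d_i, D_clean, L_clean):
    """Corrector (Eqs. 7-9): match the noisy instance's score distribution
    against every clean instance; adopt the label of the two best-matching
    clean instances if they agree, else return None (treat as unlabeled)."""
    m_i = d_i @ D_clean.T                    # Eq. (7): consistency levels
    j, k = np.argsort(-m_i)[:2]              # Eq. (8): two top-ranked matches
    if np.array_equal(L_clean[j], L_clean[k]):
        return L_clean[j]                    # Eq. (9): high-confidence label
    return None
```

Instances for which `correct_label` returns `None` go to the unlabeled set $O_u$; the rest form the corrected set $O_r$.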
**Pairwise Learning for the Corrected Label Set.** Meanwhile, for the corrected label set $O_r$, since the labels within it may still contain correction errors, we adopt a pairwise loss to emphasize the relative similarity relationships among instances as a whole, rather than direct associations between instances and specific individual categories:

$$L_a = \frac{1}{n_r^2} \sum_i \sum_j \left\| \cos(\hat{b}_i, \hat{b}_j) - s_{ij} \right\|_F^2 \tag{11}$$

$$\cos(\hat{b}_i, \hat{b}_j) = \frac{\hat{b}_i^T \hat{b}_j}{\|\hat{b}_i\| \|\hat{b}_j\|} = \frac{1}{k} \hat{b}_i^T \hat{b}_j \tag{12}$$

where $k$ is the length of the hash code, $n_r$ is the number of instances in $O_r$, $\cos(\cdot, \cdot)$ is employed to measure the cosine similarity between hash codes, and $s_{ij}$ represents the pairwise similarity defined by the labels. By minimizing $L_a$, the cosine similarity between hash codes of similar instances defined by the labels approaches 1 and their Hamming distance $d_H(\hat{b}_i, \hat{b}_j)$ decreases, where $d_H(\hat{b}_i, \hat{b}_j) = \frac{1}{2}(k - \hat{b}_i^T \hat{b}_j)$.

**Unsupervised Learning for the Unlabeled Set.** For the unlabeled set $O_u$, given that text and image data points inherently contain rich semantic knowledge, we employ unsupervised contrastive learning to uncover hidden semantic relationships and further enhance model performance. Specifically, for an instance $o_i$ in $O_u$, we first generate an augmented instance $o_i'$ through data augmentation. Then, because $o_i$ and $o_i'$ embody the same semantic meaning, their corresponding hash codes should be as consistent as possible. Therefore, we design the following contrastive loss to minimize the distance between the hash codes $\hat{b}_i$ and $\hat{b}_i'$:

$$L_u = \frac{1}{n_u} \sum_{i=1}^{n_u} (1 - s'_{ii}) + \frac{1}{n_u^2} \sum_{i} \sum_{j=1, j \ne i} \max(0, s'_{ij} - \epsilon) \tag{13}$$

where $s'_{ij} = \cos(\hat{b}_i, \hat{b}'_j)$, i.e., the cosine similarity of hash codes, $n_u$ is the number of instances in $O_u$, and $\epsilon$ is the threshold that restricts the similarity between different instance pairs.
By minimizing $L_u$, the similarity between same-instance pairs approaches 1, while the similarity between different-instance pairs falls below the threshold $\epsilon$, thus ensuring that more similar instances are mapped to closer hash codes and achieving the goal of extracting the inherent semantics of the instances through unsupervised learning.

Table 1: MAPs at a noisy label ratio of 40% for different hash code lengths across the three benchmark datasets.

| Method | MIR Flickr 16 | 32 | 64 | 128 | NUS-WIDE 16 | 32 | 64 | 128 | MS COCO 16 | 32 | 64 | 128 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FOMH | 0.697 | 0.726 | 0.733 | 0.742 | 0.586 | 0.604 | 0.657 | 0.662 | 0.425 | 0.440 | 0.468 | 0.485 |
| OMHDQ | 0.762 | 0.771 | 0.777 | 0.783 | 0.672 | 0.689 | 0.704 | 0.717 | 0.522 | 0.538 | 0.546 | 0.573 |
| SDMH | 0.772 | 0.782 | 0.794 | 0.798 | 0.666 | 0.690 | 0.698 | 0.713 | 0.396 | 0.398 | 0.399 | 0.399 |
| SAPMH | 0.751 | 0.783 | 0.800 | 0.806 | 0.639 | 0.670 | 0.677 | 0.692 | 0.438 | 0.442 | 0.458 | 0.463 |
| DCMVH | 0.715 | 0.725 | 0.742 | 0.754 | 0.590 | 0.612 | 0.667 | 0.685 | 0.399 | 0.414 | 0.477 | 0.499 |
| BSTH | 0.705 | 0.715 | 0.736 | 0.757 | 0.559 | 0.591 | 0.615 | 0.640 | 0.503 | 0.524 | 0.538 | 0.561 |
| NCH | 0.751 | 0.759 | 0.769 | 0.778 | 0.618 | 0.634 | 0.641 | 0.664 | 0.517 | 0.533 | 0.551 | 0.569 |
| GCIMH | 0.738 | 0.746 | 0.751 | 0.756 | 0.600 | 0.604 | 0.618 | 0.623 | 0.488 | 0.499 | 0.516 | 0.523 |
| STBMH | 0.748 | 0.767 | 0.779 | 0.795 | 0.622 | 0.656 | 0.681 | 0.691 | 0.509 | 0.540 | 0.564 | 0.589 |
| DIOR | 0.750 | 0.781 | 0.820 | 0.835 | 0.636 | 0.692 | 0.737 | 0.754 | 0.482 | 0.509 | 0.560 | 0.596 |
| Ours | 0.796 | 0.823 | 0.846 | 0.850 | 0.717 | 0.739 | 0.755 | 0.757 | 0.551 | 0.591 | 0.631 | 0.649 |

**Center Learning and Quantization.** In addition, each category should occupy a distinct region without interference. To achieve this, the centers of the categories must be well separated, which leads us to design the following center loss:

$$d_{ij} = \|c_i - c_j\|_F^2 \tag{14}$$

$$L_c = \frac{1}{|\{d_{ij} \mid i < j\}|} \sum_{i<
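The three objective terms introduced above (Eqs. 10–13) can be sketched in NumPy as follows. These are illustrative re-implementations from the formulas, not the authors' code; `B_hat` denotes the relaxed codes, `C` the category centers, `L` the 0/1 labels, `S` the label-defined ±1 similarity matrix, and `B_aug` the codes of the augmented views:

```python
import numpy as np

def pointwise_loss(B_hat, C, L):
    """Eq. (10): pull each relaxed code toward its own category centers and
    away from the rest, softmax-style.  B_hat: (n,k), C: (m,k), L: (n,m)."""
    n, k = B_hat.shape
    e = np.exp((B_hat @ C.T) / k)            # exp((1/k) * b_i^T c_j)
    total = 0.0
    for i in range(n):
        own = np.flatnonzero(L[i])           # Y_i: categories of instance i
        rest = e[i, np.flatnonzero(1 - L[i])].sum()   # sum over N_i
        for j in own:
            total += np.log(e[i, j] / (e[i, j] + rest))
    return -total / n

def pairwise_loss(B_hat, S):
    """Eqs. (11)-(12): squared gap between the cosine similarity of codes and
    the label-defined similarity S in {-1,1}; cos = (1/k) b_i^T b_j for
    (near) +-1-valued codes."""
    k = B_hat.shape[1]
    cos = (B_hat @ B_hat.T) / k
    return ((cos - S) ** 2).mean()

def contrastive_loss(B_hat, B_aug, eps=0.3):
    """Eq. (13): push each code toward its augmented view (diagonal term) and
    cap cross-instance similarity at eps (off-diagonal term, averaged)."""
    norm = lambda X: X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = norm(B_hat) @ norm(B_aug).T        # s'_ij = cos(b_i, b'_j)
    n = sim.shape[0]
    same = (1.0 - np.diag(sim)).mean()
    off = sim[~np.eye(n, dtype=bool)]
    return same + np.maximum(0.0, off - eps).mean()
```

Each term reaches zero in its ideal configuration: codes aligned with their own centers for $L_o$, cosine similarities matching $S$ for $L_a$, and augmented pairs identical while cross-pairs stay below $\epsilon$ for $L_u$.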