# Towards Universal Backward-Compatible Representation Learning

Binjie Zhang1,2, Yixiao Ge2, Yantao Shen4, Shupeng Su2, Fanzi Wu4, Chun Yuan1, Xuyuan Xu3, Yexin Wang3, Ying Shan2
1Tsinghua University   2ARC Lab, Tencent PCG   3AI Technology Center of Tencent Video   4AWS/Amazon AI
{zbj19@mails,yuanc@sz}.tsinghua.edu.cn, {yixiaoge,yingsshan}@tencent.com
(Corresponding authors. Work done when Binjie and Yantao were at ARC Lab.)

Abstract

Conventional model upgrades for visual search systems require offline refreshment of gallery features by feeding gallery images into new models (dubbed as backfill), which is time-consuming and expensive, especially in large-scale applications. The task of backward-compatible representation learning [Shen et al., 2020] was therefore introduced to support backfill-free model upgrades, where the new query features are interoperable with the old gallery features. Despite this success, previous works only investigated a close-set training scenario (i.e., the new training set shares the same classes as the old one) and fall short in more realistic and challenging open-set scenarios. To this end, we first introduce the new problem of universal backward-compatible representation learning, covering all possible data splits in model upgrades. We further propose a simple yet effective method, dubbed Universal Backward-Compatible Training (UniBCT), with a novel structural prototype refinement algorithm, to learn compatible representations under all kinds of model upgrading benchmarks in a unified manner. Comprehensive experiments on the large-scale face recognition datasets MS1Mv3 and IJB-C fully demonstrate the effectiveness of our method. Source code is available at https://github.com/TencentARC/OpenCompatible.

1 Introduction

Deep learning-based methods [Ghifary et al., 2016; He et al., 2016; Li and Hoiem, 2017] have achieved great success in visual search tasks, such as face recognition [Liu et al., 2017; Yang et al., 2017; Wang et al., 2017; Wang et al., 2018; Deng et al., 2019; Zhang et al., 2020] and landmark retrieval [Philbin et al., 2007; Philbin et al., 2008; Weyand et al., 2020; Radenović et al., 2018; Ge et al., 2020]. The task of visual search requires retrieving images of the same object from a large-scale database (dubbed as gallery), given an image of interest (dubbed as query). The process of offline backfilling* the gallery is always necessary for conventional model upgrades in retrieval systems, which is computationally expensive and time-consuming. Moreover, it is infeasible when the raw images are inaccessible due to privacy issues or storage limitations. Thanks to the introduction of backward-compatible representation learning [Shen et al., 2020; Zhang et al., 2021], new models that are trained with compatibility constraints can be immediately deployed in a backfill-free manner, where the encoded new features for queries are interoperable with the old gallery features.

*Since the features of the upgraded (new) model are not directly comparable with the old gallery features, the gallery needs to be re-extracted by feeding all the raw images into the new model.

Figure 1: Illustration of different training data distributions for universal backward-compatible training: (a) extended-data, (b) open-data, (c) extended-class, (d) open-class, (e) identical-data. According to the data and category differences between old and new training sets, we summarize the data splits into five types, covering most of the compatible training scenarios for backfill-free model upgrades.
The follow-up works make efforts to further improve feature compatibility by designing advanced training constraints [Budnik and Avrithis, 2020; Meng et al., 2021] or transformation architectures [Wang et al., 2020]. Positive as the results are, they only focused on a single close-set model upgrading scenario (dubbed as extended-data in Figure 1 (a)), where the new training data share the identical class set with the old one. It is notable that the data split for model upgrades in real-world applications is complex and unpredictable, that is, both close-set and open-set scenarios should be considered. Existing methods [Shen et al., 2020; Wang et al., 2020; Budnik and Avrithis, 2020; Meng et al., 2021; Su et al., 2022] did not investigate the open-set data splits and are even inapplicable in such scenarios.

Towards this end, we for the first time introduce the task of universal backward-compatible representation learning, where five kinds of data splits covering both close-set and open-set scenarios are considered, as demonstrated in Figure 1. The open-set data splits (including extended-class, open-data and open-class) pose a great challenge for learning compatible representations due to the potential domain gaps among different data and categories. To tackle this challenge, we introduce a simple yet effective method, namely Universal Backward-Compatible Training (UniBCT), to encode compatible representations under all kinds of data splits in a unified manner. Specifically, inspired by [Shen et al., 2020], we utilize the old classifier (in the form of a fully-connected layer) to provide valuable supervision from the old latent space, i.e., enforcing the new features to be closer to their corresponding old class centers. As for the novel categories in the open-class and extended-class scenarios, we extract the old features of the new categories' images and leverage their class centroids to construct pseudo prototypes. Due to the category gaps [You et al., 2019], the pseudo prototypes inevitably carry some noise that may affect the representation learning of backward compatibility. We therefore propose to improve the class centroids of the pseudo prototypes via a novel structural prototype refinement algorithm, i.e., the old features of the new classes' images are refined by propagating their neighbors' knowledge via a fully-connected graph. The graph works under the assumption that visually similar images (measured by the new model, which has the stronger capability) should have close-by old features.

In a nutshell, our contributions are three-fold. (1) We introduce a new task, namely universal backward-compatible representation learning, which aims at investigating all possible data splits in practical model upgrading scenarios. (2) We propose a novel method, dubbed universal backward-compatible training (UniBCT), to tackle the challenge of different kinds of data splits in a unified manner. Our method is simple yet effective in refining the noisy pseudo prototypes and improves feature compatibility in both close-set and open-set scenarios.
(3) We conduct comprehensive experiments on the large-scale face recognition dataset MS1Mv3 [Deng et al., 2019] under five different model upgrading benchmarks, and investigate different compatibility constraints via evaluations on IJB-C [Maze et al., 2018]. Our UniBCT consistently outperforms the baseline and other advanced regularizations, fully indicating the effectiveness of our method.

2 Related Work

Backward-Compatible Learning. Backward-compatible learning aims to make the new features and the old ones interoperable so as to realize backfill-free model upgrades. [Shen et al., 2020] first formulated the problem by deriving an influence loss from an empirical criterion, and solved it by utilizing the old classifier to regularize the optimization process. [Wang et al., 2020] proposed Residual Bottleneck Transformation (RBT) blocks for feature embedding transfer. In [Budnik and Avrithis, 2020], the authors investigated the problem of asymmetric testing, where database images are encoded by a teacher model and query images by a student model; a pair-based metric for instance-level image retrieval was proposed to achieve this goal. [Meng et al., 2021] extended RBT blocks and designed an advanced boundary loss to obtain more compact intra-class distributions. Though the above works can properly improve compatible performance, they rely heavily on the old training data or classes. The open-set compatible scenarios have never been investigated before.

Universal Domain Adaptation. While it is true that universal domain adaptation (UDA) [You et al., 2019; Saito et al., 2020] and our universal backward-compatible representation learning both take the data/category domain gaps between old and new training data into consideration, they have entirely different purposes. UDA focuses on transferring model knowledge from the old domain to the new one and only requires the model to perform well on the new domain, without any cross-domain operations. Universal backward-compatible learning requires the new model to encode backward-compatible features that can be directly compared with the old features.

3 Universal Backward-Compatible Representation Learning

In this section, we first investigate the problem settings of universal backward-compatible representation learning in Sec. 3.1. Then we introduce our universal backward-compatible training (UniBCT) method in Sec. 3.2.

3.1 Problem Settings

Given the gallery features extracted by the old model, the backward-compatible representation learning task requires the trained new model to encode query features that can be directly indexed by the old gallery features. In real-world applications, the new training set may differ from the old one in terms of data or classes, raising a universal backward-compatible representation learning problem.

Symbol Definition. We denote the training set, gallery set, and query set as $D$, $G$, $Q$. An old model $\phi_o$ trained on the old training set $D_o$ embeds an image $x$ into a feature vector $v_o = \phi_o(x)$. For model upgrades, a new model $\phi_n$ trained on $D_n$ is obtained. The new model $\phi_n$ embeds the image $x$ into a new feature vector $v_n = \phi_n(x)$.
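To make the notation concrete, here is a minimal NumPy sketch of the backfill-free setting under these definitions: the gallery keeps the old features $v_o = \phi_o(x)$, while queries are encoded by the new model and searched directly against them. The function and variable names (`encode`, `backfill_free_search`, `phi_new`) are illustrative, not from the released code, and features are assumed to be L2-normalized and compared by cosine similarity.

```python
# Sketch of backfill-free deployment: old gallery features are reused as-is,
# while query features come from the new (upgraded) model.
import numpy as np

def l2_normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def encode(phi, images):
    """v = phi(x) for a batch of images; phi is any feature extractor callable."""
    return l2_normalize(np.stack([phi(x) for x in images]))

def backfill_free_search(phi_new, query_images, old_gallery_feats, topk=5):
    """Index the *old* gallery features with *new* query features (no backfill)."""
    v_n = encode(phi_new, query_images)          # new query features v_n
    sims = v_n @ old_gallery_feats.T             # cross-model cosine similarities
    return np.argsort(-sims, axis=1)[:, :topk]   # top-k gallery indices per query
```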
Benchmarks. Taking both close-set and open-set model upgrading scenarios into consideration, we discuss five kinds of dataset settings as depicted in Table 1: (1) Extended-data: The old training set $D_o^{30\%d}$ is composed of 30% of the images, randomly sampled from the whole dataset, and the new training set $D_n^{100\%}$ is made up of 100% of the data. The old and new training sets share the same classes. (2) Open-data: The new training data $D_n^{70\%d}$ and the old data $D_o^{30\%d}$ are exclusive from each other, but they share the same classes. (3) Extended-class: We randomly pick 30% of the classes for the old training set $D_o^{30\%c}$ and 100% of the classes for the new one $D_n^{100\%}$. (4) Open-class: Both the data and the classes differ between $D_o^{30\%c}$ and $D_n^{70\%c}$. (5) Identical-data: The new training set $D_n^{30\%d}$ and the old one $D_o^{30\%d}$ are identical.

Figure 2: Pipeline of our Universal Backward-Compatible Training (UniBCT). The new model is supervised by a classification loss to learn discriminative features, and an additional universal backward-compatible loss to make sure the new features are interchangeable with the old ones. To alleviate the negative effects of data and category gaps between old and new training data in open-set model upgrading scenarios, we introduce a novel module named structural prototype refinement. It improves the old feature quality by propagating the neighbors' knowledge via a fully-connected graph. Note that during the training process, the pseudo prototypes are not updated by loss backpropagation.

| Allocation type | Old train-set: # images | Old train-set: # classes | New train-set: # images | New train-set: # classes |
| --- | --- | --- | --- | --- |
| Extended-data | 1,511,514 | 93,431 | 5,179,510 | 93,431 |
| Open-data | 1,511,514 | 93,431 | 3,667,996 | 93,431 |
| Extended-class | 1,549,785 | 28,029 | 5,179,510 | 93,431 |
| Open-class | 1,549,785 | 28,029 | 3,629,725 | 65,402 |
| Identical-data | 1,511,514 | 93,431 | 1,511,514 | 93,431 |

Table 1: Five different allocations of the training data, where all the images are sampled from MS1Mv3. The extended-data, open-data and identical-data setups share the same old training set.

Compatibility Evaluation. Cross-model compatibility means that the gallery features produced by $\phi_o$ are directly comparable with the query features extracted by $\phi_n$. Following [Shen et al., 2020], we claim that feature compatibility is achieved if the following empirical criterion is satisfied,

$$M(\phi_n, \phi_o; Q, G) > M(\phi_o, \phi_o; Q, G), \qquad (1)$$

where $M$ is an evaluation metric for the corresponding test set. Cross Test, denoted as $M(\phi_n, \phi_o; Q, G)$, is the query-to-gallery retrieval performance where query features are extracted by the new model $\phi_n$ and gallery features by the old model $\phi_o$. Self Test reflects the performance where query and gallery features are extracted by the same model (e.g., the old one).
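As a reference point, the following is a minimal NumPy sketch of the empirical criterion in Eq. (1), assuming L2-normalized features and top-1 retrieval accuracy as the metric $M$; the function names are illustrative rather than part of the released codebase.

```python
# Sketch of the compatibility criterion M(phi_n, phi_o; Q, G) > M(phi_o, phi_o; Q, G).
import numpy as np

def top1_accuracy(query_feats, query_labels, gallery_feats, gallery_labels):
    """M(.; Q, G): top-1 retrieval accuracy with cosine similarity."""
    sims = query_feats @ gallery_feats.T          # (num_query, num_gallery)
    nearest = sims.argmax(axis=1)                 # best gallery match per query
    return float((gallery_labels[nearest] == query_labels).mean())

def is_backward_compatible(new_query, old_query, old_gallery, q_labels, g_labels):
    """Cross Test uses new-model queries against the old gallery; Self Test uses the old model for both."""
    cross_test = top1_accuracy(new_query, q_labels, old_gallery, g_labels)
    self_test = top1_accuracy(old_query, q_labels, old_gallery, g_labels)
    return cross_test > self_test, cross_test, self_test
```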
3.2 Universal Backward-Compatible Training

To achieve compatibility during new model training, two universal objectives are essential: (1) obtaining discriminative feature representations for better performance, and (2) making the old and new feature representations interoperable. The overall training objective of our universal backward-compatible training can therefore be formulated as

$$L = L_{cls} + \eta L_{uniBCT}, \qquad (2)$$

where $L_{cls}$ is the classification loss to achieve the first goal, $L_{uniBCT}$ is the universal backward-compatible loss to achieve the second goal, and $\eta$ is the loss weight. Specifically, following the state-of-the-art method in metric learning, we use the form of the ArcFace loss [Deng et al., 2019] to regularize the pretext task of classification, that is,

$$L_{cls} = \ell_{arc}(\omega_n, \phi_n), \qquad (3)$$

where $\omega_n$ and $\phi_n$ denote the classifier and backbone of the new model. The formulation of the ArcFace loss is

$$\ell_{arc}(\omega, \phi) = -\sum_{x \in D_n} \log \frac{e^{s\cos(\theta_y + m)}}{e^{s\cos(\theta_y + m)} + \sum_{j \neq y} e^{s\cos\theta_j}}, \qquad (4)$$

where $y$ is the label of the training image $x$, $s$ is a scale factor, $m$ is the margin, and $\theta_y = \arccos\langle \omega_y, \phi(x) \rangle$ is the angle between the weight $\omega_y$ (the $y$-th prototype of the classifier $\omega$) and the feature $\phi(x)$. With $L_{cls}$, the new model can be properly trained to encode discriminative representations for the self test.

According to [Shen et al., 2020], the old classifier (on top of the old backbone model) embeds the characteristics (i.e., class prototypes) of the old latent space, which can be directly leveraged as valuable supervision in close-set compatible training. However, in the open-set benchmarks of our universal backward-compatible representation learning task, the off-the-shelf old classifier is inapplicable due to the novel new classes. Intuitively, to overcome this limitation, we can convert the off-the-shelf old classifier into a pseudo classifier via (1) extracting the features of the new training set with the old model, and (2) using their class centroids as the pseudo classifier weights. We denote the pseudo old classifier as $\hat{\omega}_o$, and the backward-compatible loss can be formulated as

$$L_{uniBCT} = \ell_{arc}(\hat{\omega}_o, \phi_n), \qquad (5)$$

where $\ell_{arc}$ takes the form of the ArcFace loss. $L_{uniBCT}$ pushes the new features closer to their corresponding old class centroids in order to align the old and new latent spaces.
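A minimal PyTorch sketch of the overall objective in Eqs. (2)-(5) is given below, assuming cosine-based ArcFace logits and pseudo old prototypes that are kept frozen during training; the default hyper-parameters and function names are illustrative assumptions, not the released implementation.

```python
# Sketch of L = L_cls + eta * L_uniBCT with ArcFace-form losses (Eqs. (2)-(5)).
import torch
import torch.nn.functional as F

def arcface_loss(features, prototypes, labels, s=64.0, m=0.5):
    """ArcFace-form loss (Eq. (4)) between features and a prototype matrix."""
    feats = F.normalize(features, dim=1)
    protos = F.normalize(prototypes, dim=1)
    cosine = feats @ protos.t()                                   # (batch, num_classes)
    theta_y = torch.acos(cosine.gather(1, labels[:, None]).clamp(-1 + 1e-7, 1 - 1e-7))
    logits = cosine.clone()
    logits.scatter_(1, labels[:, None], torch.cos(theta_y + m))   # add margin to target class
    return F.cross_entropy(s * logits, labels)

def unibct_loss(new_feats, new_classifier_w, pseudo_old_protos, labels, eta=1.0):
    """Overall objective L = L_cls + eta * L_uniBCT (Eq. (2))."""
    l_cls = arcface_loss(new_feats, new_classifier_w, labels)               # Eq. (3)
    l_compat = arcface_loss(new_feats, pseudo_old_protos.detach(), labels)  # Eq. (5); prototypes frozen
    return l_cls + eta * l_compat
```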
It is notable that the quality of the pseudo old prototypes is essential to the training of feature backward compatibility. Due to the domain gap (including the data gap and the category gap) between old and new training sets in open-set model upgrading scenarios, the pseudo old prototypes generated by simple averaging inevitably carry some noise, affecting the representation learning of backward compatibility. To tackle this challenge, we introduce a novel structural algorithm to refine the prototypes via a fully-connected graph.

Structural Prototype Refinement. As illustrated in Figure 2, we improve the old prototypes through knowledge propagation under the assumption that visually similar samples of the same class should have close-by old features. We use the in-training new model to measure their similarities, since the new model is expected to have the stronger capability and can encode more discriminative representations for more accurate similarity measurement. Specifically, we construct a fully-connected undirected graph $G = (V, E)$ for each class, where $V$ and $E$ represent its vertices and edges. In our context, each old feature $v_o \in \mathbb{R}^d$ serves as a vertex, and the features of the same class can be denoted as a matrix $V \in \mathbb{R}^{m \times d}$, where $d$ is the feature dimension and $m$ is the number of samples of the class. The edges among vertices are the similarity scores between pairwise samples, measured by cosine similarity, i.e., $\langle v_n^i, v_n^j \rangle$. Note that we use new-model features to measure the similarity. All the edges of a graph $G$ can be denoted as a symmetric matrix $E$, which we further normalize by row,

$$\tilde{E}_{ij} = \begin{cases} \dfrac{\exp(\langle v_n^i, v_n^j \rangle / \tau)}{\sum_{k \neq i} \exp(\langle v_n^i, v_n^k \rangle / \tau)}, & i \neq j \\ 0, & i = j \end{cases} \qquad (6)$$

where $\tau$ is the temperature hyper-parameter and a lower temperature leads to a sharper probability distribution. Each node in the graph randomly visits neighbor images driven by the transition probabilities (i.e., similarity scores). Similar nodes (neighbors) reinforce each other and move closer to the real center of the current class, while outlier features are rectified by the other nodes. Such a propagation process can be formulated as

$$V^{(t)} = \tilde{E} V^{(t-1)}, \qquad (7)$$

where $t$ is the number of iterations. The initial feature matrix $V^{(0)}$ is aggregated to avoid potential collapse in the propagation process, that is,

$$V^{(t)} = \lambda \tilde{E} V^{(t-1)} + (1 - \lambda) V^{(0)}, \qquad (8)$$

where $\lambda \in [0, 1]$ is the aggregation weight. When $t$ tends to infinity, Eq. (8) has a converged closed form,

$$V^{(\infty)} = (1 - \lambda)(I - \lambda \tilde{E})^{-1} V^{(0)}, \qquad (9)$$

where $I$ is an identity matrix and $(\cdot)^{-1}$ denotes the matrix inverse. Once $V^{(\infty)}$ is obtained, the class prototype can be computed by column-wise average pooling of $V^{(\infty)}$,

$$\hat{\omega}_o(j) = \frac{1}{m} \sum_{i=1}^{m} V^{(\infty)}(i, :), \qquad (10)$$

where $\hat{\omega}_o(j) \in \mathbb{R}^d$ is the $j$-th pseudo prototype and $m$ is the number of vertices belonging to the $j$-th class. The refined class prototypes $\hat{\omega}_o$ are used as the supervision signals in the universal backward-compatible loss $L_{uniBCT}$ (Eq. (5)). Compared with the vanilla average-based prototypes, our structural prototype refinement effectively alleviates outlier effects by propagating and aggregating the knowledge from neighboring features of the same class.
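The following is a minimal NumPy sketch of structural prototype refinement (Eqs. (6)-(10)) for a single class, assuming L2-normalized old and new features of that class; the function name and default values are illustrative.

```python
# Sketch of structural prototype refinement for one class (Eqs. (6)-(10)).
import numpy as np

def refine_prototype(old_feats, new_feats, tau=0.05, lam=0.9):
    """old_feats, new_feats: (m, d) features of one class from the old / new model."""
    m = new_feats.shape[0]
    sim = new_feats @ new_feats.T                      # pairwise cosine similarities (new model)
    np.fill_diagonal(sim, -np.inf)                     # exclude self-edges before the softmax
    edges = np.exp(sim / tau)
    edges /= edges.sum(axis=1, keepdims=True)          # row-normalized transition matrix (Eq. (6))
    # Closed-form propagation (Eq. (9)): V_inf = (1 - lam) (I - lam * E~)^(-1) V_0
    v_inf = (1.0 - lam) * np.linalg.solve(np.eye(m) - lam * edges, old_feats)
    return v_inf.mean(axis=0)                          # column-wise average pooling (Eq. (10))
```

Because of the closed form in Eq. (9), the refinement only requires solving one m x m linear system per class instead of iterating Eq. (8).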
4 Experiments

To perform a thorough evaluation, we evaluate our method (UniBCT) under all compatible settings on a large-scale face recognition dataset. The satisfying results indicate the effectiveness and robustness of our approach.

4.1 Experimental Setup

Datasets. MS-Celeb-1M (MS1M) [Guo et al., 2016] is a large-scale face recognition training dataset, which consists of about 10 million images with 1 million identities. Since the original MS1M dataset includes abundant noisy images, we adopt MS1Mv3 [Deng et al., 2019] as the training set, which is made up of 5,179,510 training images with 93,431 labels. IJB-C [Maze et al., 2018], a challenging benchmark with around 1.3 million images, is utilized as the open-set evaluation dataset. For the verification task, there are 469,376 template pairs. For the identification task, the query set contains 19,593 images and the gallery set consists of 3,531 images.

Metric. We employ two standard test protocols in face recognition: (1) 1:1 verification calculates the true acceptance rate (TAR) at different false acceptance rates (FAR) for template pairs. In the Cross Test, we extract the first template with the new model and the second with the old model. (2) 1:N identification evaluates the retrieval accuracy at top-k. In the Cross Test, we process the query set (probe images) and the gallery set (template images) with the new and old models, respectively.

Training Details. We use 4 NVIDIA V100 GPUs for training. The training index file is split with a fixed random seed of 666. We adopt ResNet-18 and ResNet-50 [He et al., 2016] architectures as the backbones of the old and new models; one fully-connected layer follows to project the output dimension to 512. We adopt standard stochastic gradient descent (SGD) to optimize the model parameters. The learning rate is set to 0.1 and divided by 10 at the 20th, 26th and 32nd epochs; training stops after 35 epochs. The weight decay is set to $10^{-4}$ and the momentum is 0.9. The batch size is set to 256. The scale factor $s$ and margin $m$ in Eq. (4) are 64 and 0.5, following the default setting of https://github.com/deepinsight/insightface. In the graph-based prototype refinement, we set $\lambda$ to 0.9 and $\tau$ to 0.05.
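For concreteness, a minimal sketch of the optimization setup described above, assuming PyTorch and using a torchvision ResNet-18 as a stand-in for the backbone plus the 512-d projection head:

```python
# Sketch of the SGD schedule from the training details (assumed PyTorch setup).
import torch
import torchvision

# Stand-in for the ResNet-18/50 backbone followed by a fully-connected 512-d projection.
backbone = torchvision.models.resnet18(num_classes=512)
optimizer = torch.optim.SGD(backbone.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
# Learning rate divided by 10 at epochs 20, 26 and 32; training stops after 35 epochs (batch size 256).
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[20, 26, 32], gamma=0.1)
```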
4.2 Analysis of UniBCT

Effectiveness of Structural Prototype Refinement. Since the quality of the pseudo prototypes has an essential impact on backward-compatible learning, we introduce the structural prototype refinement mechanism to improve the old features by propagating knowledge from their neighbors. As illustrated in Table 2, our method ($L_{uniBCT}$) not only fulfills the requirement of feature-compatible training, but also clearly boosts the baseline ($L^*_{uniBCT}$), which adopts vanilla prototypes for training.

| Scenario | Model_old | Model_new | Training set | Comp. loss | 1:1 CT (TAR@FAR) | 1:1 ST (TAR@FAR) | 1:N CT Top1 | 1:N CT Top5 | 1:N ST Top1 | 1:N ST Top5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Extended-data | ϕ_o^r18 | – | D_o^30%d | – | – | 93.36 | – | – | 69.90 | 75.88 |
| | – | ϕ_n^r18 | D_n^100% | – | – | 96.35 | – | – | 80.67 | 85.14 |
| | ϕ_o^r18 | ϕ_n^r18 | D_n^100% | L_regress | 0.12 | 94.78 | 8.12 | 10.43 | 76.34 | 80.88 |
| | ϕ_o^r18 | ϕ_n^r18 | D_n^100% | L_contra | 92.26 | 94.58 | 73.36 | 81.35 | 80.90 | 85.99 |
| | ϕ_o^r18 | ϕ_n^r18 | D_n^100% | L*_uniBCT | 93.88 | 94.62 | 72.46 | 81.25 | 80.51 | 84.78 |
| | ϕ_o^r18 | ϕ_n^r18 | D_n^100% | L_uniBCT | 94.13 | 94.85 | 72.89 | 81.77 | 80.83 | 85.95 |
| Open-data | ϕ_o^r18 | – | D_o^30%d | – | – | 93.36 | – | – | 69.90 | 75.88 |
| | – | ϕ_n^r18 | D_n^70%d | – | – | 94.28 | – | – | 75.55 | 80.24 |
| | ϕ_o^r18 | ϕ_n^r18 | D_n^70%d | L_regress | 0.02 | 94.51 | 7.36 | 9.12 | 73.21 | 78.84 |
| | ϕ_o^r18 | ϕ_n^r18 | D_n^70%d | L_contra | 92.23 | 94.42 | 70.34 | 78.20 | 76.69 | 81.75 |
| | ϕ_o^r18 | ϕ_n^r18 | D_n^70%d | L*_uniBCT | 93.75 | 94.37 | 70.35 | 77.68 | 76.54 | 81.69 |
| | ϕ_o^r18 | ϕ_n^r18 | D_n^70%d | L_uniBCT | 94.18 | 94.52 | 71.42 | 79.14 | 76.88 | 81.92 |
| Extended-class | ϕ_o^r18 | – | D_o^30%c | – | – | 92.95 | – | – | 68.84 | 74.72 |
| | – | ϕ_n^r18 | D_n^100% | – | – | 96.35 | – | – | 80.67 | 85.14 |
| | ϕ_o^r18 | ϕ_n^r18 | D_n^100% | L_regress | 0.08 | 93.21 | 7.55 | 9.67 | 74.15 | 78.72 |
| | ϕ_o^r18 | ϕ_n^r18 | D_n^100% | L_contra | 92.70 | 94.53 | 71.83 | 79.26 | 78.43 | 83.76 |
| | ϕ_o^r18 | ϕ_n^r18 | D_n^100% | L*_uniBCT | 93.54 | 94.32 | 71.67 | 79.33 | 78.51 | 84.14 |
| | ϕ_o^r18 | ϕ_n^r18 | D_n^100% | L_uniBCT | 93.75 | 94.55 | 72.02 | 79.13 | 78.84 | 84.33 |
| Open-class | ϕ_o^r18 | – | D_o^30%c | – | – | 92.95 | – | – | 68.84 | 74.72 |
| | – | ϕ_n^r18 | D_n^70%c | – | – | 94.28 | – | – | 75.55 | 80.24 |
| | ϕ_o^r18 | ϕ_n^r18 | D_n^70%c | L_regress | 0.01 | 92.78 | 6.88 | 8.12 | 70.26 | 75.95 |
| | ϕ_o^r18 | ϕ_n^r18 | D_n^70%c | L_contra | 92.51 | 94.24 | 66.51 | 75.82 | 73.63 | 79.96 |
| | ϕ_o^r18 | ϕ_n^r18 | D_n^70%c | L*_uniBCT | 93.35 | 93.96 | 67.14 | 76.38 | 74.21 | 80.28 |
| | ϕ_o^r18 | ϕ_n^r18 | D_n^70%c | L_uniBCT | 93.46 | 94.10 | 67.47 | 77.01 | 74.79 | 81.22 |
| Identical-data | ϕ_o^r18 | – | D_o^30%d | – | – | 93.36 | – | – | 69.90 | 75.88 |
| | – | ϕ_n^r50 | D_n^30%d | – | – | 94.97 | – | – | 70.21 | 76.34 |
| | ϕ_o^r18 | ϕ_n^r50 | D_o^30%d | L_regress | 0.11 | 93.78 | 7.73 | 9.35 | 67.41 | 73.43 |
| | ϕ_o^r18 | ϕ_n^r50 | D_o^30%d | L_contra | 92.53 | 95.58 | 64.67 | 74.69 | 70.49 | 77.13 |
| | ϕ_o^r18 | ϕ_n^r50 | D_o^30%d | L*_uniBCT | 94.40 | 95.42 | 67.38 | 73.25 | 70.57 | 78.34 |
| | ϕ_o^r18 | ϕ_n^r50 | D_o^30%d | L_uniBCT | 94.59 | 95.63 | 67.71 | 73.81 | 70.66 | 78.76 |

Table 2: Comparison of baselines and our proposed approach on the IJB-C dataset in universal backward-compatible scenarios, including five different benchmarks. The architectures are ResNet-18 (r18) and ResNet-50 (r50). $L^*_{uniBCT}$ denotes the vanilla version of the universal backward-compatible loss, where the pseudo prototypes are simply averaged over the raw old features. $L_{uniBCT}$ uses our structural prototype refinement algorithm to improve the pseudo classifier and achieves the best performance. We evaluate all models in two aspects: (1) For 1:1 verification, the first and second templates are extracted by the new and old models in the Cross Test (CT), and both are processed by the same new model in the Self Test (ST); TAR@FAR=1e-4 is adopted as the compatible metric. (2) For 1:N identification, the query and gallery sets are extracted by the new and old models, respectively, in the CT. We report the retrieval accuracy in terms of top1 and top5.

In addition, an alternative approach for refining the pseudo prototypes is to discard outlier samples that are far away from the class centroids. Specifically, we filter out the top-10% of samples that are farthest from the mean feature vector of each class and use the remaining features to generate the class prototype. As shown in Table 3, this variant (denoted as "drop avg.") performs worse than the proposed graph-based refinement (denoted as "refined avg."). That is because the distribution of the old features is noisy and unreliable: the drop strategy only refers to the old distribution, while our structural refinement utilizes the sample similarities in the new latent space as propagation guidance.

| Method | Prototype | 1:1 TAR@FAR | Comp.? | 1:N Top1 | 1:N Top5 | Comp.? |
| --- | --- | --- | --- | --- | --- | --- |
| ϕ_o^r18 | – | 93.36 | – | 69.90 | 75.88 | – |
| Ours | vanilla avg. | 93.88 | ✓ | 72.46 | 81.25 | ✓ |
| Ours | drop avg. | 94.03 | ✓ | 72.35 | 80.97 | ✓ |
| Ours | refined avg. | 94.13 | ✓ | 72.89 | 81.77 | ✓ |

Table 3: Comparison of different prototypes for the old pseudo classifier. "Refined avg." denotes our solution of structural prototype refinement. The results are reported on IJB-C (extended-data) in terms of 1:1 verification (TAR@FAR=1e-4) and 1:N identification.

Comparison to Other Forms of Constraints. The old prototypes represent the global contents of the old model, while each old feature carries local details. Therefore, directly maximizing the similarity between a new feature and the corresponding old feature is an alternative way to achieve compatibility. One direct approach is to minimize the Euclidean distance between the old and new features extracted from the same image:

$$L_{regress}(\phi_n, \phi_o) = \frac{1}{|D_n|} \sum_{x \in D_n} \| \phi_n(x) - \phi_o(x) \|_2. \qquad (11)$$

As demonstrated in Table 2, feature regression fails in all settings. The reason might be that simply minimizing the distance between positive pairs is not enough. We thus turn to another solution, i.e., pulling new-old positive pairs close and pushing away the negative pairs in the form of contrastive learning. Considering each new feature $(\phi_n(x_i), y_i)$ as the anchor, the positive set consists of the old features of the same class, $P(i) = \{\phi_o(x_j) \mid y_j = y_i\}$, and the negative set comprises the other old features, $N(i) = \{\phi_o(x_j) \mid y_j \neq y_i\}$. To simplify the training process, we only consider one positive pair $(\phi_n(x_i), \phi_o(x_i))$. The compatible loss is formulated as

$$L_{contra}(\phi_n, \phi_o) = -\sum_{x_i \in D_n} \log \frac{e^{\langle \phi_n(x_i), \phi_o(x_i) \rangle / \tau}}{\sum_{k \in \{x_i\} \cup N(i)} e^{\langle \phi_n(x_i), \phi_o(k) \rangle / \tau}}, \qquad (12)$$

where $\tau$ is a temperature hyper-parameter. As shown in Table 2, the performance of UniBCT surpasses the other losses in terms of both the Cross Test and the Self Test. Our UniBCT loss adopts the classification-like form following [Shen et al., 2020] and considers global intra-class and inter-class relations; in contrast, the contrastive loss (Eq. (12)) only considers the classes in the current mini-batch, neglecting the global information.
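For reference, a minimal PyTorch sketch of the two alternative constraints compared above (Eqs. (11) and (12)); drawing the negatives from the current mini-batch and the temperature default are assumptions, and the function names are illustrative.

```python
# Sketch of the regression (Eq. (11)) and contrastive (Eq. (12)) compatibility constraints.
import torch

def regression_loss(new_feats, old_feats):
    """Eq. (11): mean Euclidean distance between new and old features of the same images."""
    return (new_feats - old_feats).norm(dim=1).mean()

def contrastive_loss(new_feats, old_feats, labels, tau=0.05):
    """Eq. (12): pull each new-old positive pair together, push other-class old features away."""
    logits = new_feats @ old_feats.t() / tau              # (batch, batch) new-to-old similarities
    same_class = labels[:, None] == labels[None, :]       # same-class pairs are not negatives
    positives = torch.diag(logits)                        # the single positive pair per anchor
    # keep the positive (diagonal) and all different-class negatives in the denominator
    keep = (~same_class) | torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    exp_logits = torch.exp(logits) * keep
    return -(positives - torch.log(exp_logits.sum(dim=1))).mean()
```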
Close-set vs. Open-set. For the 1:1 verification task, our method achieves remarkable performance in all close-set and open-set scenarios. For the 1:N identification task, the empirical criterion (Eq. (1)) is satisfied in most practical settings, except for the most challenging scenario (i.e., open-class), demonstrating that UniBCT can properly alleviate the category gap but cannot entirely solve it. Even so, we still outperform the other competing methods, indicating the effectiveness of UniBCT.

4.3 Comparison with State-of-the-Art Methods

To show that our approach UniBCT consistently surpasses previous compatible training methods on the conventional close-set benchmarks, we conduct comparison experiments on the extended-data setup. As shown in Table 4, we compare with BCT [Shen et al., 2020] and AML [Budnik and Avrithis, 2020]. Note that [Wang et al., 2020] and [Meng et al., 2021] are not listed as they require extra network parameters, which would make the comparison unfair. AML aims to enlarge the similarity of positive pairs, which is the same as the regression loss in Eq. (11). AML fails to achieve compatibility in the face recognition task, although it works well for landmark retrieval in its original paper: the regression loss only focuses on decreasing the distance between positive pairs while ignoring the distance restriction between negative pairs, leading to unsatisfactory performance in fine-grained retrieval tasks like face recognition.

| Method | Comp. loss | 1:1 TAR@FAR | Comp.? | 1:N Top1 | 1:N Top5 | Comp.? |
| --- | --- | --- | --- | --- | --- | --- |
| ϕ_o^r18 | – | 93.36 | – | 69.90 | 75.88 | – |
| AML | L_regress | 0.12 | ✗ | 8.12 | 10.43 | ✗ |
| BCT | L_BCT | 94.01 | ✓ | 72.64 | 81.49 | ✓ |
| Ours | L*_uniBCT | 93.88 | ✓ | 72.46 | 81.25 | ✓ |
| Ours | L_uniBCT | 94.13 | ✓ | 72.89 | 81.77 | ✓ |

Table 4: Comparison with state-of-the-art backward-compatible training methods on IJB-C (extended-data). Only extended-data is evaluated here since BCT is inapplicable to the other open-set benchmarks. The results are reported in terms of 1:1 verification (TAR@FAR=1e-4) and 1:N identification.

As introduced in the method section, in the close-set setup the off-the-shelf old classifier can directly serve as the old prototypes according to [Shen et al., 2020]. To investigate the difference between the off-the-shelf old classifier ($\omega_o$) and the pseudo classifier ($\hat{\omega}_o$), we compare $L_{BCT}$ and $L^*_{uniBCT}$ (vanilla avg.) in Table 4. It is notable that the vanilla prototypes achieve comparable performance with a minor sacrifice, indicating that the class centers of the pseudo classifier may not be as faithful as those of the trained classifier. However, with our structural prototype refinement method, UniBCT clearly surpasses the original BCT, which further demonstrates the effectiveness of our method.

5 Conclusion

We for the first time introduce the task of universal backward-compatible representation learning, which covers both close-set and open-set compatible training scenarios for real-world model upgrades. To tackle the challenge of noisy old prototype features, we propose a simple yet effective method, namely UniBCT, to properly refine the prototypes by propagating and aggregating their neighbors' knowledge. UniBCT trains the new models to encode discriminative and compatible representations in five different benchmarks in a unified manner. It is a first step towards universal compatible feature learning, and there is still a long way to go to fully solve this problem. Further studies are called for.

Acknowledgements

This work was supported by NSFC project Grant No. U1833101, SZSTI Grant No. JCYJ20190809172201639, ZDSYS20210623092001004 and WDZC20200820200655001, and the Joint Research Center of Tencent and Tsinghua.

References

[Budnik and Avrithis, 2020] Mateusz Budnik and Yannis Avrithis. Asymmetric metric learning for knowledge transfer. arXiv preprint arXiv:2006.16331, 2020.
[Deng et al., 2019] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. In CVPR, 2019.
[Ge et al., 2020] Yixiao Ge, Haibo Wang, Feng Zhu, Rui Zhao, and Hongsheng Li. Self-supervising fine-grained region similarities for large-scale image localization. In ECCV, 2020.
[Ghifary et al., 2016] Muhammad Ghifary, W Bastiaan Kleijn, Mengjie Zhang, David Balduzzi, and Wen Li. Deep reconstruction-classification networks for unsupervised domain adaptation. In ECCV, 2016.
[Guo et al., 2016] Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, and Jianfeng Gao. MS-Celeb-1M: A dataset and benchmark for large-scale face recognition. In ECCV, 2016.
[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[Li and Hoiem, 2017] Zhizhong Li and Derek Hoiem. Learning without forgetting. PAMI, 2017.
[Liu et al., 2017] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. SphereFace: Deep hypersphere embedding for face recognition. In CVPR, 2017.
[Maze et al., 2018] Brianna Maze, Jocelyn Adams, James A Duncan, Nathan Kalka, Tim Miller, Charles Otto, Anil K Jain, W Tyler Niggel, Janet Anderson, Jordan Cheney, et al. IARPA Janus Benchmark-C: Face dataset and protocol. In ICB, 2018.
[Meng et al., 2021] Qiang Meng, Chixiang Zhang, Xiaoqiang Xu, and Feng Zhou. Learning compatible embeddings. In ICCV, 2021.
[Philbin et al., 2007] James Philbin, Ondrej Chum, Michael Isard, Josef Sivic, and Andrew Zisserman. Object retrieval with large vocabularies and fast spatial matching. In CVPR, 2007.
[Philbin et al., 2008] James Philbin, Ondrej Chum, Michael Isard, Josef Sivic, and Andrew Zisserman. Lost in quantization: Improving particular object retrieval in large scale image databases. In CVPR, 2008.
[Radenović et al., 2018] Filip Radenović, Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondřej Chum. Revisiting Oxford and Paris: Large-scale image retrieval benchmarking. In CVPR, 2018.
[Saito et al., 2020] Kuniaki Saito, Donghyun Kim, Stan Sclaroff, and Kate Saenko. Universal domain adaptation through self supervision. In NeurIPS, 2020.
[Shen et al., 2020] Yantao Shen, Yuanjun Xiong, Wei Xia, and Stefano Soatto. Towards backward-compatible representation learning. In CVPR, 2020.
[Su et al., 2022] Shupeng Su, Binjie Zhang, Yixiao Ge, Xuyuan Xu, Yexin Wang, Chun Yuan, and Ying Shan. Privacy-preserving model upgrades with bidirectional compatible training in image retrieval. arXiv preprint arXiv:2204.13919, 2022.
[Wang et al., 2017] Feng Wang, Xiang Xiang, Jian Cheng, and Alan Loddon Yuille. NormFace: L2 hypersphere embedding for face verification. In ACM MM, 2017.
[Wang et al., 2018] Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li, and Wei Liu. CosFace: Large margin cosine loss for deep face recognition. In CVPR, 2018.
[Wang et al., 2020] Chien-Yi Wang, Ya-Liang Chang, Shang-Ta Yang, Dong Chen, and Shang-Hong Lai. Unified representation learning for cross model compatibility. arXiv preprint arXiv:2008.04821, 2020.
[Weyand et al., 2020] Tobias Weyand, Andre Araujo, Bingyi Cao, and Jack Sim. Google Landmarks Dataset v2: A large-scale benchmark for instance-level recognition and retrieval. In CVPR, 2020.
[Yang et al., 2017] Jiaolong Yang, Peiran Ren, Dongqing Zhang, Dong Chen, Fang Wen, Hongdong Li, and Gang Hua. Neural aggregation network for video face recognition. In CVPR, 2017.
[You et al., 2019] Kaichao You, Mingsheng Long, Zhangjie Cao, Jianmin Wang, and Michael I Jordan. Universal domain adaptation. In CVPR, 2019.
[Zhang et al., 2020] Xiao Zhang, Rui Zhao, Yu Qiao, and Hongsheng Li. RBF-Softmax: Learning deep representative prototypes with radial basis function softmax. Pages 296-311, Springer, 2020.
[Zhang et al., 2021] Binjie Zhang, Yixiao Ge, Yantao Shen, Yu Li, Chun Yuan, Xuyuan Xu, Yexin Wang, and Ying Shan. Hot-refresh model upgrades with regression-free compatible training in image retrieval. In ICLR, 2021.