Clustered-patch Element Connection for Few-shot Learning

Jinxiang Lai1, Siqian Yang1, Junhong Zhou2, Wenlong Wu1, Xiaochen Chen1, Jun Liu1, Bin-Bin Gao1, Chengjie Wang1,3
1Tencent Youtu Lab, China; 2Southern University of Science and Technology, China; 3Shanghai Jiao Tong University, China
layjins1994@gmail.com, {seasonsyang, ezrealwu, husonchen}@tencent.com, 12011801@mail.sustech.edu.cn, {junsenselee, csgaobb}@gmail.com, jasoncjwang@tencent.com

Abstract

The weak feature representation problem has long limited the performance of few-shot classification. To alleviate it, recent works build connections between support and query instances by embedding patch features to generate discriminative representations. However, we observe that semantic mismatches (foreground/background) exist among these local patches, because the location and size of the target object are not fixed. Worse, these mismatches result in unreliable similarity confidences, and complex dense connections exacerbate the problem. Accordingly, we propose a novel Clustered-patch Element Connection (CEC) layer to correct the mismatch problem. The CEC layer leverages the Patch Cluster and Element Connection operations to collect and establish reliable connections with highly similar patch features, respectively. Moreover, we propose CECNet, which includes a CEC-based attention module and a CEC-based distance metric. The former generates a more discriminative representation by benefiting from the global clustered-patch features, and the latter reliably measures the similarity between pair-features. Extensive experiments demonstrate that our CECNet outperforms the state-of-the-art methods on classification benchmarks. Furthermore, our CEC approach can be extended to few-shot segmentation and detection tasks, where it achieves competitive performance.
1 Introduction

In contrast to general deep learning tasks [Krizhevsky et al., 2012], Few-Shot Learning (FSL) aims to learn a transferable classifier from abundant seen images (base classes) and few labeled unseen images (novel classes). Due to the lack of effective features for unseen classes, a robust feature embedding model is indispensable. Recent works [Hou et al., 2019; Rizve et al., 2021; Xu et al., 2021a] design embedding networks to generate more discriminative features.

Figure 1: Comparison between traditional Cross Attention and our Clustered-patch Element Connection. The proposed Clustered-patch Element Connection, which utilizes the global info Cp integrated from the support feature P to perform element connection with the query Q, leading to a confident and clear connection, is able to generate a clearer and more precise relation map than Cross Attention. The detailed Patch Cluster operation is illustrated in Fig.2. Visualization comparisons are shown in Fig.4(a).

Specifically, cross attention based methods [Hou et al., 2019; Xu et al., 2021a; Xu et al., 2021b] focus on reducing background noise and highlighting the target region to generate more discriminative representations. The core idea of these methods is to divide the extracted features into patches and connect all local patch features. However, as shown in Fig.1, the target object may be located randomly and at different scales among the query images. Hence, these methods suffer from two main problems: inconsistent semantics in feature space, and unreliable, redundant connections.

Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI-23)
To tackle these problems, we propose a Clustered-patch Element Connection (CEC) layer which consists of Patch Cluster and Element Connection operations. In detail, given input features P (support) and Q (query), the CEC layer first obtains the global clustered-patch features Cp via the Patch Cluster operation as illustrated in Fig.2, then performs Element Connection on Q using Cp, and finally produces a more discriminative representation Q̄. Patch Cluster aims to collect the objects in the source feature P that are similar to the reference patches in Q, which adaptively aligns P into Cp to obtain a semantically consistent feature for each patch of Q. Then, with the global clustered-patch features, the CEC layer generates more reliable and concise connections than cross attention.

According to the CEC layer, we find that the key to generating an accurate relation map is obtaining appropriate clustered-patch features. In this paper, four solutions are introduced to perform Patch Cluster: MatMul, Cosine, GCN and Transformer. Different from the naive MatMul and Cosine modes, we propose meta-GCN and Transformer based Patch Cluster operations, which obtain a more robust clustered-patch by applying additional feature refinement. The insight of meta-GCN is to construct a dynamic correlation-based adjacency matrix for each input pair of features, rather than the fixed adjacency matrix of the static GCN [Kipf and Welling, 2017]. Besides, the transformer structure obtains global information by modeling a spatio-temporal correlation among instances, which yields a more accurate relation map.
Along with the description of the CEC mechanism, we propose three CEC-based modules: (I) The Clustered-patch Element Connection Module (CECM) adaptively distinguishes the background and the object for each image pair (support and query) at the feature level, giving more precise highlights at the regions of the target object; (II) The Self-CECM enhances the semantic feature of the target object in a self-attention manner to make the representation more robust; (III) The Clustered-patch Element Connection Distance (CECD) is a CEC-based distance metric which measures the similarity between pair-features via the obtained reliable relation map. For the few-shot classification task, we introduce a novel Clustered-patch Element Connection Network (CECNet) as illustrated in Fig.3, which learns a well-generalizing embedding benefiting from auxiliary tasks, generates a discriminative representation via CECM and Self-CECM, and measures a reliable similarity map via CECD. Furthermore, we derive a novel CEC-based embedding module named CEC Embedding (CECE), which can be applied to few-shot semantic segmentation (FSSS) and few-shot object detection (FSOD) tasks. We simply stack the proposed CECE after the backbone network of existing FSSS and FSOD methods, which achieves consistent improvements of around 1% to 3%.

Figure 2: Patch Cluster and Element Connection.

To summarize, our main contributions are:
- We propose a Clustered-patch Element Connection (CEC) layer to strengthen the target regions of query features by element-wisely connecting them with the global clustered-patch features. Four different CEC modes are introduced: MatMul, Cosine, GCN and Transformer.
- We derive three CEC-based modules: the CECM and Self-CECM modules are utilized to produce more discriminative representations, and CECD is able to measure a reliable similarity map.
- With the CEC-based modules and auxiliary tasks, a novel CECNet model is designed for few-shot classification. CECNet improves over the state of the art on few-shot classification benchmarks, and the experiments demonstrate that our method is effective in FSL. Furthermore, our CECE (i.e., the CEC-based embedding module) can be extended to few-shot segmentation and detection tasks, achieving performance improvements of around 1% to 3% on the corresponding benchmarks.

2 Related Work

Few-Shot Learning. FSL algorithms aim to recognize novel categories with few labeled images, and a category-disjoint base set with abundant images is provided for pre-training. Classic FSL tasks include few-shot classification [Finn et al., 2017; Vinyals et al., 2016; Snell et al., 2017; Hou et al., 2019; Tian et al., 2020], semantic segmentation [Zhang et al., 2020b; Siam et al., 2019; Malik et al., 2021] and object detection [Kang et al., 2019; Wang et al., 2020; Qiao et al., 2021]. More introductions are presented in the APPENDIX. In a word, the existing FSL methods lack a uniform function to semantically control the connections among the patches between support and query instances.

Other Related Works are introduced in the APPENDIX, such as Auxiliary Tasks for FSL [Hou et al., 2019; Rizve et al., 2021], Graph Convolutional Networks (GCN) [Bruna et al., 2013], and Transformers [Vaswani et al., 2017].

3 Problem Definition

3.1 Few-Shot Classification

A classic few-shot classification problem is specified as an N-way K-shot task, which means solving an N-class classification problem with only K labeled instances provided per class. In recent investigations [Hou et al., 2019; Snell et al., 2017], the source dataset is divided into three category-disjoint parts: training set $X_{train}$, validation set $X_{val}$ and test set $X_{test}$. Moreover, the episodic training mechanism is widely adopted.
An episode consists of two sets (randomly sampled over N categories): support and query. Let $S = \{(x^s_i, y^s_i)\}_{i=1}^{n_s}$ ($n_s = NK$) denote the support set, and $Q = \{(x^q_i, y^q_i)\}_{i=1}^{n_q}$ denote the query set. Note that $n_s$ and $n_q$ are the sizes of the corresponding sets. In particular, $S = \{S^1, S^2, \dots, S^N\}$, where $S^k$ denotes the support set of the $k$-th category in $S$. Specifically, let $(P, Q) \in \mathbb{R}^{hw \times c}$ denote the support and query features, which are extracted from the support subset and query instance $(S^k, x^q)$. Note that $c$, $h$, $w$ are the channel number, height and width of the features, respectively.

3.2 Cross Attention

The traditional Cross Attention [Hou et al., 2019] shows that highlighting target regions can generate more discriminative representations, leading to accuracy improvements for FSL. The key is to generate a fine-grained relation map $R^Q \in \mathbb{R}^{hw}$ representing the target regions in $Q \in \mathbb{R}^{hw \times c}$. Then, a spatial-wise feature attention can be obtained through $R^Q \odot Q$, where $\odot$ is the element-wise product. The traditional Cross Attention produces the relation map $R^Q$ for $Q$ by:

$$R^Q = h\Big( \tfrac{P}{\|P\|_2} \big( \tfrac{Q}{\|Q\|_2} \big)^T \Big), \qquad (1)$$

where $h$ is a CNN-based layer that refines the correlation matrix $\tfrac{P}{\|P\|_2} (\tfrac{Q}{\|Q\|_2})^T \in \mathbb{R}^{hw \times hw}$. According to Eq.(1), Cross Attention produces the relation map by a local-to-local full connection among the local feature patches of $P$ and $Q$. In detail, $(P_i, Q_j) \in \mathbb{R}^{1 \times c}$ denotes a pair of support and query feature patches from $(P, Q) \in \mathbb{R}^{hw \times c}$. As shown in Fig.1, the target object may be located irregularly among the query images at different scales, which results in inconsistent semantics in feature space, i.e., feature patches $P_i$ and $Q_j$ may be semantically inconsistent. This problem causes low-confidence correlations between patches, and the complex local-to-local full connection further accumulates this inaccurate bias, which degrades the quality of the generated relation map.
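As a concrete illustration of the local-to-local connection in Eq.(1), the following NumPy sketch compares every support patch with every query patch. This is a simplification on our part: the paper's CNN refiner h is replaced by an identity, the per-patch relation is reduced by a plain mean over support patches, and the toy shapes (hw = 9, c = 4) are arbitrary.

```python
import numpy as np

def cross_attention_relation(P, Q):
    """Local-to-local correlation: each of the hw support patches of P is
    compared with each of the hw query patches of Q (identity stands in
    for the CNN refiner h; mean stands in for the learned aggregation)."""
    Pn = P / np.linalg.norm(P, axis=1, keepdims=True)   # (hw, c)
    Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)   # (hw, c)
    corr = Pn @ Qn.T                                    # (hw, hw) pairwise cosines
    return corr.mean(axis=0)                            # (hw,) relation map for Q

rng = np.random.default_rng(0)
P, Q = rng.standard_normal((9, 4)), rng.standard_normal((9, 4))
RQ = cross_attention_relation(P, Q)
```

The hw x hw correlation matrix makes the cost of the dense connection explicit: every query patch receives evidence from every support patch, including semantically mismatched ones.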
To establish concise and clear connections between global and local features, we propose the Clustered-patch Element Connection (CEC) layer, which consists of two key operations: Patch Cluster and Element Connection.

4 Clustered-patch Element Connection

4.1 Patch Cluster

As illustrated in Fig.2, the Patch Cluster operation obtains a set $C_p$, named the Clustered-patch, by collecting those objects in the support feature set $P$ that are similar to the reference patches in $Q$. We define a generic Patch Cluster operation $f^{PC}$ as:

$$C_p = f^{PC}(Q, P) = \phi\big( g(Q, P)\, P \big). \qquad (2)$$

Here $P$ is the input source feature, $Q$ is the input reference feature, and $C_p \in \mathbb{R}^{hw \times c}$ is the output Clustered-patch. A pairwise function $g$ computes an affinity matrix representing the relationship between $Q$ and $P$, and the clustered patches can be refined by the function $\phi$.

In detail, we divide the source image into $w \times h$ patches, where $w$ and $h$ equal the spatial size of the features in $P$, which is convenient for the Element Connection operation. A Clustered-patch $C_p \in \mathbb{R}^{hw \times c}$ collects $w \times h$ clusters. Each cluster collects the patch-features in $P = [P_1, P_2, \dots, P_{hw}] \in \mathbb{R}^{hw \times c}$ that are similar to the corresponding patch-feature among the reference patch-features in $Q$. Therefore, $C_p$ is semantically similar to $Q$. To implement the Patch Cluster operation, we give four solutions: MatMul, Cosine, GCN and Transformer.

MatMul. The simplest way to obtain the clustered patches is to use matrix multiplication as the pairwise function $g$ in Eq.(2) without any further refinement. Formally,

$$C_p = \sigma(Q P^T)\, P, \qquad (3)$$

where $\sigma$ is the softmax function.

Cosine. A simple extension of the MatMul version is to compute cosine similarity in feature space. Formally,

$$C_p = \sigma\Big( \tfrac{Q}{\|Q\|_2} \big( \tfrac{P}{\|P\|_2} \big)^T \Big) P. \qquad (4)$$

GCN. GCN [Kipf and Welling, 2017] updates the input features $P$ via a pre-defined adjacency matrix $A \in \mathbb{R}^{hw \times hw}$ and a learnable weight matrix $W \in \mathbb{R}^{c \times c}$.
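The two fixed Patch Cluster modes above can be sketched in a few lines of NumPy. This is a minimal sketch under toy assumptions (hw = 9 patches of c = 4 channels, random features standing in for backbone outputs):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def patch_cluster_matmul(Q, P):
    # Eq.(3): Cp = softmax(Q P^T) P -- each clustered patch is a convex
    # combination of support patches, weighted by similarity to the query patch.
    return softmax(Q @ P.T, axis=-1) @ P            # (hw, c)

def patch_cluster_cosine(Q, P):
    # Cosine mode: the affinity is computed on L2-normalised features.
    Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    Pn = P / np.linalg.norm(P, axis=1, keepdims=True)
    return softmax(Qn @ Pn.T, axis=-1) @ P          # (hw, c)

rng = np.random.default_rng(1)
P, Q = rng.standard_normal((9, 4)), rng.standard_normal((9, 4))
Cp = patch_cluster_matmul(Q, P)
```

Because each softmax row sums to one, every row of Cp stays inside the convex hull of the support patches, which is what makes the clustered-patch a stable, semantically aligned summary of P.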
Formally, the updated features $G_p \in \mathbb{R}^{hw \times c}$ can be expressed as $G_p = \delta(A P W)$, where $\delta(\cdot)$ is a nonlinear activation function (Sigmoid or ReLU). However, the adjacency matrix $A$ used in GCN is fixed for all inputs after training, and is thus unable to adapt to the new categories in a few-shot task. Comparing Eq.(2) with the definition of GCN, we observe that the affinity matrix $g(Q, P)$ can be regarded as the adjacency matrix $A$, because both describe the relationship between features $P$ and $Q$. Hence, we derive a meta-GCN by replacing the static adjacency matrix with the dynamic affinity matrix. Formally, the meta-GCN based Patch Cluster operation is:

$$C_p = \delta\big( g(Q, P)\, P\, W \big). \qquad (5)$$

Transformer. The Transformer [Vaswani et al., 2017] based Patch Cluster operation is defined as:

$$C_p = \mathrm{FFN}\big\{ \sigma[(W_q Q)(W_k P)^T]\, W_v P \big\}, \qquad (6)$$

where FFN is the Feed-Forward Network of the transformer, and $W_q$, $W_k$, $W_v$ are learnable weights (e.g., convolution layers).

4.2 Element Connection

Given the global semantic features $C_p$ obtained from the Patch Cluster operation, the Element Connection operation generates the relation map $R^Q$ for $Q$ by simply computing the patch-wise cosine similarity between $Q$ and $C_p$. Finally, we obtain a rectified discriminative representation via the Element Connection operation $f^{EC}$:

$$\bar{Q} = f^{EC}(Q, C_p) = \big( \sigma(R^Q) + 1 \big) \odot Q, \quad \text{where } R^Q = \tfrac{Q}{\|Q\|_2} \circledast \tfrac{C_p}{\|C_p\|_2}, \qquad (7)$$

where $\circledast$ is the patch-wise dot product and $\odot$ is the element-wise product. The $n$-th position of $R^Q$ is $R^Q_n = \tfrac{Q_n}{\|Q_n\|_2} \cdot \tfrac{C_{p,n}}{\|C_{p,n}\|_2}$, where $\cdot$ is the dot product. Visualizations of the CEC-based relation map $R^Q$ are shown in the last column of Fig.4(b).

Overall, the Clustered-patch Element Connection (CEC) layer is able to highlight the regions of $Q$ that are semantically similar to $P$. Formally, the CEC layer $f^{CEC}$ is expressed as:

$$\bar{Q} = f^{CEC}(Q, P) = f^{EC}\big( Q, f^{PC}(Q, P) \big). \qquad (8)$$
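Combining the two operations gives the full CEC layer. A minimal NumPy sketch of Eqs.(7)-(8), assuming the MatMul mode for f_PC and taking sigma to be the same softmax as in Eq.(3) (our reading; toy shapes):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cec(Q, P):
    """CEC layer: Patch Cluster (MatMul mode) followed by Element Connection."""
    Cp = softmax(Q @ P.T, axis=-1) @ P                       # f_PC, Eq.(3)
    Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    Cn = Cp / np.linalg.norm(Cp, axis=1, keepdims=True)
    RQ = (Qn * Cn).sum(axis=1)                               # patch-wise cosine, 1-to-1
    Q_bar = (softmax(RQ) + 1.0)[:, None] * Q                 # (sigma(RQ) + 1) elementwise
    return Q_bar, RQ

rng = np.random.default_rng(2)
P, Q = rng.standard_normal((9, 4)), rng.standard_normal((9, 4))
Q_bar, RQ = cec(Q, P)
```

Note the 1-to-1 structure: each query patch is connected only to its own clustered-patch row, and the "+1" residual gate guarantees that no patch is suppressed, only that matching patches are amplified.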
4.3 Discussion

Compared with traditional Cross Attention, the key point of our Clustered-patch Element Connection is the global-to-local element connection between the Clustered-patch $C_p$ (global) and the query $Q$ (local). It generates a clearer and more precise relation map, as shown in the visualizations of Fig.4(a). As demonstrated in Tab.2, our CEC-based approach achieves a 4% accuracy improvement over the traditional Cross Attention based CAN [Hou et al., 2019].

In general, the advantages of our Clustered-patch Element Connection are: (I) The relation map generated by Element Connection is more confident than that of Cross Attention, because the global Clustered-patch feature $C_p$ is more stable and representative than the local feature $P$. (II) Element Connection (1-to-1 patch connection) has a clearer connection relationship than Cross Attention (1-to-$hw$ patch connection).

Moreover, the respective advantages of the different Patch Cluster solutions are: (I) The four solutions can be divided into two groups: fixed (MatMul and Cosine) and learnable (GCN and Transformer). The fixed solutions perform patch clustering without additional learnable parameters, while the learnable solutions are data-driven and refine the affinity matrix or the clustered-patch. (II) According to the experimental results in Tab.2, the learnable solutions are better than the fixed ones when applied as an embedding layer for feature enhancement (i.e., CECM defined in Eq.9), which indicates that the learnable solutions generate better embedding features. In contrast, according to Tab.3, the fixed solutions are better than the learnable ones when applied as the distance metric for measuring similarity (i.e., CECD defined in Eq.11), which indicates that the fixed solutions obtain more reliable similarity scores.
5 CEC Network for Few-Shot Classification

5.1 CEC Module and Self-CEC Module

Based on the CEC layer described above, we propose two derivative modules: the CEC Module (CECM) and the Self-CEC Module (Self-CECM). The CECM highlights the mutually similar regions by learning the semantic relevance between the pair-features. Specifically, CECM transfers the input pair-features $(P, Q) \in \mathbb{R}^{hw \times c}$ into more discriminative representations $(\bar{P}, \bar{Q}) \in \mathbb{R}^{hw \times c}$. Formally, its function $f^{CECM}$ is expressed as:

$$(\bar{Q}, \bar{P}) = f^{CECM}(Q, P), \quad \text{where } \bar{Q} = f^{CEC}(Q, P),\; \bar{P} = f^{CEC}(P, Q). \qquad (9)$$

The Self-CECM enhances the semantic feature of the target object via self-connection, turning the input $Q$ into $\bar{Q} \in \mathbb{R}^{hw \times c}$. Formally, the Self-CECM function $f^{SCECM}$ is expressed as:

$$\bar{Q} = f^{SCECM}(Q) = f^{CEC}(Q, Q). \qquad (10)$$

The CECM exploits the relation between $P$ and $Q$ via $\bar{Q} = f^{CEC}(Q, P)$, while the Self-CECM exploits the relation within the input itself via $\bar{Q} = f^{CEC}(Q, Q)$, i.e., Self-CECM explores the relations among the patches of the input image. Because we assume that the patch-features of the target are mutually similar, Self-CECM can enhance the target region by clustering the similar regions.

5.2 CECNet Framework

We now present the overall Clustered-patch Element Connection Network (CECNet). The framework is shown in Fig.3: it integrates the CECM, Metric Classifier and Fine-tune Classifier for the few-shot classification task, and the Rotation Classifier and Global Classifier for the auxiliary tasks. The pipeline involves three stages: Base Training, Novel Fine-tuning and Novel Inference.

Base Training. As illustrated in Fig.3, every image $x^q$ in the query set $Q = \{(x^q_i, y^q_i)\}_{i=1}^{n_q}$ is rotated by [0°, 90°, 180°, 270°], producing a rotated set $\tilde{Q} = \{(\tilde{x}^q_i, \tilde{y}^q_i)\}_{i=1}^{4n_q}$. The support subset $S^k$ and the rotated query instance $\tilde{x}^q$ are processed by the embedding $f_\theta$ to produce the prototype feature $P^k = \frac{1}{|S^k|} \sum_{x^s_i \in S^k} f_\theta(x^s_i)$ and the query feature $Q = f_\theta(\tilde{x}^q) \in \mathbb{R}^{c \times h \times w}$, respectively.
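The prototype step above is a plain mean over the embedded support shots. A sketch under toy assumptions (random arrays stand in for f_theta outputs; K = 5 shots, hw = 9 patches, c = 4 channels are illustrative):

```python
import numpy as np

def prototype(support_features):
    # P^k = (1/|S^k|) * sum of the embedded support features of class k
    return np.mean(support_features, axis=0)

rng = np.random.default_rng(3)
support = rng.standard_normal((5, 9, 4))   # (K, hw, c): 5-shot toy class
Pk = prototype(support)                    # (hw, c) class prototype
```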
Then each pair of features $(P^k, Q)$ is processed by the CECM to enhance the mutually similar regions and generate more discriminative features $(\bar{P}^k, \bar{Q}^k)$ for the subsequent classification. Note that the inputs and outputs of the CECM are reshaped to satisfy its format. Finally, CECNet is optimized via a multi-task loss contributed by the metric classifier and the auxiliary tasks.

Novel Fine-tuning. The Fine-tune Classifier consists of a Self-CECM and a linear layer, as shown in Fig.3. In the fine-tuning phase, the pre-trained embedding $f_\theta$ is frozen, and the Fine-tune Classifier is optimized with a cross-entropy loss.

Novel Inference. In inductive inference, the overall prediction of CECNet is $Y = Y_M + Y_F$, where $Y_M$ and $Y_F$ are the results of the Metric and Fine-tune Classifiers, respectively.

Figure 3: The proposed CECNet framework. The CECM highlights the mutually similar regions, and the CECD is utilized to measure the similarity of pair-features. The Self-CECM enhances the semantic feature of the target object via self-connection.

5.3 Metric Classifier

As illustrated in Eq.(7), the proposed CEC layer generates a reliable relation map $R^Q$. This relation map can also be utilized as a similarity map, and the mean of $R^Q$ is the similarity score. Therefore, we obtain the CECD distance metric $d^{CECD}$, which is expressed as:

$$d^{CECD}(\bar{Q}, \bar{P}) = \tfrac{\bar{Q}}{\|\bar{Q}\|_2} \circledast \tfrac{C_p}{\|C_p\|_2}, \quad \text{where } C_p = f^{PC}(\bar{Q}, \bar{P}). \qquad (11)$$

With the proposed CECD distance metric, the Metric Classifier makes predictions by measuring the similarity between the query and the N support classes. Following [Hou et al., 2019], a patch-wise classification strategy is used to produce precise feature representations. In detail, each patch-wise feature $\bar{Q}^k_n$ at the $n$-th spatial position of $\bar{Q}^k$ is classified into N classes. The probability of predicting $\bar{Q}^k_n$ as the $k$-th class is:
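The CECD metric of Eq.(11) can be sketched in NumPy using the fixed Cosine mode for the inner Patch Cluster, with the episode-level score taken as the mean of the relation map as described above (toy shapes; random features are illustrative stand-ins for the CECM outputs):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cecd_score(Qb, Pb):
    """Similarity score between an enhanced query/prototype pair (Q_bar, P_bar)."""
    Qn = Qb / np.linalg.norm(Qb, axis=1, keepdims=True)
    Pn = Pb / np.linalg.norm(Pb, axis=1, keepdims=True)
    Cp = softmax(Qn @ Pn.T, axis=-1) @ Pb          # Patch Cluster, Cosine mode
    Cn = Cp / np.linalg.norm(Cp, axis=1, keepdims=True)
    R = (Qn * Cn).sum(axis=1)                      # relation map, each entry in [-1, 1]
    return float(R.mean())                         # scalar similarity score

rng = np.random.default_rng(4)
Qb = rng.standard_normal((9, 4))
s_self = cecd_score(Qb, Qb)                        # query vs. itself
s_rand = cecd_score(Qb, rng.standard_normal((9, 4)))
```

Because every entry of the relation map is a cosine, the resulting score is bounded in [-1, 1], which keeps the per-class logits on a comparable scale across episodes.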
$$\hat{Y}(y = k \mid \bar{Q}^k_n) = \frac{\exp(R^k_n)}{\sum_{i=1}^{N} \exp(R^i_n)}, \quad \text{where } R^k = d^{CECD}(\bar{Q}^k, \bar{P}^k) \in \mathbb{R}^{hw}, \qquad (12)$$

where the similarity map $R^k$ is obtained by the CECD distance metric of Eq.(11), and the similarity score $R^k_n$ is the $n$-th position of $R^k$.

5.4 Fine-tune Classifier

The Fine-tune Classifier consists of a Self-CECM and a linear layer. It predicts the query feature $\bar{Q}$ into N categories through a linear layer $W_F$, and its loss is computed as:

$$L_F = P_{CE}\big( W_F(\bar{Q}), N^q \big) = -\sum_{n} \sum_{i} N^q_i \log \sigma\big( W_F(\bar{Q}_n) \big)_i, \qquad (13)$$

where $P_{CE}$ is the patch-wise cross-entropy, and $N^q_i$ is the ground truth of $x^q_i$ over the N categories of the few-shot task.

5.5 Objective Functions in Base Training

Metric Loss. The metric classification loss with the ground-truth few-shot label $y^q$ is:

$$L_M = -\sum_{i} \sum_{n} \log \hat{Y}\big( y = y^q_i \mid (\bar{Q}_n)_i \big). \qquad (14)$$

Auxiliary Loss. The loss of the Global Classifier is $L_G = P_{CE}(W_G(\bar{Q}), D^q)$, where $D^q_i$ is the global category of $x^q_i$ over the D classes of the training set, and $W_G$ is a fully-connected layer. Similarly, the loss of the Rotation Classifier is $L_R = P_{CE}(W_R(\bar{Q}), B^q)$, where $B^q_i$ is the rotation category of $x^q_i$ over four classes, and $W_R$ is a fully-connected layer.

Multi-Task Loss. Inspired by [Jinxiang and Siqian, 2022], the overall loss is defined as:

$$L = \sum_{j} \Big[ (\lambda + w_j) L_j + \log \frac{1}{\lambda + w_j} \Big], \qquad (15)$$

where $w_j = \frac{1}{2\alpha_j^2}$ and $\alpha_j$ is a learnable variable. The hyperparameter $\lambda$ balances the few-shot and auxiliary tasks; its influence is studied in Tab.4.

6 Experiments on Few-Shot Classification

Datasets. The two popular FSL classification benchmarks are miniImageNet and tieredImageNet; detailed introductions are presented in the APPENDIX.

Experimental Setup. We report the mean accuracy over 2000 test episodes randomly sampled from the meta-test set. According to Tab.4, the hyperparameter $\lambda$ is set to 1.0 and 2.0 for ResNet-12 and WRN-28, respectively. Other implementation details can be found in our public code.
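The adaptive weighting of the multi-task loss in Eq.(15) can be checked numerically. A NumPy sketch (the alpha values below are arbitrary illustrations, not trained quantities):

```python
import numpy as np

def multi_task_loss(losses, alphas, lam=1.0):
    # Eq.(15): sum_j [(lam + w_j) * L_j + log(1 / (lam + w_j))],
    # with w_j = 1 / (2 * alpha_j^2) a learnable per-task weight.
    w = 1.0 / (2.0 * np.asarray(alphas, dtype=float) ** 2)
    L = np.asarray(losses, dtype=float)
    return float(np.sum((lam + w) * L + np.log(1.0 / (lam + w))))

# One task with L = 1.0, alpha = 1.0, lam = 1.0:
# w = 0.5, so the total is 1.5 * 1.0 + log(1/1.5).
total = multi_task_loss(losses=[1.0], alphas=[1.0], lam=1.0)
```

The log term penalizes driving a task weight toward zero, so the model cannot trivially ignore a task, while lambda sets a floor on every task's weight.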
6.1 Comparison with State-of-the-Arts

As shown in Tab.1, we compare with the state-of-the-art few-shot methods on the miniImageNet and tieredImageNet datasets. Our CECNet outperforms the existing SOTAs, which demonstrates the effectiveness of our CEC-based methods. Different from existing metric-based methods [Zhang et al., 2020a; Yang et al., 2022; Jiangtao et al., 2022], which extract support and query features independently, our CECNet enhances the semantic feature regions of mutually similar objects and obtains more discriminative representations. Compared to the metric-based Meta-DeepBDC [Jiangtao et al., 2022], CECNet achieves 1.98% higher accuracy on 1-shot. Some metric-based methods [Xu et al., 2021a; Hou et al., 2019] apply cross attention, yet our CECNet still surpasses DANet [Xu et al., 2021a] with an accuracy improvement of up to 2.36% under the WRN-28 backbone, which demonstrates the strength of our Clustered-patch Element Connection.

| Model | Backbone | miniImageNet 1-shot | miniImageNet 5-shot | tieredImageNet 1-shot | tieredImageNet 5-shot |
|---|---|---|---|---|---|
| ProtoNet [Snell et al., 2017] | Conv4 | 49.42±0.78 | 68.20±0.66 | 53.31±0.89 | 72.69±0.74 |
| Our CECNet | Conv4 | 54.45±0.47 | 70.57±0.38 | 56.59±0.50 | 72.86±0.42 |
| CAN [Hou et al., 2019] | ResNet-12 | 63.85±0.48 | 79.44±0.34 | 69.89±0.51 | 84.23±0.37 |
| DeepEMD [Zhang et al., 2020a] | ResNet-12 | 65.91±0.82 | 82.41±0.56 | 71.16±0.87 | 86.03±0.58 |
| IENet [Rizve et al., 2021] | ResNet-12 | 66.82±0.80 | 84.35±0.51 | 71.87±0.89 | 86.82±0.58 |
| DANet [Xu et al., 2021a] | ResNet-12 | 67.76±0.46 | 82.71±0.31 | 71.89±0.52 | 85.96±0.35 |
| MCL [Yang et al., 2022] | ResNet-12 | 67.36±0.20 | 83.63±0.20 | 71.76±0.20 | 86.01±0.20 |
| Meta-DeepBDC [Jiangtao et al., 2022] | ResNet-12 | 67.34±0.43 | 84.46±0.28 | 72.34±0.49 | 87.31±0.32 |
| Our CECNet | ResNet-12 | 69.32±0.46 | 84.65±0.32 | 73.14±0.50 | 86.88±0.36 |
| PSST [Zhengyu et al., 2021] | WRN-28 | 64.16±0.44 | 80.64±0.32 | - | - |
| DANet [Xu et al., 2021a] | WRN-28 | 67.84±0.46 | 82.74±0.31 | 72.18±0.52 | 86.26±0.35 |
| Our CECNet | WRN-28 | 70.20±0.46 | 85.00±0.30 | 73.84±0.50 | 87.36±0.34 |

Table 1: Comparison with existing approaches on the 5-way FSL classification task on miniImageNet and tieredImageNet. Our CECNet adopts the proposed CECM(T) attention module, CECD(C) distance metric, and Self-CECM.

| Model | Attention Module | Distance Metric | Param | 1-shot | 5-shot |
|---|---|---|---|---|---|
| ProtoG | - | cosine | 7.75M | 61.87 | 78.87 |
| CAN | CAM | cosine | 7.75M | 63.85 | 79.44 |
| CECNet | CECM(M) | cosine | 7.75M | 67.69 | 81.84 |
| CECNet | CECM(C) | cosine | 7.75M | 67.65 | 81.79 |
| CECNet | CECM(G) | cosine | 8.00M | 67.80 | 82.15 |
| CECNet | CECM(T) | cosine | 10.25M | 67.91 | 82.40 |

Table 2: The 5-way classification results studying the influence of CECM with ResNet-12. In line with the setting of CAN, the cosine distance metric is applied, and the Rotation and Fine-tune classifications are not applied. CECM(M/C/G/T) denote the different Patch Cluster modes: MatMul, Cosine, GCN and Transformer. Based on ProtoNet, ProtoG adds the auxiliary global classification task.

6.2 Ablation Study

Influence of CECM. As shown in Tab.2, comparing CECNet to ProtoG shows consistent improvements on 1/5-shot classification, because our CECM enhances the mutually similar regions and produces more discriminative representations. Compared with CAN, which adopts the cross attention module CAM, our CECNet achieves obvious improvements of up to 4.06% on the 1-shot task. The results of CECM(M), CECM(C), CECM(G) and CECM(T) show that CECM is not sensitive to the alternative modes (MatMul, Cosine, GCN and Transformer), which indicates that the generic Patch Cluster behavior is the key source of the improvements.

Influence of CECD. As shown in Tab.3 (without attention module), comparing CECNet to ProtoG shows consistent improvements, because our CECD distance metric obtains a more reliable similarity map. Besides, the results show that the best combination is CECM(T) + CECD(C).

Influence of Multi-Task Loss. As shown in Tab.4, with the integration of the auxiliary tasks our CECNet obtains large improvements, which indicates that learning a good embedding is helpful.
Influence of CECM+CECD. As shown in Tab.5, compared to ProtoG (no attention + cosine), our methods adopting CECM(T) + cosine and no attention + CECD(C) achieve obvious improvements, which demonstrates the effectiveness of the proposed CECM and CECD. The combination CECM(T) + CECD(C) obtains further performance gains.

Influence of Self-CECM. As illustrated in Tab.6, the baseline is the Metric Classifier of CECNet, and the competitor is the Fine-tune Classifier with only a linear layer. Comparing Self-CECM+Linear to Linear shows consistent improvements, which demonstrates the usefulness of Self-CECM. Comparing Metric+Fine-tune to the Metric Classifier alone shows an improvement on 5-shot classification.

| Model | Attention Module | Distance Metric | Param | 1-shot | 5-shot |
|---|---|---|---|---|---|
| ProtoG | - | cosine | 7.75M | 61.87 | 78.87 |
| CECNet | - | CECD(M) | 7.75M | 67.50 | 82.00 |
| CECNet | - | CECD(C) | 7.75M | 67.89 | 82.02 |
| CECNet | - | CECD(G) | 8.00M | 67.79 | 81.74 |
| CECNet | - | CECD(T) | 10.25M | 67.44 | 81.17 |
| CECNet | CECM(T) | CECD(M) | 10.25M | 67.64 | 81.24 |
| CECNet | CECM(T) | CECD(C) | 10.25M | 68.27 | 82.59 |
| CECNet | CECM(T) | CECD(G) | 11.25M | 66.52 | 78.55 |
| CECNet | CECM(T) | CECD(T) | 12.75M | 64.37 | 78.32 |

Table 3: The 5-way classification results studying the influence of CECD with ResNet-12. The setting is consistent with Tab.2, except for the distance metric. CECD(M/C/G/T) denote the different modes: MatMul, Cosine, GCN and Transformer.

| λ | Metric | Global | Rotation | ResNet-12 1-shot | ResNet-12 5-shot | WRN-28 1-shot | WRN-28 5-shot |
|---|---|---|---|---|---|---|---|
| - | 0.5 | - | - | 62.45 | 79.50 | 61.98 | 76.64 |
| - | 0.5 | - | 1.0 | 65.54 | 79.55 | 63.47 | 77.62 |
| - | 0.5 | 1.0 | - | 68.27 | 82.59 | 67.13 | 81.95 |
| - | 0.5 | 1.0 | 1.0 | 68.86 | 83.67 | 69.49 | 83.71 |
| 0.5 | 0.5 | w_G | w_R | 69.05 | 83.86 | 69.33 | 83.55 |
| 1.0 | 0.5 | w_G | w_R | 69.32 | 84.21 | 69.66 | 84.09 |
| 1.5 | 0.5 | w_G | w_R | 69.15 | 84.03 | 69.86 | 84.30 |
| 2.0 | 0.5 | w_G | w_R | 69.18 | 83.29 | 70.20 | 84.59 |

Table 4: The 5-way classification results on miniImageNet studying the influence of the multi-task loss applied in CECNet. The middle columns are the loss weights of the Metric, Global and Rotation classifiers.

| Attention Module | Distance Metric | Param | 1-shot | 5-shot |
|---|---|---|---|---|
| - | cosine | 7.75M | 65.59±0.47 | 80.94±0.33 |
| CECM(T) | cosine | 10.25M | 68.27±0.46 | 83.43±0.32 |
| - | CECD(C) | 7.75M | 68.79±0.46 | 83.39±0.32 |
| CECM(T) | CECD(C) | 10.25M | 69.32±0.46 | 84.21±0.32 |

Table 5: The 5-way results studying the influence of CECM+CECD, under ResNet-12 applying the multi-task loss with λ = 1.0.

Figure 4: (a) The class activation maps on 5-way 1-shot classification, where Embedding belongs to CECNet. (b) Visualizations of our CEC-based relation map $R^Q$.

6.3 Visualization Analysis

Fig.4(a) shows the class activation maps [Bolei et al., 2016] of our CECNet and CAN [Hou et al., 2019]. Comparing CECNet to its Embedding, CECNet can highlight target objects that are unseen in the pre-training stage. Compared to CAN, CECNet is more accurate and has larger receptive fields. The essential reason is that our Clustered-patch Element Connection utilizes global information to implement the element connection, leading to more confident correlations and clearer connections. Fig.4(b) shows visualizations of the CEC-based relation map $R^Q$ generated by CECNet via Eq.(7). Our CEC approach produces a high-quality relation map with a more complete region for the target.

7 Applications on FSSS and FSOD Tasks

In this section, we first introduce a novel CEC-based embedding module named CEC Embedding (CECE). Then, we extend the proposed CECE to the few-shot semantic segmentation (FSSS) and few-shot object detection (FSOD) tasks. The experimental results in Tab.7 and Tab.8 show that our CECE achieves performance improvements of around 1% to 3%; more extensive results are presented in the APPENDIX.

The CEC Embedding $f^{CECE}$ is expressed as:

$$Q' = f^{CECE}(Q) = f^{CEC}(Q, W_E), \qquad (16)$$

where $\{Q, Q'\} \in \mathbb{R}^{hw \times c}$ are the input and output features respectively, and $W_E \in \mathbb{R}^{n_e \times c}$ are learnable weights (in PyTorch, WE = nn.Embedding(ne, c), where ne represents the number of semantic groups; the empirical setting is ne = 5).

| Model | PASCAL-5i 1-shot | PASCAL-5i 5-shot | COCO-20i 1-shot | COCO-20i 5-shot |
|---|---|---|---|---|
| PPNet [Liu et al., 2020] | 51.5 | 62.0 | 25.7 | 36.2 |
| RePRI [Malik et al., 2021] | 59.3 | 64.8 | 36.6 | 45.2 |
| RePRI+CECE(M) | 60.4 | 66.5 | 38.3 | 46.9 |
| RePRI+CECE(T) | 60.5 | 66.2 | 38.1 | 46.7 |

Table 7: Comparison on the PASCAL-5i and COCO-20i few-shot semantic segmentation benchmarks using mIoU with ResNet-50. CECE(M/T) denote the MatMul and Transformer modes.
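A minimal sketch of CECE (Eq.(16)): a CEC layer run against a small bank of semantic vectors instead of a support feature. Here W_E is a random stand-in for the learned nn.Embedding weights, and the MatMul mode is assumed for f_PC:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cece(Q, WE):
    """CEC between input patches Q (hw, c) and ne semantic vectors WE (ne, c)."""
    Cp = softmax(Q @ WE.T, axis=-1) @ WE                 # cluster against the bank
    Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    Cn = Cp / np.linalg.norm(Cp, axis=1, keepdims=True)
    RQ = (Qn * Cn).sum(axis=1)                           # patch-wise cosine
    return (softmax(RQ) + 1.0)[:, None] * Q              # enhanced feature Q'

rng = np.random.default_rng(5)
Q = rng.standard_normal((9, 4))
WE = rng.standard_normal((5, 4))     # ne = 5 semantic groups (paper's setting)
Q_out = cece(Q, WE)
```

Since the output keeps the input's shape, such a module can be dropped in after any backbone stage, which is exactly how it is stacked onto the FSSS/FSOD methods below.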
The proposed CECE can enhance the target regions of the input features that are semantically similar to $W_E$, where $W_E$ contains the semantic information of the base categories after training on the base dataset.

CECE Applications. As an embedding module, our CECE can be stacked after the backbone network. To verify its effectiveness, we insert it into the FSSS method RePRI [Malik et al., 2021] and the FSOD method MFDC [Wu et al., 2022] by stacking CECE after their backbones. As illustrated in Tab.7 and Tab.8, our CECE brings consistent improvements upon the RePRI and MFDC methods.

| Metric Classifier | Self-CECM | Linear | 1-shot | 5-shot |
|---|---|---|---|---|
| ✓ | - | - | 70.20±0.46 | 84.59±0.30 |
| - | - | ✓ | 69.20±0.47 | 84.40±0.30 |
| - | ✓ | ✓ | 69.36±0.46 | 84.78±0.30 |
| ✓ | ✓ | ✓ | 70.20±0.46 | 85.00±0.30 |

Table 6: The 5-way results of CECNet studying the influence of Self-CECM, under WRN-28 applying the multi-task loss with λ = 2.0.

| Model | PASCAL 1-shot | PASCAL 5-shot | COCO 1-shot | COCO 5-shot |
|---|---|---|---|---|
| DeFRCN [Qiao et al., 2021] | 52.5 | 60.7 | 6.5 | 15.3 |
| MFDC [Wu et al., 2022] | 56.1 | 62.2 | 10.8 | 16.4 |
| MFDC+CECE(M) | 59.4 | 63.4 | 11.5 | 17.2 |
| MFDC+CECE(T) | 58.7 | 64.9 | 11.2 | 16.9 |

Table 8: Comparison on the PASCAL Novel Split 3 (nAP50) and COCO (nmAP) few-shot object detection benchmarks with ResNet-101.

8 Conclusion

We propose a novel Clustered-patch Element Connection Network (CECNet) for few-shot classification. First, we design a Clustered-patch Element Connection (CEC) layer, which strengthens the target regions of query features by element-wisely connecting them with the clustered-patch features.
Then three useful CEC-based modules are derived: CECM and Self-CECM generate more discriminative features, and the CECD distance metric obtains a reliable similarity map. Extensive experiments prove that our method is effective and achieves state-of-the-art results on few-shot classification benchmarks. Furthermore, our CEC approach can be extended to few-shot segmentation and detection tasks, where it achieves competitive improvements.

References
[Bolei et al., 2016] Zhou Bolei, Khosla Aditya, Lapedriza Agata, Oliva Aude, and Torralba Antonio. Learning deep features for discriminative localization. In CVPR, 2016.
[Bruna et al., 2013] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.
[Finn et al., 2017] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
[Hou et al., 2019] Ruibing Hou, Hong Chang, MA Bingpeng, Shiguang Shan, and Xilin Chen. Cross attention network for few-shot classification. In NeurIPS, 2019.
[Jiangtao et al., 2022] Xie Jiangtao, Long Fei, Lv Jiaming, Wang Qilong, and Li Peihua. Joint distribution matters: Deep Brownian distance covariance for few-shot classification. In CVPR, 2022.
[Jinxiang and Siqian, 2022] Lai Jinxiang and Yang Siqian. Adaptive multi distance metrics for few-shot classification. arXiv preprint, 2022.
[Kang et al., 2019] Bingyi Kang, Zhuang Liu, Xin Wang, Fisher Yu, Jiashi Feng, and Trevor Darrell. Few-shot object detection via feature reweighting. In ICCV, pages 8420-8429, 2019.
[Kipf and Welling, 2017] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
[Krizhevsky et al., 2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In NeurIPS, 2012.
[Liu et al., 2020] Yongfei Liu, Xiangyi Zhang, Songyang Zhang, and Xuming He. Part-aware prototype network for few-shot semantic segmentation. In ECCV, 2020.
[Malik et al., 2021] Boudiaf Malik, Kervadec Hoel, Imtiaz Masud Ziko, Piantanida Pablo, Ben Ayed Ismail, and Dolz Jose. Few-shot segmentation without meta-learning: A good transductive inference is all you need? In CVPR, 2021.
[Qiao et al., 2021] Limeng Qiao, Yuxuan Zhao, Zhiyuan Li, Xi Qiu, Jianan Wu, and Chi Zhang. DeFRCN: Decoupled Faster R-CNN for few-shot object detection. In ICCV, 2021.
[Rizve et al., 2021] Mamshad Nayeem Rizve, Salman Khan, Fahad Shahbaz Khan, and Mubarak Shah. Exploring complementary strengths of invariant and equivariant representations for few-shot learning. In CVPR, 2021.
[Siam et al., 2019] Mennatullah Siam, Boris N Oreshkin, and Martin Jagersand. AMP: Adaptive masked proxies for few-shot segmentation. In ICCV, 2019.
[Snell et al., 2017] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In NeurIPS, 2017.
[Tian et al., 2020] Yonglong Tian, Yue Wang, Dilip Krishnan, Joshua B Tenenbaum, and Phillip Isola. Rethinking few-shot image classification: a good embedding is all you need? In ECCV, 2020.
[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[Vinyals et al., 2016] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In NeurIPS, 2016.
[Wang et al., 2020] Xin Wang, Thomas E Huang, Trevor Darrell, Joseph E Gonzalez, and Fisher Yu. Frustratingly simple few-shot object detection. arXiv preprint arXiv:2003.06957, 2020.
[Wu et al., 2022] Shuang Wu, Wenjie Pei, Dianwen Mei, Fanglin Chen, Jiandong Tian, and Guangming Lu. Multi-faceted distillation of base-novel commonality for few-shot object detection. In ECCV, 2022.
[Xu et al., 2021a] Chengming Xu, Yanwei Fu, Chen Liu, Chengjie Wang, Jilin Li, Feiyue Huang, Li Zhang, and Xiangyang Xue. Learning dynamic alignment via meta-filter for few-shot learning. In CVPR, 2021.
[Xu et al., 2021b] Luo Xu, Wei Longhui, Wen Liangjian, Yang Jinrong, Xie Lingxi, Xu Zenglin, and Tian Qi. Rectifying the shortcut learning of background for few-shot learning. In NeurIPS, 2021.
[Yang et al., 2022] Liu Yang, Zhang Weifeng, Xiang Chao, Zheng Tu, Cai Deng, and He Xiaofei. Learning to affiliate: Mutual centralized learning for few-shot classification. In CVPR, 2022.
[Zhang et al., 2020a] Chi Zhang, Yujun Cai, Guosheng Lin, and Chunhua Shen. DeepEMD: Few-shot image classification with differentiable earth mover's distance and structured classifiers. In CVPR, 2020.
[Zhang et al., 2020b] Xiaolin Zhang, Yunchao Wei, Yi Yang, and Thomas S Huang. SG-One: Similarity guidance network for one-shot semantic segmentation. IEEE Transactions on Cybernetics, 2020.
[Zhengyu et al., 2021] Chen Zhengyu, Ge Jixie, Zhan Heshen, Huang Siteng, and Wang Donglin. Pareto self-supervised training for few-shot learning. In CVPR, 2021.