# TransHP: Image Classification with Hierarchical Prompting

Wenhao Wang (ReLER, University of Technology Sydney), Yifan Sun (Baidu Inc.), Wei Li (Zhejiang University), Yi Yang (Zhejiang University)

This paper explores a hierarchical prompting mechanism for the hierarchical image classification (HIC) task. Different from prior HIC methods, our hierarchical prompting is the first to explicitly inject ancestor-class information as a tokenized hint that benefits descendant-class discrimination. We believe it closely imitates human visual recognition: humans may use the ancestor class as a prompt to draw focus on the subtle differences among descendant classes. We model this prompting mechanism as a Transformer with Hierarchical Prompting (TransHP). TransHP consists of three steps: 1) learning a set of prompt tokens to represent the coarse (ancestor) classes, 2) on-the-fly predicting the coarse class of the input image at an intermediate block, and 3) injecting the prompt token of the predicted coarse class into the intermediate feature. Though the parameters of TransHP remain the same for all input images, the injected coarse-class prompt conditions (modifies) the subsequent feature extraction and encourages a dynamic focus on the relatively subtle differences among descendant classes. Extensive experiments show that TransHP improves image classification in accuracy (e.g., improving ViT-B/16 by +2.83% ImageNet classification accuracy), training data efficiency (e.g., +12.69% improvement under 10% ImageNet training data), and model explainability. Moreover, TransHP also performs favorably against prior HIC methods, showing that TransHP exploits the hierarchical information well. The code is available at: https://github.com/WangWenhao0716/TransHP.

## 1 Introduction

Hierarchical image classification (HIC) aims to exploit the semantic hierarchy to improve prediction accuracy. More concretely, HIC provides additional coarse labels (e.g., Rose) which indicate the ancestors of the relatively fine labels (e.g., China Rose and Rose Peace). The coarse labels usually do not need manual annotation and can be automatically generated from the fine labels, e.g., through WordNet [1] or word embeddings [2]. Since it barely increases the annotation cost while bringing substantial benefit, HIC is of realistic value and has drawn great research interest [3, 4].

This paper explores a novel hierarchical prompting mechanism that imitates human visual recognition for HIC. Specifically, a person may confuse two close visual concepts (e.g., China Rose and Rose Peace) when the scope of interest is large (e.g., the whole Plantae). However, given a prompt narrowing down the category range (e.g., the rose family), the person can shift his/her focus to the subtle variations within the coarse class. We duplicate this procedure for deep visual recognition based on the transformer prompting technique.

Work done during Wenhao Wang's internship at Baidu Inc. Corresponding author.

37th Conference on Neural Information Processing Systems (NeurIPS 2023).
Figure 1: The comparison between Vision Transformer (ViT) and the proposed Transformer with Hierarchical Prompting (TransHP). In (a), ViT attends to the overall foreground region and recognizes the goldfish among the 1000 classes in ImageNet. In (b), TransHP uses an intermediate block to recognize the input image as belonging to the fish family and then injects the corresponding prompt. Afterward, the last block attends to the face and crown, which are particularly informative for distinguishing the goldfish from other fish species. Please refer to Fig. 5 for more visualizations. Note that TransHP may have multiple prompting blocks corresponding to a multi-level hierarchy. (Panels: 1) learning prompt tokens; 2) predicting coarse classes; 3) absorbing prompt tokens.)

Transformer prompting typically uses prompts (implemented as tokens or vectors) to adapt a pre-trained transformer to different downstream tasks [5, 6, 7], domains [8], etc. In this paper, we inject a coarse-class prompt into the intermediate stage of a transformer. The injected coarse-class prompt then modifies the subsequent feature extraction for this specific coarse class, yielding the so-called hierarchical prompting. To the best of our knowledge, explicitly injecting the coarse-class information as a prompt has never been explored in the HIC community.

We model our hierarchical prompting mechanism as a Transformer with Hierarchical Prompting (TransHP). Fig. 1 compares our TransHP against a popular transformer backbone, ViT [9]. TransHP consists of three steps: 1) TransHP learns a set of prompt tokens to represent all the coarse classes and selects an intermediate block as the prompting block for injecting the prompts. 2) The prompting block on-the-fly predicts the coarse class of the input image. 3) The prompting block injects the prompt token of the predicted class (i.e., the target prompt token) into the intermediate feature. Specifically, TransHP concatenates the prompt tokens with the feature tokens (i.e., the class token and the patch tokens) from the preceding block, and then feeds them into the prompting block, where the feature tokens absorb information from the target prompt through cross-attention.³ Although TransHP is based on the prompting mechanism of the transformer, it has fundamental differences from prior transformer prompting techniques. A detailed comparison is in Section 2.

We hypothesize that this hierarchical prompting encourages TransHP to dynamically focus on the subtle differences among descendant classes. Fig. 1 validates our hypothesis by visualizing the attention map of the final-block class token. In Fig. 1 (a), given a goldfish image, the baseline model (ViT) attends to the whole body to recognize it among the entire 1000 classes in ImageNet. In contrast, in Fig. 1 (b), since the intermediate block has already received the prompt of "fish", TransHP mainly attends to the face and crown, which are particularly informative for distinguishing the goldfish from other fish species. Please refer to Section 4.4 for more visualization examples.
We conduct extensive experiments on multiple image classification datasets (e.g., ImageNet [10] and iNaturalist [11]) and show that hierarchical prompting improves the accuracy, data efficiency, and explainability of the transformer: (1) Accuracy. TransHP brings consistent improvement on multiple popular transformer backbones and five image classification datasets. For example, on ImageNet, TransHP improves ViT-B/16 [9] by +2.83% top-1 accuracy. (2) Data efficiency. While reducing the training data inevitably compromises the accuracy, TransHP maintains better resistance against the insufficient-data problem. For example, when we reduce the training data of ImageNet to 10%, TransHP enlarges its improvement over the baseline to +12.69%. (3) Explainability. Through visualization, we observe that the proposed TransHP shares some patterns with human visual recognition [12, 13], e.g., taking an overview for coarse recognition and then focusing on some critical local regions for the subsequent recognition after prompting. Moreover, TransHP also performs favorably against prior HIC methods, showing that TransHP exploits the hierarchical information well.

³In practice, the absorption is in a soft manner, which assigns all the prompt tokens soft weights.

## 2 Related Works

Prior HIC methods have never explored the prompting mechanism. We note that the prompting technique was not introduced to the computer vision community until as recently as 2022 [14, 15, 16, 17, 18]. The two most recent state-of-the-art HIC methods of 2022 are not related to the prompting technique either. Specifically, Guided [3] integrates a cost-matrix-defined metric into the supervision of a prototypical network. HiMulConE [4] builds an embedding space in which the distance between two classes is roughly consistent with the hierarchy (e.g., two sibling classes sharing the same ancestor are relatively close, and classes with different ancestors are far away). Some earlier works [19, 20, 21] are valuable; however, they are also not directly related to the topic of prompting.

A deeper difference resulting from the prompting mechanism is how the mapping function of the deep model is learned. Specifically, a deep visual recognition model can be viewed as a mapping function from the raw image space to the label space. All these prior methods learn a single shared mapping for all the images to be recognized. In contrast, the proposed TransHP uses the coarse-class prompt to condition itself (from an intermediate block). It can be viewed as specifying an individual mapping for each coarse class, yielding a set of mapping functions. Importantly, TransHP makes all these mapping functions share the same transformer and conditions the single transformer into different mapping functions through the prompting mechanism.

Prompting was first proposed for NLP tasks [22, 23, 24] and has since drawn research interest from the computer vision community, e.g., continual learning [14, 15], image segmentation [16], and neural architecture search [17]. VPT [18] focuses on how to efficiently fine-tune pre-trained ViT models for downstream tasks. Prompting can efficiently adapt transformers to different tasks or domains while keeping the transformer's parameters untouched. Based on the prompting mechanism, our hierarchical prompting makes some novel explorations w.r.t. the prompting objective, prompting structure, prompt selection manner, and training process:
1) Objective: previous methods usually prompt for different tasks or different domains. In contrast, TransHP prompts for coarse classes in the hierarchy, in analogy to the hierarchical prompting in human visual recognition. 2) Structure: previous methods usually inject prompt tokens to condition the whole model. In contrast, in TransHP, the bottom blocks are completely shared, and the prompt tokens are injected into the intermediate blocks to condition the subsequent inference. Therefore, the prompting follows a hierarchical structure in accordance with the semantic hierarchy under consideration. 3) Prompt selection: TransHP pre-pends all the prompt tokens for the different coarse classes and autonomously selects the prompt of interest, which is also new (as detailed in Section 3.4). 4) Training process: the prompting technique usually consists of two stages, i.e., pre-training a base model and then learning the prompts for novel downstream tasks. When learning the prompt, the pre-trained model is usually frozen. This pipeline is different from our end-to-end pipeline, i.e., there is no further fine-tuning after training.

## 3 Transformer with Hierarchical Prompting

We first revisit a basic transformer for visual recognition (ViT [9]) and the general prompting technique in Section 3.1. Afterward, we illustrate how to reshape an intermediate block of the backbone into a hierarchical prompting block for TransHP in Section 3.2. Finally, we investigate how the prompting layer absorbs the prompt tokens into the feature tokens in Section 3.4.

Figure 2: (i) A prompting block in TransHP. Instead of manually selecting the prompt of the coarse class, the prompting block pre-pends the whole prompt pool consisting of M prompts (M is the number of coarse classes) and performs autonomous selection. Specifically, it learns to predict the coarse class (Section 3.2) and spontaneously selects the corresponding prompt for absorption through soft weighting (Section 3.4), i.e., the predicted class has the largest absorption weight. (ii) Autonomous prompt selection. We visualize the absorption weights of all 20 coarse-class prompts for some CIFAR-100 images. It shows how TransHP selects the prompts when the coarse-class prediction is correct (a and b), ambiguous (c and d), and incorrect (e and f), respectively. The red and green columns correspond to the ground-truth (GT) class and the false classes, respectively. The detailed investigation is in Section 3.4.

### 3.1 Preliminaries

Vision Transformer (ViT) first splits an image into $N$ patches $\{x_i \in \mathbb{R}^{3 \times P \times P} \mid i = 1, 2, \ldots, N\}$, where $P \times P$ is the patch size, and then embeds each patch into a $C$-dimensional embedding by $x_i = \mathrm{Embed}(x_i)$. Afterward, ViT concatenates a class token $x^0_{cls} \in \mathbb{R}^C$ to the patch tokens and feeds them into the stacked transformer blocks, which is formulated as:

$$\left[x^l_{cls}, X^l\right] = B_l\left(\left[x^{l-1}_{cls}, X^{l-1}\right]\right), \quad l = 1, 2, \ldots, L, \tag{1}$$

where $x^l_{cls}$ and $X^l$ are the class token and the patch tokens after the $l$-th transformer block $B_l$, respectively. After the total of $L$ blocks, the final state of the class token ($x^L_{cls}$) is viewed as the deep representation of the input image and is used for class prediction. In this paper, we call the concatenation of the class token and patch tokens (i.e., $[x^{l-1}_{cls}, X^{l-1}]$) the feature tokens.
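To make the notation concrete, the following is a minimal PyTorch-style sketch of the token pipeline in Eqn. 1. It is our own illustration under simplifying assumptions (e.g., `nn.TransformerEncoderLayer` stands in for a full ViT block, and positional embeddings are omitted), not the authors' released code.

```python
import torch
import torch.nn as nn

class ToyViT(nn.Module):
    """A minimal sketch of the ViT token pipeline in Eqn. 1.
    All module and variable names are illustrative, not from the
    official TransHP repository."""

    def __init__(self, dim=384, depth=12, heads=6):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))  # x^0_cls
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            for _ in range(depth)
        ])

    def forward(self, patch_tokens):
        # patch_tokens: (B, N, C), the embedded patches x_i = Embed(x_i)
        B = patch_tokens.size(0)
        tokens = torch.cat([self.cls_token.expand(B, -1, -1), patch_tokens], dim=1)
        for blk in self.blocks:   # Eqn. 1: [x^l_cls, X^l] = B_l([x^{l-1}_cls, X^{l-1}])
            tokens = blk(tokens)
        return tokens[:, 0]       # x^L_cls, the deep representation of the image
```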
Prompting was first introduced in Natural Language Processing to switch the same transformer model between different tasks by inserting a few hint words into the input sentences. More generally, it conditions the transformer for different tasks, different domains, etc., without changing the transformer parameters, only the prompts. To condition the model for the $k$-th task (or domain), a popular practice is to select a prompt $p_k$ from a prompt pool $P = \{p_0, p_1, \ldots\}$ and pre-pend it to the first block. Correspondingly, Eqn. 1 turns into:

$$\left[x^l_{cls}, X^l, p^l_k\right] = B_l\left(\left[x^{l-1}_{cls}, X^{l-1}, p^{l-1}_k\right]\right), \tag{2}$$

where $p_k \in P$ (the superscript is omitted) conditions the transformer for the $k$-th task.

### 3.2 The Prompting Block of TransHP

The proposed TransHP selects an intermediate transformer block $B_l$ and reshapes it into a prompting block for injecting the coarse-class information. Let us assume that there are $M$ coarse classes. Correspondingly, TransHP uses $M$ learnable prompt tokens $P_M = [p_0, p_1, \ldots, p_{M-1}]$ to represent these coarse classes. Our intention is to inject $p_k$ into the prompting layer if the input image belongs to the $k$-th coarse class. Instead of manually selecting the $k$-th prompt $p_k$ (as in Eqn. 2), TransHP pre-pends the whole prompt pool $P_M = [p_0, p_1, \ldots, p_{M-1}]$ to the prompting layer and makes the prompting layer select $p_k$ automatically for absorption. Specifically, through our design, TransHP learns to automatically 1) predict the coarse class, and 2) select the corresponding prompt for absorption through "soft weighting", i.e., high absorption of the target prompt and low absorption of the non-target prompts. The learning procedure is illustrated in Fig. 2 (i). The output of the prompting layer is derived by:

$$\left[x^l_{cls}, X^l, \hat{P}_M\right] = B_l\left(\left[x^{l-1}_{cls}, X^{l-1}, P_M\right]\right), \tag{3}$$

where $\hat{P}_M$ is the output state of the prompt pool $P_M$ through the $l$-th transformer block $B_l$. $\hat{P}_M$ is not forwarded into the following block. Instead, we use $\hat{P}_M$ to predict the coarse class of the input image. To this end, we compare $\hat{P}_M$ against a set of coarse-class prototypes and derive the corresponding similarity scores by:

$$S = \{\hat{p}_i^T w_i\}, \quad i = 1, 2, \ldots, M, \tag{4}$$

where $w_i$ is the learnable prototype of the $i$-th coarse class. We further use a softmax plus cross-entropy loss to supervise the similarity scores, which is formulated as:

$$\mathcal{L}_{coarse} = -\log \frac{\exp\left(\hat{p}_y^T w_y\right)}{\sum_{i=1}^{M} \exp\left(\hat{p}_i^T w_i\right)}, \tag{5}$$

where $y$ is the coarse label of the input image. We note there is a difference between the above coarse classification and standard classification: standard classification usually compares a single representation against a set of prototypes. In contrast, our coarse classification conducts a set-to-set comparison (i.e., $M$ tokens against $M$ prototypes).
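The following is a hedged sketch of a prompting block implementing Eqns. 3-5, under our own naming and the assumption that the prompt pool is appended after the feature tokens; the released TransHP code may differ in detail. Note the set-to-set comparison: the $i$-th output prompt state is scored only against the $i$-th prototype.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptingBlock(nn.Module):
    """Sketch of a TransHP prompting block (Eqns. 3-5) for M coarse classes.
    Illustrative, not the authors' implementation."""

    def __init__(self, dim=384, heads=6, num_coarse=20):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.prompt_pool = nn.Parameter(torch.randn(1, num_coarse, dim) * 0.02)  # P_M
        self.prototypes = nn.Parameter(torch.randn(num_coarse, dim) * 0.02)      # w_i

    def forward(self, feature_tokens, coarse_labels=None):
        # feature_tokens: (B, 1 + N, C), i.e., [x_cls, X] from the preceding block
        B, M = feature_tokens.size(0), self.prompt_pool.size(1)
        tokens = torch.cat([feature_tokens, self.prompt_pool.expand(B, -1, -1)], dim=1)
        out = self.block(tokens)                          # Eqn. 3
        feat_out, prompt_out = out[:, :-M], out[:, -M:]   # \hat{P}_M is not forwarded
        # Eqn. 4: set-to-set comparison, i-th prompt state vs. i-th prototype
        scores = (prompt_out * self.prototypes.unsqueeze(0)).sum(-1)  # (B, M)
        loss = None
        if coarse_labels is not None:
            loss = F.cross_entropy(scores, coarse_labels)  # Eqn. 5
        return feat_out, scores, loss
```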
### 3.3 Overall Structure

Multiple transformer blocks for a multi-level hierarchy. Some HIC datasets (e.g., ImageNet-1000) have a multi-level hierarchy. Following the coarse-to-fine multi-level hierarchy, TransHP may stack multiple prompting blocks. Each prompting block is responsible for a single level in the hierarchy, and the prompting block for the coarser level is placed closer to the bottom of the transformer. The detailed structure is shown in Appendix A. Correspondingly, the overall loss function is formulated as:

$$\mathcal{L} = \mathcal{L}_{fine} + \sum_{l} \lambda_l \mathcal{L}^l_{coarse}, \tag{6}$$

where $\mathcal{L}_{fine}$ is the final classification loss, $\mathcal{L}^l_{coarse}$ is the coarse-level loss from the $l$-th prompting block, and $\lambda_l$ is the balance parameter. Through the above training, the prompting layer explicitly learns the coarse-class prompts and predicts the coarse class of the input image.

The position of the prompting block. Currently, we do not have an exact position scheme for inserting the prompting block, given that the hierarchy of different datasets varies a lot. However, we recommend a qualitative principle for setting the position: if the number of coarse classes is small (large), the position of the corresponding prompting block should be close to the bottom (top). Based on this principle, we can obtain a roughly optimal position scheme through cross-verification. We empirically find that TransHP is robust to the position scheme to some extent (Fig. 7 in the Appendix). Table 1 summarizes the settings of the balance parameters and the positions of the prompting layers.

Table 1: The balance parameters used for $\mathcal{L}_{coarse}$ at different levels (the last 1 in each row is the balance parameter for the final classification). "-" denotes that the transformer layer does not have prompt tokens.

| λ | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|----|----|
| ImageNet | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.15 | 0.15 | 0.15 | 0.15 | 1 | 1 | 1 |
| iNaturalist-2018 | - | - | - | - | - | - | - | - | - | - | 1 | 1 |
| iNaturalist-2019 | - | - | - | - | - | - | - | - | - | - | 1 | 1 |
| CIFAR-100 | - | - | - | - | - | - | - | - | - | - | 1 | 1 |
| DeepFashion | - | - | - | - | - | - | - | - | - | 0.5 | 1 | 1 |

### 3.4 TransHP Spontaneously Selects the Target Prompt

Recall that we do not manually select the coarse-class prompt for TransHP. Instead, we concatenate the entire prompt pool, i.e., $P_M = \{p_1, p_2, \ldots, p_M\}$, with the feature tokens. In this section, we show that after TransHP is trained to convergence, the prompting block spontaneously selects the target prompt $p_k$ ($k$ is the predicted coarse class) for absorption and largely neglects the non-target prompts $p_{i \neq k}$.

Specifically, the self-attention in the transformer makes each token absorb information from all the tokens (i.e., the feature tokens and the prompt tokens). In Eqn. 3, given a feature token $x \in [x_{cls}, X]$ (the superscript is omitted for simplicity), we derive its absorption weight on the $i$-th prompt token from the self-attention, which is formulated as:

$$w(x \rightarrow p_i) = \frac{\exp\left(Q(x)^T K(p_i)/\sqrt{d}\right)}{\sum_{t \in [x_{cls}, X, P_M]} \exp\left(Q(x)^T K(t)/\sqrt{d}\right)}, \tag{7}$$

where $Q(\cdot)$ and $K(\cdot)$ project the input tokens into queries and keys, respectively, and $\sqrt{d}$ is the scale factor. Based on the absorption weights, we consider two statistics:

- The absorption weight of the target prompt, i.e., $w(x \rightarrow p_k)$. It indicates the importance of the target prompt among all the tokens.
- The absorption ratio between the target and the largest non-target prompt, i.e., $R_{(T:N)} = w(x \rightarrow p_k)/\max\{w(x \rightarrow p_{i \neq k})\}$. It measures the importance of the target prompt compared with the most prominent non-target prompt.

Figure 3: TransHP gradually focuses on the predicted coarse class when absorbing the prompts, yielding an autonomous selection. (a) The absorption weight of the target prompt. (b) The ratio of the target prompt weight to the largest non-target prompt weight. The dataset is CIFAR-100. We visualize these statistics on both the training and validation sets.

Fig. 3 visualizes these statistics at each training epoch on CIFAR-100 [25], from which we make two observations. Remark 1: The importance of the target prompt gradually increases to a high level. From Fig. 3 (a), it is observed that the absorption weight of the target prompt undergoes a rapid increase and finally reaches about 0.09. We note that 0.09 is significantly larger than the average weight of 1/217 (1 class token + 196 patch tokens + 20 prompt tokens). Remark 2: The target prompt gradually dominates among all the prompts. From Fig. 3 (b), it is observed that the absorption weight of the target prompt gradually becomes much larger than the non-target prompt weights (about 4×).
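For readers who want to reproduce the statistics in Fig. 3, here is a hedged sketch of computing both quantities from a block's attention matrix. The tensor layout (batch, query, key) and the assumption that the prompt tokens occupy the last $M$ key positions follow our own sketches above, not the official code.

```python
import torch

def absorption_stats(attn, num_prompts, target_idx):
    """Sketch of the two statistics of Section 3.4 (based on Eqn. 7).
    attn:        (B, T, T) attention matrix of the prompting block,
                 softmax-normalized over keys (last dimension).
    num_prompts: M, assuming prompt tokens sit at the last M positions.
    target_idx:  (B,) index of the target (predicted coarse-class) prompt.
    """
    cls_row = attn[:, 0, :]                    # absorption weights of the class token
    prompt_w = cls_row[:, -num_prompts:]       # w(x -> p_i) for i = 1..M
    target_w = prompt_w.gather(1, target_idx.unsqueeze(1)).squeeze(1)   # w(x -> p_k)
    masked = prompt_w.scatter(1, target_idx.unsqueeze(1), float('-inf'))
    ratio = target_w / masked.max(dim=1).values  # R(T:N): target vs. best non-target
    return target_w, ratio
```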
Combining the above two observations, we infer that during training, the prompting block of TransHP learns to focus on the target prompt $p_k$ (within the entire prompt pool $P_M$) for prompt absorption (Remark 2), yielding a soft-weighted selection of the target prompt. This dynamic absorption of the target prompt largely impacts the self-attention in the prompting layer (Remark 1) and conditions the subsequent feature extraction.

Fig. 2 (ii) further visualizes some instances from CIFAR-100 (20 coarse classes) for an intuitive understanding of the prompt absorption. We note that the coarse prediction may sometimes be incorrect. Therefore, we use the red (green) columns to mark the prompts of the true (false) coarse classes, respectively. In (a) and (b), TransHP correctly recognizes the coarse class of the input images and makes an accurate prompt selection: the prompt of the true class has the largest absorption weight and thus dominates the prompt absorption. In (c) and (d), TransHP encounters some confusion in distinguishing two similar coarse classes (due to their inherent similarity or image blur), and thus makes an ambiguous selection. In (e) and (f), TransHP makes an incorrect coarse-class prediction and correspondingly selects the prompt of a false class as the target prompt.

Figure 4: Comparison between TransHP and its variants on ImageNet. 1) A variant uses the coarse labels to supervise the class token in the intermediate layers ("No prompts"). 2) A variant injects additional tokens without supervision from the coarse-class labels ("No coarse labels"). 3) TransHP injects coarse-class information through prompt tokens and achieves the largest improvement ("Ours").

Table 2: The top-1 accuracy of TransHP on other datasets (besides ImageNet). "w Pre" and "w/o Pre" denote that the models are trained from ImageNet pre-training or from scratch, respectively.

| Accuracy (%) | iNaturalist-2018 | iNaturalist-2019 | CIFAR-100 | DeepFashion |
|---|---|---|---|---|
| Baseline (w/o Pre) | 51.07 | 57.33 | 61.77 | 83.42 |
| TransHP (w/o Pre) | 53.22 | 59.24 | 67.09 | 85.72 |
| Baseline (w Pre) | 63.01 | 69.31 | 84.98 | 88.54 |
| TransHP (w Pre) | 64.21 | 71.62 | 86.85 | 89.93 |

## 4 Experiments

### 4.1 Implementation Details

Datasets. We evaluate the proposed TransHP on five datasets with hierarchical labels, i.e., ImageNet [10], iNaturalist-2018 [11], iNaturalist-2019 [11], CIFAR-100 [25], and DeepFashion-inshop [26]. The hierarchical labels of ImageNet are from WordNet [1], with details illustrated on Mike's website. Both iNaturalist-2018 and iNaturalist-2019 have two-level hierarchical annotations: a super-category (14/6 classes, respectively) for the genus, and 8,142/1,010 categories for the species. CIFAR-100 also has two-level hierarchical annotations: the coarse level has 20 classes, and the fine level has 100 classes. DeepFashion-inshop is a retrieval dataset with a three-level hierarchy. To adapt it to the classification task, we randomly select half of the images from each class for training and use the remaining half for validation. Both the training and validation sets contain 2 coarse classes, 17 middle classes, and 7,982 fine classes.
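As a concrete illustration of how WordNet can supply such coarse labels (the paper points to an external source for the exact ImageNet hierarchy), here is a hedged sketch using NLTK's WordNet interface. The fixed-depth cut-off and function name are our own illustrative choices, not the grouping rule actually used for the experiments.

```python
from nltk.corpus import wordnet as wn  # requires nltk and its 'wordnet' corpus

def coarse_synset(wnid: str, depth: int = 5) -> str:
    """Map an ImageNet wnid (e.g., 'n01443537' for goldfish) to an ancestor
    synset at a fixed depth of its first hypernym path. The fixed depth is
    an illustrative heuristic, not the paper's exact procedure."""
    syn = wn.synset_from_pos_and_offset(wnid[0], int(wnid[1:]))
    path = syn.hypernym_paths()[0]   # root ('entity.n.01') -> ... -> syn
    return path[min(depth, len(path) - 1)].name()

# Example: prints an ancestor on the goldfish hypernym path
print(coarse_synset('n01443537'))
```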
Training and inference details. Our TransHP adopts an end-to-end training process. We use a lightweight transformer as our major baseline, which has 6 heads (half of ViT-B) and 12 blocks. The dimension of the embedding and the prompt tokens is 384 (half of ViT-B). We train it for 300 epochs on 8 Nvidia A100 GPUs with PyTorch. The base learning rate is 0.001 with a cosine learning rate schedule. We set the batch size, the weight decay, and the number of warm-up epochs to 1,024, 0.05, and 5, respectively. Importantly, TransHP only adds a small overhead to the baseline. Specifically, compared with the baseline (22.05 million parameters), our TransHP only adds 0.60 million parameters (about +2.7%) for ImageNet. When using ViT-B as the backbone, our TransHP only adds +1.4% parameters. Due to the increase in parameters and the extra cost of the backward pass for the several $\mathcal{L}_{coarse}$ losses, the training time increases by 15% on our baseline and by 12% on ViT-B for ImageNet. For inference, the computation overhead is very light: the baseline and TransHP both take around 50 seconds to finish the ImageNet validation with 8 A100 GPUs.

Table 3: TransHP brings consistent improvement with various backbones on ImageNet. (The performance of our reproduced ViT-B/16 and ViT-L/16 is slightly worse than the 77.91 and 76.53 reported in the original paper [9], respectively.)

| Accuracy (%) | ViT-B/16 | ViT-L/16 | DeiT-S | DeiT-B |
|---|---|---|---|---|
| Baseline | 76.68 | 76.37 | 79.82 | 81.80 |
| TransHP | 79.51 | 78.80 | 80.55 | 82.35 |

Table 4: Comparison between TransHP and the two most recent state-of-the-art methods. We replace their CNN backbones with the relatively strong transformer backbone for a fair comparison.

| Accuracy (%) | ImageNet | iNat-2018 | iNat-2019 | CIFAR-100 | DeepFashion |
|---|---|---|---|---|---|
| Baseline | 76.21 | 63.01 | 69.31 | 84.98 | 88.54 |
| Guided | 76.05 | 63.11 | 69.66 | 85.10 | 88.32 |
| HiMulConE | 77.52 | 63.46 | 70.87 | 85.43 | 88.87 |
| TransHP | 78.65 | 64.21 | 71.62 | 86.85 | 89.93 |

### 4.2 TransHP Improves the Accuracy

Improvement on ImageNet and the ablation study. We validate the effectiveness of TransHP on ImageNet and conduct an ablation study by comparing TransHP against two variants, as well as the baseline. As illustrated in Fig. 4, the two variants are: 1) we do not inject any prompts, but use the coarse labels to supervise the class token in the intermediate layers; similar to the final fine-level classification, the class token is also used for coarse-level classification. 2) we inject learnable tokens, but do not use the coarse labels as their supervision signal; therefore, these tokens do not contain any coarse-class information. From Fig. 4, we draw three observations: 1) Comparing TransHP against the baseline, we observe a clear improvement of +2.44% top-1 accuracy, confirming the effectiveness of TransHP for ImageNet classification. 2) Variant 1 ("No prompts") achieves some improvement (+1.37%) over the baseline as well, but is still lower than TransHP by 1.07%. This shows that using the hierarchical labels to supervise the intermediate state of the class token is also beneficial. However, since this variant does not absorb the prompting information, the improvement is relatively small. We thus infer that hierarchical prompting is a superior approach for utilizing the hierarchical labels. 3) Variant 2 ("No coarse labels") barely achieves any improvement over the baseline, though it adds the same number of parameters as TransHP. This indicates that the benefit of TransHP is not due to the addition of some trainable tokens; instead, the coarse-class information injected through the prompt tokens is what matters.
TransHP gains consistent improvements on more datasets. Besides the most commonly used dataset, ImageNet, we also conduct experiments on other datasets, i.e., iNaturalist-2018, iNaturalist-2019, CIFAR-100, and DeepFashion. For these datasets, we use two settings, i.e., training from scratch (w/o Pre) and fine-tuning from the ImageNet-pretrained model (w Pre). The experimental results are shown in Table 2, from which we draw two observations. First, under both settings, TransHP brings consistent improvement over the baselines. Second, when there is no pre-training, the improvement is even larger, especially on small datasets. For example, on the smallest dataset, CIFAR-100, the improvements under "w/o Pre" and "w Pre" are +5.32% and +1.87%, respectively. We infer this is because TransHP considerably alleviates the data-hungry problem of the transformer, which is further validated in Section 4.3.

TransHP improves various backbones. Besides the light transformer baseline used in all the other parts of this section, Table 3 evaluates the proposed TransHP on more backbones, i.e., ViT-B/16 [9], ViT-L/16 [9], DeiT-S [27], and DeiT-B [27]. We observe that for ImageNet classification, TransHP gains +2.83%, +2.43%, +0.73%, and +0.55% improvement on these four backbones, respectively.

Table 5: Comparison between TransHP and prior state-of-the-art hierarchical classification methods under the insufficient-data scenario. "N%" means using N% of the ImageNet training data.

| Accuracy (%) | 100% | 50% | 20% | 10% |
|---|---|---|---|---|
| Baseline | 76.21 | 67.87 | 44.60 | 25.24 |
| Guided | 76.05 | 67.74 | 45.02 | 25.67 |
| HiMulConE | 77.52 | 69.23 | 48.50 | 30.76 |
| TransHP | 78.65 | 70.74 | 53.71 | 37.93 |

Comparison with state-of-the-art hierarchical classification methods. We compare the proposed TransHP with the two most recent hierarchy-based methods, i.e., Guided [3] and HiMulConE [4]. We do not include more competing methods because most prior works are based on convolutional backbones and are thus not directly comparable with ours. Since experiments on large-scale datasets are very time-consuming, we only select the most recent state-of-the-art methods and re-implement them on the same transformer backbone (based on their released code). The experimental results are shown in Table 4. It is clearly observed that the proposed TransHP achieves higher improvement and surpasses the two competing methods. For example, on the five datasets, TransHP surpasses the most recent state-of-the-art HiMulConE by +1.13% (ImageNet), +0.75% (iNat-2018), +0.75% (iNat-2019), +1.42% (CIFAR-100), and +1.06% (DeepFashion), respectively. We also notice that while Guided achieves considerable improvement on CNN backbones, its improvement on our transformer backbone is trivial. This is reasonable because improving over a higher baseline (i.e., the transformer backbone) is relatively difficult. This observation is consistent with [4].

### 4.3 TransHP Improves Data Efficiency

We investigate TransHP under the data-scarce scenario. To this end, we randomly select 1/10, 1/5, and 1/2 of the training data from each class in ImageNet (while keeping the validation set untouched). The results are summarized in Table 5, from which we draw three observations. First, as the training data decreases, all the methods undergo a significant accuracy drop. This is reasonable because deep learning methods are data-hungry by nature, and arguably this data-hungry problem is even more pronounced in transformers [9].
Second, compared with the baseline and the two competing hierarchy-based methods, TransHP presents much higher resistance against the data decrease. For example, when the training data is reduced from 100% to 10%, the accuracy drops of the baseline and the two competing methods are 50.97%, 50.38%, and 46.76%, respectively. In contrast, the accuracy drop of the proposed TransHP (40.72%) is significantly smaller. Third, since TransHP undergoes a relatively smaller accuracy decrease, its superiority in the low-data regime is even larger. For example, it surpasses the most competitive method, HiMulConE, by 1.13%, 1.51%, 5.21%, and 7.17% under the 100%, 50%, 20%, and 10% training data, respectively. Combining all these observations, we conclude that TransHP improves data efficiency.

This efficiency can be explained intuitively from two perspectives, one philosophical and the other technical. Philosophical perspective: imagine knowledge as the essence of everything humans have summarized over time. When you possess knowledge, you have the distilled essence of myriad experiences and learnings. The proposed method leverages this accumulated knowledge; in scenarios where data is limited, the power of such distilled knowledge becomes even more pronounced. Technical perspective: think of data not just as isolated pieces of information but in categories. Even when the dataset seems limited, there can still be ample samples within the broader categories, so accuracy at the coarse level can be achieved rapidly. Once the accuracy at this coarse level is established, the model can use this foundation to prompt further. It is like planting a tree: you start with a strong base and then branch out.

### 4.4 TransHP Improves Model Explainability

We analyze the receptive field of the class token to understand how TransHP reaches its prediction. Basically, the transformer integrates information across the entire image according to the attention map, yielding its receptive field. Therefore, we visualize the attention map of the class token in Fig. 5.

Figure 5: Visualization of the attention map for analyzing the receptive field. For TransHP, we visualize a block before and after receiving the prompt (i.e., coarse and fine), respectively. The coarse block favors an overview for coarse recognition, and the fine block further filters out the non-relevant regions after receiving the prompt. (Examples include hen/bird, doormat/covering, dome/protective covering, tench/fish, ostrich, house finch, Bernese mountain dog, and German shepherd.)

For the proposed TransHP, we visualize the attention map at the prompting block (which handles the coarse-class information) and the last block (which handles the fine-class information). For the ViT baseline, we only visualize the attention map of the last block. We draw two observations from Fig. 5. First, TransHP has a different attention pattern from the baseline. The baseline attention generally covers the entire foreground, which is consistent with the observation in [9]. In contrast, in TransHP, although the coarse block attends to the overall foreground as well, the fine block concentrates its attention on relatively small and critical regions, in step with the prompting and predicting procedure.
For example, given the hen image in the second row (left), TransHP attends to the overall foreground before receiving the coarse-class prompt (i.e., "bird") and then focuses on the eyes and bill to recognize the hen within "bird". Second, TransHP shows a better capacity for ignoring redundant and non-relevant regions. For example, given the doormat image in the fourth row (right), TransHP ignores the "GO AWAY" decoration after receiving the coarse-class prompt of "covering". A similar observation holds for the third row (right), where TransHP ignores the walls when recognizing the dome within "protective covering".

## 5 Conclusion

This paper proposes a novel Transformer with Hierarchical Prompting (TransHP) for image classification. Before giving its final prediction, TransHP predicts the coarse class with an intermediate layer and correspondingly injects the coarse-class prompt to condition the subsequent inference. An intuitive effect of our hierarchical prompting is that TransHP favors an overview of the object for coarse prediction and then concentrates its attention on some critical local regions after receiving the prompt, which is similar to human visual recognition. We validate the effectiveness of TransHP through extensive experiments and hope that hierarchical prompting reveals a new insight for understanding transformers.

Limitation. We presently focus on the image classification task, while some other tasks also have the potential to benefit from hierarchical annotations, e.g., semantic segmentation. Therefore, we would like to extend TransHP to more visual recognition tasks in the future.

## References

[1] George A. Miller. WordNet: An electronic lexical database. MIT Press, 1998.

[2] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In International Conference on Learning Representations Workshop, 2013.

[3] Loic Landrieu and Vivien Sainte Fare Garnot. Leveraging class hierarchies with metric-guided prototype learning. In British Machine Vision Conference (BMVC), 2021.

[4] Shu Zhang, Ran Xu, Caiming Xiong, and Chetan Ramaiah. Use all the labels: A hierarchical multi-label contrastive learning framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16660-16669, 2022.

[5] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021.

[6] Yuxian Gu, Xu Han, Zhiyuan Liu, and Minlie Huang. PPT: Pre-trained prompt tuning for few-shot learning. arXiv preprint arXiv:2109.04332, 2021.

[7] Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. Towards a unified view of parameter-efficient transfer learning. arXiv preprint arXiv:2110.04366, 2021.

[8] Chunjiang Ge, Rui Huang, Mixue Xie, Zihang Lai, Shiji Song, Shuang Li, and Gao Huang. Domain adaptation via prompt learning. arXiv preprint arXiv:2202.06687, 2022.

[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.

[10] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248-255. IEEE, 2009.
[11] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The iNaturalist species classification and detection dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8769-8778, 2018.

[12] Omar I. Johannesson, Ian M. Thornton, Irene J. Smith, Andrey Chetverikov, and Arni Kristjánsson. Visual foraging with fingers and eye gaze. i-Perception, 7(2):2041669516637279, 2016.

[13] M. Berk Mirza, Rick A. Adams, Christoph Mathys, and Karl J. Friston. Human visual exploration reduces uncertainty about the sensed world. PLoS ONE, 13(1):e0190429, 2018.

[14] Zifeng Wang, Zizhao Zhang, Chen-Yu Lee, Han Zhang, Ruoxi Sun, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, and Tomas Pfister. Learning to prompt for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 139-149, 2022.

[15] Zifeng Wang, Zizhao Zhang, Sayna Ebrahimi, Ruoxi Sun, Han Zhang, Chen-Yu Lee, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, et al. DualPrompt: Complementary prompting for rehearsal-free continual learning. In European Conference on Computer Vision, 2022.

[16] Timo Lüddecke and Alexander Ecker. Image segmentation using text and image prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7086-7096, 2022.

[17] Yuanhan Zhang, Kaiyang Zhou, and Ziwei Liu. Neural prompt search. 2022.

[18] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In European Conference on Computer Vision (ECCV), 2022.

[19] Jong-Chyi Su and Subhransu Maji. Semi-supervised learning with taxonomic labels. arXiv preprint arXiv:2111.11595, 2021.

[20] Kanishk Jain, Shyamgopal Karthik, and Vineet Gandhi. Test-time amendment with a coarse classifier for fine-grained classification. arXiv preprint arXiv:2302.00368, 2023.

[21] Zhicheng Yan, Hao Zhang, Robinson Piramuthu, Vignesh Jagadeesh, Dennis DeCoste, Wei Di, and Yizhou Yu. HD-CNN: Hierarchical deep convolutional neural networks for large scale visual recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 2740-2748, 2015.

[22] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877-1901, 2020.

[23] Tianyu Gao, Adam Fisch, and Danqi Chen. Making pre-trained language models better few-shot learners. In ACL/IJCNLP (1), 2021.

[24] Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423-438, 2020.

[25] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

[26] Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. DeepFashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[27] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347-10357. PMLR, 2021.
## A Multiple layers of hierarchy

We illustrate TransHP in Fig. 6 for datasets with multiple layers of hierarchy.

Figure 6: The illustration of TransHP with multiple layers of hierarchy. $k$ and $l$ are two inner layers, and $L$ is the final layer.

## B Coarse-level classes of CIFAR-100

[0]: aquatic mammals, [1]: fish, [2]: flowers, [3]: food containers, [4]: fruit and vegetables, [5]: household electrical devices, [6]: household furniture, [7]: insects, [8]: large carnivores, [9]: large man-made outdoor things, [10]: large natural outdoor scenes, [11]: large omnivores and herbivores, [12]: medium mammals, [13]: non-insect invertebrates, [14]: people, [15]: reptiles, [16]: small mammals, [17]: trees, [18]: vehicles-1, and [19]: vehicles-2.

## C Importance analysis of classification at different hierarchical levels

As shown in Table 1 (first row), each transformer layer is responsible for one level of classification. We remove the prompt tokens from the coarsest level to the finest level. In Fig. 7, $n$ denotes that the prompt tokens are added from the $n$-th transformer layer onward.

Figure 7: The top-1 accuracy on ImageNet w.r.t. the transformer layer from which prompt tokens are added (accuracies range from 78.50% to 78.85%). The two highest prompting layers (which do not carry too-coarse-level labels) play an important role.

We conclude that only the last two coarse-level classifications (arranged at the 9th and 10th transformer layers) contribute most to the final classification accuracy. That means: (1) the number of hierarchy levels and the number of transformer layers need not be equal; and (2) there is no need to adjust the parameters of too-coarse hierarchy levels. (Note: though the current balance parameter for the 8th transformer layer is 0.15, enlarging it to 1 brings no further improvement.)

## D Analysis of the number of coarse-level classes

As shown in Appendix B, the CIFAR-100 dataset has 20 coarse-level classes. When we combine them into 10 coarse-level classes, we have ([0-1]), ([2-17]), ([3-4]), ([5-6]), ([12-16]), ([8-11]), ([14-15]), ([9-10]), ([7-13]), and ([18-19]). When we combine them into 5 coarse-level classes, we have ([0-1-12-16]), ([2-17-3-4]), ([5-6-9-10]), ([8-11-18-19]), and ([7-13-14-15]). When we combine them into 2 coarse-level classes, we have ([0-1-7-8-11-12-13-14-15-16]) and ([2-3-4-5-6-9-10-17-18-19]). The experimental results are listed in Table 6.

Table 6: The analysis of the number of coarse-level classes on the CIFAR-100 dataset. "N-class" denotes that there are N classes for the coarse-level classification.

| Accuracy (%) | baseline | 2-class | 5-class | 10-class | 20-class |
|---|---|---|---|---|---|
| w/o Pre | 61.77 | 63.34 | 63.12 | 64.47 | 67.09 |
| w Pre | 84.98 | 86.40 | 86.35 | 86.50 | 86.85 |

We observe that: 1) Generally, using more coarse-level classes is better.
2) Using only 2 coarse-level classes still brings over 1% accuracy improvement.

## E Comparison with the "No prompts" baseline

In this section, we provide more experiments with the "No prompts" baseline; its details are shown in Fig. 4 (2). The experimental results are shown in Table 7. We find that though the "No prompts" baseline surpasses the original baseline, our TransHP still shows significant superiority over it.

Table 7: Comparison of TransHP with the original baseline and the "No prompts" baseline.

| Accuracy (%) | iNat-2018 | iNat-2019 | CIFAR-100 | DeepFashion |
|---|---|---|---|---|
| Baseline (w/o Pre) | 51.07 | 57.33 | 61.77 | 83.42 |
| No prompts (w/o Pre) | 51.88 | 58.45 | 63.78 | 84.23 |
| TransHP (w/o Pre) | 53.22 | 59.24 | 67.09 | 85.72 |
| Baseline (w Pre) | 63.01 | 69.31 | 84.98 | 88.54 |
| No prompts (w Pre) | 63.41 | 70.73 | 85.50 | 89.59 |
| TransHP (w Pre) | 64.21 | 71.62 | 86.85 | 89.93 |

## F Additional $\mathcal{L}_{coarse}$ with DeiT

We report the experimental results of adopting only $\mathcal{L}_{coarse}$ with DeiT. Note that $\mathcal{L}_{coarse}$ is imposed on the class token, as shown in Fig. 4 (2). We find that TransHP still improves over only using $\mathcal{L}_{coarse}$: compared with DeiT-S (79.82%) and DeiT-B (81.80%), "only with $\mathcal{L}_{coarse}$" achieves 79.98% and 81.76%, while TransHP achieves 80.55% and 82.35%, respectively.