# TdAttenMix: Top-Down Attention Guided Mixup

Zhiming Wang¹, Lin Gu²,³, Feng Lu¹*
¹State Key Laboratory of VR Technology and Systems, School of CSE, Beihang University; ²RIKEN AIP; ³The University of Tokyo, Japan
{zy2306418,lufeng}@buaa.edu.cn, lin.gu@riken.jp

CutMix is a data augmentation strategy that cuts and pastes image patches to mix up training data. Existing methods pick either random or salient areas, which are often inconsistent with the labels and thus misguide the training model. To the best of our knowledge, we are the first to integrate human gaze to guide CutMix. Since human attention is driven by both high-level recognition and low-level clues, we propose a controllable Top-down Attention Guided Module to obtain a general artificial attention that balances top-down and bottom-up attention. The proposed TdAttenMix then picks the patches and adjusts the label mixing ratio so as to focus on regions relevant to the current label. Experimental results demonstrate that our TdAttenMix outperforms existing state-of-the-art mixup methods across eight different benchmarks. Additionally, we introduce a new metric based on human gaze and use it to investigate the issue of image-label inconsistency. Code: https://github.com/morning12138/TdAttenMix

1 Introduction

Thanks to large amounts of data, Deep Neural Networks (DNNs) have achieved significant success in recent years across a variety of applications, including recognition (Dosovitskiy et al. 2021; Zang et al. 2022; Cui et al. 2022; Tan et al. 2022; Chen, Fan, and Panda 2021), graph learning (Xia et al. 2022; Wu et al. 2023; Cheng et al. 2022), and video processing (Liu et al. 2021a; Cui et al. 2021; Liu et al. 2021b; Zhao et al. 2022). However, the data-hungry problem (Dosovitskiy et al. 2021; Touvron et al. 2021a) leads to overfitting when training data are scarce.
Therefore, a series of data augmentation techniques called mixup have been proposed to alleviate this issue and enhance the generalization capability of DNNs. Among them, CutMix (Yun et al. 2019) is an effective strategy that randomly crops a patch from the source image and pastes it into the target image. The label is then mixed from the source and target labels in proportion to the cropped area ratio. Since the randomness in CutMix (Yun et al. 2019) ignores spatial saliency, a group of saliency-based variants (Uddin et al. 2021; Kim, Choo, and Song 2020; Walawalkar et al. 2020; Liu et al. 2022c; Dabouei et al. 2021; Chen et al. 2023) leverage bottom-up attention as a supervisory signal.

*Corresponding Author. Copyright 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Figure 1: Left: SaliencyMix vs. TdAttenMix. Since SaliencyMix crops the patch with the most salient region, it is distracted by an irrelevant dark stone. Our TdAttenMix balances top-down and bottom-up attention and thus picks salient areas consistent with the dog label. Right: Training time vs. accuracy with DeiT-S on ImageNet-1k. TdAttenMix improves performance without heavy computational overhead.

Bottom-up attention operates on raw sensory input and orients attention towards visual features of potential importance in order to calculate saliency. This process only discovers what is where in the world (Schwinn et al. 2022) and looks equally for all salient regions in the raw sensory input. Therefore, existing saliency-based variants built on bottom-up attention are easily distracted by high-saliency regions that are, in fact, irrelevant to the target label.
For instance, the source image in Figure 1 is a dog, but SaliencyMix (Uddin et al. 2021) becomes distracted by the dark rock and crops the background, including only part of the dog's ear. Human vision entails more than just the determination of what is where; it involves the development of internal representations that facilitate future actions. For instance, psychological research (Buswell 1935; Yarbus 2013; Belardinelli, Herbort, and Butz 2015) found that human gaze, initially guided by bottom-up features, can be strongly influenced by the task at hand. Consequently, recent research proposes top-down mechanisms (Shi, Darrell, and Wang 2023; Schwinn et al. 2022) that are effective in modeling human gaze patterns such as scanpaths (Schwinn et al. 2022) and in enhancing downstream recognition tasks like classification and Visual Question Answering (VQA) (Shi, Darrell, and Wang 2023).

The Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25)

Figure 2: The framework of TdAttenMix. (1) Task Adaptive Attention Guided CutMix (Sec. 4.1): compute the task adaptive attention map by manipulating the bottom-up attention with our proposed Top-down Attention Guided Module, then use the task adaptive attention map to crop the patch. (2) Area-Attention Label Mixing (Sec. 4.2): adjust label mixing based on the ratio of attention and area.

For data mixing techniques, the labels of the original data can naturally serve as the task at hand, and the execution logic of human gaze can be modeled by combining the bottom-up features of the original image with the high-level guidance of the original label. In this paper, inspired by the task-guided mechanism of human gaze, we extend saliency-based CutMix to a general framework that balances top-down and bottom-up attention to cut and mix the training samples. Bottom-up attention learns features from the original input and looks for all salient regions, while top-down attention uses characteristics of the category, as the current task, to adjust attention. As illustrated in Figure 1, our top-down attention mixup (TdAttenMix) crops the exemplary head region of the dog image, pastes it on the background area of the target bird image, and finally obtains an image-label consistent mixed image that differs significantly from the mixed data generated by SaliencyMix. As portrayed in Figure 2, our TdAttenMix involves two steps: Task Adaptive Attention Guided CutMix and Area-Attention Label Mixing. The first step generalizes the bottom-up attention of saliency-based CutMix to task adaptive attention via our proposed Top-down Attention Guided Module. When mixing the image, we use Max-Min Attention Region Mixing (Chen et al. 2023) to select the maximum attention region from a source image and paste it onto the region with the minimal attention score in a target image. The second step determines the label mixing with the Area-Attention Label Mixing module. Unlike the conventional approach of area-based label assignment, this module incorporates the area ratio of the mixed image along with the attention ratio of the task adaptive attention map. In the end, our TdAttenMix framework produces the mixed training sample $(x_M, y_M)$ in Figure 2. The saliency-based CutMix variants aim to produce a sufficient amount of image-label consistent mixed data.
However, to the best of our knowledge, existing methods lack a quantitative approach to assess image-label inconsistency. The core of quantitative analysis lies in establishing the correct ground truth for label assignment. Motivated by the notion that gaze mirrors human vision (Huang et al. 2020), we propose to use gaze attention on the ARISTO dataset (Liu et al. 2022b), which collects the real gaze of participants performing fine-grained recognition tasks. These data are used to create mixed labels, serving as the ground truth to investigate the issue of image-label inconsistency.

The contribution of this paper is three-fold:
- To the best of our knowledge, this paper is the first to propose a Top-down Attention Guided Module that integrates human gaze into an artificial attention balancing both top-down and bottom-up attention, in order to crop the task-relevant patch and adjust the label mixing ratio.
- Extensive experiments demonstrate that TdAttenMix boosts performance and achieves state-of-the-art top-1 accuracy on CIFAR-100, Tiny-ImageNet, CUB-200 and ImageNet-1k. Moreover, as shown in Figure 1, TdAttenMix achieves this without heavy computational overhead.
- We quantitatively explore the image-label inconsistency problem in image mixing. The proposed method effectively reduces the image-label inconsistency and improves performance.

2 Related Work

2.1 CutMix and its variants

CutMix (Yun et al. 2019) randomly crops a patch from the source image and pastes it onto the corresponding location in the target image, with labels being a linear mixture of the source and target image labels proportionate to the area ratio. Since random cropping ignores regional saliency information, researchers have proposed a series of saliency-based variants built on bottom-up attention. AttentiveMix (Walawalkar et al. 2020) and SaliencyMix (Uddin et al.
2021) guide mixing patches by salient regions in the image (based on class activation mapping or a saliency detector (Montabone and Soto 2010)). Subsequently, PuzzleMix (Kim, Choo, and Song 2020) and Co-Mixup (Kim et al. 2020) propose combinatorial optimization strategies to find optimal mixups that maximize the saliency information. AutoMix (Liu et al. 2022c) then adaptively generates mixed samples based on mixing ratios and feature maps in an end-to-end manner. Inspired by the success of the Vision Transformer (ViT) (Dosovitskiy et al. 2021), TokenMixup (Choi, Choi, and Kim 2022) adaptively generates mixed images based on the attention map. Moreover, concerning label assignment, recent studies have also adjusted label assignment by bottom-up attention: TransMix (Chen et al. 2022) mixes labels based on the class attention score, and TokenMix (Liu et al. 2022a) assigns content-based mixed labels to mixed images. Recently, SMMix (Chen et al. 2023) motivates both image and label enhancement by the bottom-up self-attention of the ViT-based model under training itself. However, these existing variants, focusing either on enhancing saliency or on adjusting label assignment, rely on bottom-up attention, which is susceptible to distraction by salient but label-inconsistent background areas. To relieve label inconsistency, we introduce task adaptive top-down attention into CutMix variants for the first time and propose our framework TdAttenMix.

2.2 Computational modeling of attention

Computational modeling of human visual attention intersects various disciplines such as neuroscience, cognitive psychology, and computer vision. Biologically-inspired attention mechanisms can enhance the interpretability of artificial intelligence (Vuyyuru et al. 2020). Attention can be categorized into bottom-up and top-down mechanisms (Connor, Egeth, and Yantis 2004). Initially, the focus was primarily on computational modeling of bottom-up attention.
Based on Treisman's seminal work describing feature integration theory (Treisman and Gelade 1980), current approaches assume a central role for the saliency map. Within the theory, attention shifts are generated from the saliency map using the winner-take-all algorithm (Koch and Ullman 1985). Consequently, the majority of studies have focused on improving the estimation of the saliency map (Borji and Itti 2012; Riche et al. 2013). More recently, self-attention (Dosovitskiy et al. 2021), a stimulus-driven approach that highlights all salient objects in an image, represents a typical bottom-up attention mechanism. With the advent of increasingly large eye-tracking datasets (Liu et al. 2022b; Jiang et al. 2015), researchers have been inspired to explore task-guided top-down attention. Shi et al. (Shi, Darrell, and Wang 2023) propose a top-down modulated ViT model by mimicking the task-guided mechanism of human gaze. Schwinn et al. (Schwinn et al. 2022) impose a biologically-inspired foveated vision constraint on neural networks to generate human-like scanpaths without training for this objective. As for CutMix variants, previous saliency-based methods have utilized bottom-up attention to optimize cropping regions, whereas we explore the use of task-adaptive top-down attention to obtain a cropped region that is more consistent with the label.

3 Preliminary

CutMix augmentation. CutMix (Yun et al. 2019) is a simple data augmentation technique that combines two pairs of inputs and labels. $x$ and $y$ represent a training image and its corresponding label, where $x \in \mathbb{R}^{H \times W \times C}$. To create a new augmented training sample $(x_M, y_M)$, CutMix (Yun et al. 2019) utilizes a source image-label pair $(x_A, y_A)$ and a target image-label pair $(x_B, y_B)$.
Mathematically, this can be expressed as follows:

$$x_M = M \odot x_A + (1 - M) \odot x_B \quad (1)$$
$$y_M = \lambda_r y_A + (1 - \lambda_r) y_B \quad (2)$$

$M \in \{0, 1\}^{H \times W}$ denotes a rectangular binary mask indicating where to drop or keep pixels in the two images, $\odot$ is element-wise multiplication, and $\lambda_r$ indicates the area ratio of $x_A$ in the mixed image $x_M$, i.e., $\lambda_r = \frac{\sum M}{HW}$.

4 Framework of TdAttenMix

This section formally introduces our TdAttenMix, a general image mixing framework that balances top-down and bottom-up attention to simulate the task-guided mechanism of human gaze when cropping the patch and adjusting the label mixing ratio. Figure 2 illustrates an overview of our proposed TdAttenMix. Details are given below.

4.1 Task Adaptive Attention Guided CutMix

We want to simulate the execution logic of human gaze, which is initially guided by bottom-up features and then strongly influenced by the current task.

Bottom-up Attention. We divide the source image $x_A$ and the target image $x_B$ into non-overlapping patches of size $P \times P$. Each image yields a total of $N = \frac{HW}{P^2}$ patches. Consequently, $x_A$ and $x_B$ are restructured as $t_A, t_B \in \mathbb{R}^{N \times (P^2 C)}$, where each row corresponds to a token and $d = P^2 C$. As illustrated in Figure 2, we follow SMMix (Chen et al. 2023), which obtains the attention map across all image tokens for the bottom-up attention (Dosovitskiy et al. 2021). We obtain $Q = t w_q$, $K = t w_k$, and $V = t w_v$, where $w_q, w_k, w_v \in \mathbb{R}^{d \times d}$ represent the learnable parameters of the fully-connected layers.

Top-down Attention Guided Module. The Top-down Attention Guided Module we propose is depicted in Figure 3. The current task at hand is the classification task. We then extract the corresponding parameters $w_{td} \in \mathbb{R}^{d \times 1}$ from the final fully-connected layer of the Vision Transformer, based on the current label. The parameter matrix of this layer mirrors the mapping between features and categories. Thus, we can acquire the high-level guidance $V_{td}$ tied to a specific category by computing it from the image feature $t$.
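As a concrete illustration, the tokenization and the extraction of the label-specific classifier column can be sketched in a few lines of numpy. This is a toy sketch, not the paper's implementation: the sizes are made up, random matrices stand in for learned weights, and `w_td` is simply the classifier column for an assumed label index.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: a 32x32 RGB image split into 8x8 patches.
H = W = 32; C = 3; P = 8
N = (H * W) // (P * P)           # number of tokens, N = HW / P^2
d = P * P * C                    # token dimension, d = P^2 * C
num_classes = 10

image = rng.standard_normal((H, W, C))

# Patchify: reshape the image into N tokens of dimension P^2 * C.
t = (image.reshape(H // P, P, W // P, P, C)
          .transpose(0, 2, 1, 3, 4)
          .reshape(N, d))

# Learnable projections (random stand-ins here) give Q, K, V.
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
Q, K, V = t @ w_q, t @ w_k, t @ w_v

# The final classifier weight matrix maps features to classes; the
# column for the current label plays the role of w_td in the paper.
w_fc = rng.standard_normal((d, num_classes))
label = 3                        # hypothetical label index, e.g. "dog"
w_td = w_fc[:, label:label + 1]  # shape (d, 1)
```

With these toy sizes, $N = 16$ and $d = 192$, so `t` has shape (16, 192) and `w_td` has shape (192, 1).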
Shi et al. (Shi, Darrell, and Wang 2023) showed that top-down attention can be implemented by simply augmenting $V$ with $V_{td}$ while keeping $K$ and $Q$ unchanged. We ensure that the dimensionality of $V_{td}$ matches that of $V$ through broadcasting. Furthermore, we introduce a tunable balanced factor $\sigma$ to control the strength of the top-down features $V_{td}$. If $\sigma = 0$, our attention map coincides with the bottom-up attention used by SMMix (Chen et al. 2023). If $\sigma = 1$, the attention map integrates the bottom-up features with the top-down features. As a result, we calculate the task adaptive balanced attention as follows:

$$V_{td} = \sigma \cdot \mathrm{broadcast}(t w_{td}) \quad (3)$$
$$V' = V + V_{td} \quad (4)$$
$$\mathrm{Attention}(Q, K, V') = \mathrm{Softmax}\left(\frac{QK^T}{\sqrt{d}}\right) V' \quad (5)$$

Figure 3: To simulate the top-down mechanism, we design the Top-down Attention Guided Module, which uses the image label as high-level task information to guide image feature generation, yielding what we refer to as the top-down signal. This top-down signal then constrains bottom-up attention to focus on regions related to the image label.

Subsequently, the resulting task adaptive attention maps $\alpha_A, \alpha_B \in \mathbb{R}^{\frac{H}{P} \times \frac{W}{P}}$, corresponding to $x_A$ and $x_B$, are obtained after a reshaping operation. The attention map is now task adaptive: it focuses on the object indicated by the current task while ignoring irrelevant high-saliency objects. The criterion for cropping is the sum of attention scores within a given region. We then identify the region with the maximum attention score in the source image, and the region with the minimal attention score in the target image.
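Eqs. (3)-(5) amount to adding a scaled, broadcast top-down signal to the value matrix before standard scaled dot-product attention. A minimal single-head numpy sketch (shapes and the absence of multi-head projections are simplifications; `sigma = 0` recovers plain bottom-up attention):

```python
import numpy as np

def task_adaptive_attention(t, Q, K, V, w_td, sigma):
    """Balance bottom-up and top-down attention (Eqs. (3)-(5))."""
    d = Q.shape[-1]
    # Eq. (3): top-down signal, broadcast from (N, 1) to the shape of V.
    V_td = sigma * np.broadcast_to(t @ w_td, V.shape)
    # Eq. (4): augment the value matrix with the top-down signal.
    V_prime = V + V_td
    # Eq. (5): scaled dot-product attention with a numerically stable softmax.
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V_prime

rng = np.random.default_rng(0)
N, d = 16, 32
t = rng.standard_normal((N, d))
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
w_td = rng.standard_normal((d, 1))

bottom_up = task_adaptive_attention(t, Q, K, V, w_td, sigma=0.0)
balanced = task_adaptive_attention(t, Q, K, V, w_td, sigma=1.0)
```

Varying `sigma` here corresponds to the ablation in the paper's Table 5, which sweeps the balanced factor from pure bottom-up attention upwards.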
Specifically, the center indices are defined as:

$$(i_s, j_s) = \operatorname*{argmax}_{i,j} \sum_{p,q} \alpha_A^{i+p,\, j+q} \quad (6)$$
$$(i_t, j_t) = \operatorname*{argmin}_{i,j} \sum_{p,q} \alpha_B^{i+p,\, j+q} \quad (7)$$

where $h = \delta \frac{H}{P}$, $w = \delta \frac{W}{P}$, $\delta = \mathrm{Uniform}(0.25, 0.75)$, $p \in \{0, 1, \ldots, h-1\}$, and $q \in \{0, 1, \ldots, w-1\}$. Then we use Max-Min Attention Region Mixing (Chen et al. 2023), which replaces the minimal attention region with the maximum attention region to obtain the new mixed training image $x_M$:

$$x_M = x_B \quad (8)$$
$$x_M^{i_t+p,\, j_t+q} = x_A^{i_s+p,\, j_s+q} \quad (9)$$

To verify the validity of the obtained mixed image $x_M$, we examine the prediction accuracy on $x_M$. As graphically represented in Figure 4, the prediction accuracy on mixed samples is significantly improved by our method. Notably, for top-2 accuracy, our TdAttenMix achieves 20.92% while SaliencyMix (Uddin et al. 2021) only reaches 10.00%. This demonstrates that we obtain mixed images consistent with the labels of the source and target images.

Figure 4: Top-1 accuracy of mixed data. A prediction is counted as correct if the top-1 prediction belongs to $\{y_A, y_B\}$; top-2 accuracy counts the cases where the top-2 predictions equal $\{y_A, y_B\}$. $\lambda_r$ indicates the area ratio of $x_A$ in the mixed image $x_M$.

4.2 Area-Attention Label Mixing

To enhance the precision of the mixed label $y_M$, starting from the area ratio (Eq. 2) used by CutMix (Yun et al. 2019), we adjust the area ratio using the attention scores of $\alpha_A$ and $\alpha_B$ at their respective positions within the mixed image $x_M$.
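The max-min region mixing above, together with the area-attention label mixing that this section goes on to define, can be sketched on the patch grid as follows. This is an illustrative numpy sketch: the map size, label indices, and per-patch "images" are made up, and Eq. (12) is read as the target's total attention minus the attention covered by the pasted region.

```python
import numpy as np

def best_window(att, h, w, largest):
    """Score every h x w window by its summed attention (Eqs. (6)-(7))
    and return the winning top-left index."""
    scores = {(i, j): att[i:i + h, j:j + w].sum()
              for i in range(att.shape[0] - h + 1)
              for j in range(att.shape[1] - w + 1)}
    pick = max if largest else min
    return pick(scores, key=scores.get)

rng = np.random.default_rng(0)
Hp = Wp = 14                                  # attention maps, in patch units
alpha_A, alpha_B = rng.random((Hp, Wp)), rng.random((Hp, Wp))
delta = rng.uniform(0.25, 0.75)
h = w = max(1, round(delta * Hp))

i_s, j_s = best_window(alpha_A, h, w, largest=True)   # Eq. (6)
i_t, j_t = best_window(alpha_B, h, w, largest=False)  # Eq. (7)

# Eqs. (8)-(9): start from the target and paste in the source region
# (shown on toy per-patch images for brevity).
x_A, x_B = rng.random((Hp, Wp, 3)), rng.random((Hp, Wp, 3))
x_M = x_B.copy()
x_M[i_t:i_t + h, j_t:j_t + w] = x_A[i_s:i_s + h, j_s:j_s + w]

# Area-attention label mixing (Sec. 4.2, Eqs. (10)-(15)).
lam_r = (h * w) / (Hp * Wp)                           # area ratio
att_A = alpha_A[i_s:i_s + h, j_s:j_s + w].sum()
att_B = alpha_B.sum() - alpha_B[i_t:i_t + h, j_t:j_t + w].sum()
lam_a = att_A / (att_A + att_B)                       # attention ratio
beta = 0.5
lam = beta * lam_r + (1 - beta) * lam_a               # final mixing ratio
y_A, y_B = np.eye(10)[3], np.eye(10)[7]               # toy one-hot labels
y_M = lam * y_A + (1 - lam) * y_B                     # mixed label
```

With `beta = 0.5` the final ratio averages the geometric area share and the attention share, so a small but highly attended source patch still contributes noticeably to the mixed label.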
More specifically, the final mixing ratio $\lambda$ is defined as follows:

$$\lambda_r = \frac{hw P^2}{HW} \quad (10)$$
$$\mathrm{Att}_A = \sum_{p,q} \alpha_A^{i_s+p,\, j_s+q} \quad (11)$$
$$\mathrm{Att}_B = \sum \alpha_B - \sum_{p,q} \alpha_B^{i_t+p,\, j_t+q} \quad (12)$$
$$\lambda_a = \frac{\mathrm{Att}_A}{\mathrm{Att}_A + \mathrm{Att}_B} \quad (13)$$
$$\lambda = \beta \lambda_r + (1 - \beta) \lambda_a \quad (14)$$

$\lambda_r$ is the area ratio of $x_A$ in the mixed image $x_M$; $\mathrm{Att}_A$ and $\mathrm{Att}_B$ are the sums of the task adaptive attention scores at the positions corresponding to $x_M$ in $\alpha_A$ and $\alpha_B$; $\lambda_a$ is the attention ratio of $x_A$ in $x_M$; $\beta = 0.5$; and $\lambda$ is the final mixing ratio of $x_A$ in the mixed image $x_M$. The final mixed label $y_M$ is then defined as:

$$y_M = \lambda y_A + (1 - \lambda) y_B \quad (15)$$

We thus obtain the new mixed training sample $(x_M, y_M)$.

4.3 Training Objective

Our TdAttenMix framework is independent of the training model and can be used with various mainstream architectures. When deployed on a ResNet-based architecture, we employ the standard classification loss and the consistency constraint loss proposed in SMMix (Chen et al. 2023). The traditional classification loss is defined as follows, where $Y_M$ is the prediction distribution of the mixed images:

$$L_{cls} = CE(Y_M, y_M) \quad (16)$$

We additionally require a feature consistency constraint loss (Chen et al. 2023), which helps features of the mixed images fall into a space consistent with those of the original unmixed images. The feature consistency constraint loss in our TdAttenMix is:

$$L_{con} = L_1(Y_M, \lambda Y_A + (1 - \lambda) Y_B) \quad (17)$$

$Y_A$ and $Y_B$ are the prediction distributions of the unmixed images $x_A$ and $x_B$. Overall, the training loss is written as:

$$L_{total} = L_{cls} + L_{con} \quad (18)$$

When deployed on a ViT-based architecture, we use the same loss function as SMMix (Chen et al.
2023), which proves effective in learning features from mixed samples:

$$L_{fine} = \frac{1}{2}\left(CE(Y_A, y_A) + CE(Y_B, y_B)\right) \quad (19)$$
$$L_{total} = L_{cls} + L_{fine} + L_{con} \quad (20)$$

5 Experiments

We evaluate TdAttenMix in four aspects: 1) image classification on eight different benchmarks; 2) transfer of pre-trained models to two downstream tasks; 3) robustness in three scenarios, including occlusion and two out-of-distribution datasets; and 4) the first quantitative study of the effectiveness of saliency-based methods in reducing image-label inconsistency. Our TdAttenMix is highlighted in gray, and bold denotes the best results.

5.1 Small-scale Classification

For small-scale classification we use ResNet-18 (He et al. 2016) and ResNeXt-50 (Xie et al. 2017) to compare performance. Hyperparameter settings are given in Section 1 of the Supplementary. Table 1 shows small-scale classification results on CIFAR-100, Tiny-ImageNet and CUB-200. Compared to previous SOTA methods, TdAttenMix consistently surpasses AutoMix (+0.08 to +0.84), PuzzleMix (+1.18 to +2.97), and ManifoldMix (+0.34 to +3.35) across various ResNet architectures. Moreover, TdAttenMix exhibits a significant gap over SaliencyMix (+2.50 to +4.25).

5.2 ImageNet Classification

Table 1 validates the performance advantage of TdAttenMix over other methods. In particular, TdAttenMix boosts top-1 accuracy by more than +1% on ResNet-18 (He et al. 2016) and DeiT-S (Touvron et al. 2021b) compared with the SaliencyMix baseline and achieves the state-of-the-art result. Note that TransMix, TokenMix and SMMix also exhibit good top-1 accuracy, but they are ViT-specific methods, making them incompatible with other mainstream architectures (e.g., ResNet). We provide more comparisons with ViT-specific methods in Section 2 of the Supplementary, and the additional experiments further prove the effectiveness of our TdAttenMix.
In contrast, our TdAttenMix is an independent data augmentation method that is compatible with mainstream architectures.

5.3 Downstream Tasks

Semantic segmentation. We use ADE20k (Zhou et al. 2017) to evaluate performance on the semantic segmentation task. ADE20k is a challenging scene parsing dataset covering 150 semantic categories, with 20k, 2k, and 3k images for training, validation and testing. We evaluate DeiT backbones with UperNet (Xiao et al. 2018). As shown in Table 2, TdAttenMix improves DeiT-S by +1.6% mIoU and +2.5% mAcc.

Weakly supervised automatic segmentation (WSAS). We compute the Jaccard similarity over the PASCAL-VOC12 benchmark (Everingham et al. 2015). The attention masks generated by TdAttenMix-DeiT-S or vanilla DeiT-S are compared with the ground truth on this benchmark. The resulting scores quantitatively indicate whether TdAttenMix has a positive effect on the quality of the attention map. As shown in Table 2, TdAttenMix improves DeiT-S by +3.3%.

5.4 Robustness Analysis

Robustness to Occlusion. Naseer et al. (Naseer et al. 2021) study whether ViTs perform robustly in occluded scenarios, where some or most of the image content is missing. Following (Naseer et al. 2021), we report classification accuracy on the ImageNet-1k validation set under three dropping settings: (1) Random Patch Dropping; (2) Salient (foreground) Patch Dropping; (3) Non-salient (background) Patch Dropping. As depicted in Figure 5, DeiT-S augmented with TdAttenMix outperforms the standard DeiT-S across all occlusion levels.

Out-of-distribution Datasets. We evaluate our TdAttenMix on two out-of-distribution datasets. (1) The ImageNet-A dataset (Hendrycks et al. 2021). The metrics for assessing classifiers' robustness to adversarially filtered examples include top-1 accuracy, Calibration Error (Calib-Error) (Hendrycks et al. 2021; Kumar, Liang, and Ma 2019), and the Area Under the Response Rate Accuracy Curve (AURRA) (Hendrycks et al.
2021). (2) The ImageNet-O dataset (Hendrycks et al. 2021), for which the metric is the area under the precision-recall curve (AUPR) (Hendrycks et al. 2021). Table 3 indicates that TdAttenMix achieves consistent performance gains over vanilla DeiT-S on the out-of-distribution data.

| Method | CIFAR-100 R-18 | CIFAR-100 RX-50 | Tiny-ImageNet R-18 | Tiny-ImageNet RX-50 | CUB-200 R-18 | CUB-200 RX-50 | ImageNet-1k R-18 | ImageNet-1k DeiT-S |
|---|---|---|---|---|---|---|---|---|
| Vanilla (Li et al. 2023) | 78.04 | 81.09 | 61.68 | 65.04 | 77.68 | 83.01 | 70.04 | 75.70 |
| SaliencyMix (Uddin et al. 2021) | 79.12 | 81.53 | 64.60 | 66.55 | 77.95 | 83.29 | 69.16 | 79.88 |
| MixUp (Zhang et al. 2018) | 79.12 | 82.10 | 63.86 | 66.36 | 78.39 | 84.58 | 69.98 | 79.65 |
| CutMix (Yun et al. 2019) | 78.17 | 81.67 | 65.53 | 66.47 | 78.40 | 85.68 | 68.95 | 79.78 |
| ManifoldMix (Verma et al. 2019) | 80.35 | 82.88 | 64.15 | 67.30 | 79.76 | 86.38 | 69.98 | - |
| SmoothMix (Lee et al. 2020) | 78.69 | 80.68 | 66.65 | 69.65 | - | - | - | - |
| AttentiveMix (Walawalkar et al. 2020) | 78.91 | 81.69 | 64.85 | 67.42 | - | - | 68.57 | 80.32 |
| PuzzleMix (Kim, Choo, and Song 2020) | 81.13 | 82.85 | 65.81 | 67.83 | 78.63 | 84.51 | 70.12 | 80.45 |
| Co-Mixup (Kim et al. 2020) | 81.17 | 82.91 | 65.92 | - | - | - | - | - |
| GridMix (Baek, Bang, and Shim 2021) | 78.72 | 81.11 | 65.14 | 66.53 | - | - | - | - |
| TransMix (Chen et al. 2022) | - | - | - | - | - | - | - | 80.68 |
| TokenMix (Liu et al. 2022a) | - | - | - | - | - | - | - | 80.80 |
| SMMix (Chen et al. 2023) | - | - | - | - | - | - | - | 81.08 |
| AutoMix (Liu et al. 2022c) | 82.04 | 83.64 | 67.33 | 70.72 | 79.87 | 86.56 | 70.50 | 80.78 |
| TdAttenMix (Ours) | 82.36 | 84.03 | 67.47 | 70.80 | 80.71 | 86.72 | 70.74 | 81.19 |
| Gain | +3.24 | +2.50 | +2.87 | +4.25 | +2.76 | +3.43 | +1.58 | +1.31 |

Table 1: Image classification top-1 accuracy (%) on CIFAR-100, Tiny-ImageNet, CUB-200 and ImageNet-1k (networks: R-18 = ResNet-18, RX-50 = ResNeXt-50). We take the performance of previous methods from the OpenMixup (Li et al. 2023) benchmark. Gain: performance improvement compared with SaliencyMix.

Figure 5: Robustness against occlusion. Model robustness against occlusion with different information loss ratios is studied under three patch dropping settings: Random Patch Dropping (left), Salient Patch Dropping (middle), and Non-Salient Patch Dropping (right).

| Models | Semantic segmentation mIoU (%) | Semantic segmentation mAcc (%) | WSAS Segmentation JI (%) |
|---|---|---|---|
| DeiT-S | 31.6 | 44.4 | 29.2 |
| TdAttenMix-DeiT-S | 33.2 | 46.9 | 32.5 |
| Gain | +1.6 | +2.5 | +3.3 |

Table 2: Downstream tasks. Transferring the pre-trained models to the semantic segmentation task using UperNet with a DeiT backbone on the ADE20k dataset; Segmentation JI denotes the Jaccard index for weakly supervised automatic segmentation (WSAS) on Pascal VOC.

| Models | Nat. Adv. Example Top-1 Acc. | Nat. Adv. Example Calib-Error | Nat. Adv. Example AURRA | Out-of-Dist AUPR |
|---|---|---|---|---|
| DeiT-S | 19.1 | 32.0 | 23.8 | 20.9 |
| TdAttenMix-DeiT-S | 22.0 | 30.4 | 29.7 | 22.0 |
| Gain | +2.9 | +1.6 | +5.9 | +1.1 |

Table 3: Model robustness against natural adversarial examples on ImageNet-A and out-of-distribution examples on ImageNet-O.

5.5 Image-label Inconsistency Analysis

Previous image mixing methods did not quantitatively validate image-label inconsistency. Motivated by the fact that gaze reflects human vision (Huang et al. 2020), we propose using the mixed label derived from gaze attention as the ground truth to study the problem of image-label inconsistency. For our experiments, we utilize the ARISTO dataset (Liu et al. 2022b) and the corresponding raw images. Since $\lambda$ determines the mixed label, the image-label inconsistency can be represented by the difference between $\lambda$ and the ground truth $\lambda_{gt}$ obtained from gaze attention for the same mixed image. We therefore define the metric as:

$$\mathrm{Inconsistency} = |\lambda_{gt} - \lambda| \quad (21)$$

$\lambda_{gt}$ is calculated from real human gaze, while $\lambda$ is calculated by the different CutMix variants. As shown in Table 4, the inconsistency is effectively reduced by saliency-based methods. Our TdAttenMix is +7.8 better than the random-based CutMix (Yun et al. 2019). The result of TdAttenMix-Bottom-up, which uses only bottom-up attention, is close to that of SaliencyMix (Uddin et al. 2021). This may be because neither TdAttenMix-Bottom-up nor SaliencyMix has task adaptive ability, so their image-label inconsistency is stronger than that of our full TdAttenMix. These experimental findings strongly support the notion that saliency-based CutMix variants enhance training by mitigating image-label inconsistency, with top-down attention being more effective than bottom-up attention.

| Method | Inconsistency |
|---|---|
| CutMix (Yun et al. 2019) | 26.2 |
| SaliencyMix (Uddin et al. 2021) | 18.9 |
| TdAttenMix-Bottom-up | 19.0 |
| TdAttenMix | 18.4 |
| Gain | +7.8 |

Table 4: Image-label inconsistency of different saliency-based CutMix variants. TdAttenMix-Bottom-up sets $\sigma$ to 0, reducing the task adaptive balanced attention of TdAttenMix to standard bottom-up attention. Gain: reduction of error.

Figure 6: Class activation maps (Selvaraju et al. 2017) of the models trained with CutMix and TdAttenMix, tested on unmixed and mixed images, respectively. Left: locating objects in the unmixed images. Right: locating objects in the mixed images.

5.6 Visualization

In Figure 6, we visualize the class activation maps (Selvaraju et al. 2017) of the models trained with CutMix and TdAttenMix. As shown in the left of Figure 6, the TdAttenMix model locates objects more precisely than the CutMix model in the unmixed images.
Furthermore, the right of Figure 6 shows that, for the mixed images, the TdAttenMix model accurately locates objects from the two different images, whereas the CutMix model focuses only on the class of image A. Our TdAttenMix is guided by task adaptive attention, which ensures that the information in the training data is sufficient, enabling superior recognition capacity for mixed images.

5.7 Ablation Study

We conduct an ablation study to analyze our proposed TdAttenMix. We use ResNet-18 (He et al. 2016) as the backbone and train it on CUB-200 (Wah et al. 2011).

Control of Task Adaptive Balanced Attention. Our TdAttenMix balances top-down and bottom-up attention by adjusting the top-down signal $V_{td}$, enabling a shift from standard bottom-up to top-down attention. We evaluate six task adaptive balanced attention settings: 1) $\sigma = 0$, 2) $\sigma = 0.5$, 3) $\sigma = 1$, 4) $\sigma = 2$, 5) $\sigma = 3$, 6) $\sigma = 4$. These settings represent a gradual increase in task adaptive ability on top of the bottom-up features. As Table 5 shows, a moderate top-down signal ($\sigma = 1$) achieves the best result. This is consistent with the execution logic of human gaze, in which a top-down signal on top of the bottom-up features directs attention.

| Model | $\sigma$ | Top-1 Acc. (%) |
|---|---|---|
| ResNet-18 (He et al. 2016) | 0 | 80.31 |
| | 0.5 | 80.60 |
| | 1 | 80.71 |
| | 2 | 80.29 |
| | 3 | 79.89 |
| | 4 | 79.50 |

Table 5: Control of task adaptive balanced attention. As shown in Eq. 3, the task adaptive balanced attention is controlled by $\sigma$; $\sigma = 0$ represents standard bottom-up attention.

Mix ratio $\beta$ of area-attention label mixing. $\beta$ determines the weighting in the area-attention label mixing (Eq. 14). We evaluate several values of $\beta$: 1) fixed at 0, meaning only the attention ratio is used to mix labels; 2) fixed at 0.3; 3) fixed at 0.5, which assigns equal weight to the area ratio and the attention ratio; 4) fixed at 0.7; 5) fixed at 1, meaning only the area ratio is used to mix labels; 6) a random $\beta$, drawn between 0 and 1 for each sample. Table 6 shows that the best results are obtained when $\beta$ is set to 0.5.

| Model | $\beta$ | Top-1 Acc. (%) |
|---|---|---|
| ResNet-18 (He et al. 2016) | 0 | 80.20 |
| | 0.3 | 80.29 |
| | 0.5 | 80.71 |
| | 0.7 | 80.19 |
| | 1 | 80.27 |
| | random | 80.29 |

Table 6: Mix ratio $\beta$ of area-attention label mixing.

6 Conclusion

This paper proposes TdAttenMix, a general and effective data augmentation framework. Motivated by the superiority of human gaze, we simulate its task-guided mechanism to modulate attention. TdAttenMix introduces a new Top-down Attention Guided Module that balances bottom-up attention towards task-related regions. Extensive experiments verify the effectiveness and robustness of TdAttenMix, which significantly improves performance on various datasets and backbones. Furthermore, we quantitatively validate, for the first time, that our method and saliency-based methods can efficiently reduce image-label inconsistency.

Acknowledgements

This work was supported by Beijing Natural Science Foundation (L242019). Dr. Lin Gu was also supported by JST Moonshot R&D Grant Number JPMJMS2011, Japan.

References

Baek, K.; Bang, D.; and Shim, H. 2021. GridMix: Strong regularization through local context mapping. Pattern Recognition, 109: 107594.

Belardinelli, A.; Herbort, O.; and Butz, M. V. 2015. Goal-oriented gaze strategies afforded by object interaction. Vision Research, 106: 47–57.

Borji, A.; and Itti, L. 2012. State-of-the-art in visual attention modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1): 185–207.

Buswell, G. T. 1935. How People Look at Pictures: A Study of the Psychology and Perception in Art.

Chen, C.-F. R.; Fan, Q.; and Panda, R. 2021. CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification.
In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 357–366.
Chen, J.-N.; Sun, S.; He, J.; Torr, P. H.; Yuille, A.; and Bai, S. 2022. TransMix: Attend To Mix for Vision Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 12135–12144.
Chen, M.; Lin, M.; Lin, Z.; Zhang, Y.; Chao, F.; and Ji, R. 2023. SMMix: Self-Motivated Image Mixing for Vision Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 17260–17270.
Cheng, Z.; Liang, J.; Choi, H.; Tao, G.; Cao, Z.; Liu, D.; and Zhang, X. 2022. Physical Attack on Monocular Depth Estimation with Optimal Adversarial Patches. In Avidan, S.; Brostow, G.; Cissé, M.; Farinella, G. M.; and Hassner, T., eds., Computer Vision – ECCV 2022, 514–532. Cham: Springer Nature Switzerland.
Choi, H. K.; Choi, J.; and Kim, H. J. 2022. TokenMixup: Efficient attention-guided token-level data augmentation for transformers. Advances in Neural Information Processing Systems, 35: 14224–14235.
Connor, C. E.; Egeth, H. E.; and Yantis, S. 2004. Visual attention: bottom-up versus top-down. Current Biology, 14(19): R850–R852.
Cui, Y.; Yan, L.; Cao, Z.; and Liu, D. 2021. TF-Blender: Temporal Feature Blender for Video Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 8138–8147.
Cui, Z.; Zhu, Y.; Gu, L.; Qi, G.-J.; Li, X.; Zhang, R.; Zhang, Z.; and Harada, T. 2022. Exploring Resolution and Degradation Clues as Self-supervised Signal for Low Quality Object Detection. In Avidan, S.; Brostow, G.; Cissé, M.; Farinella, G. M.; and Hassner, T., eds., Computer Vision – ECCV 2022, 473–491. Cham: Springer Nature Switzerland.
Dabouei, A.; Soleymani, S.; Taherkhani, F.; and Nasrabadi, N. M. 2021. SuperMix: Supervising the Mixing Data Augmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 13794–13803.
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; Uszkoreit, J.; and Houlsby, N. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations.
Everingham, M.; Eslami, S. A.; Van Gool, L.; Williams, C. K.; Winn, J.; and Zisserman, A. 2015. The Pascal Visual Object Classes Challenge: A Retrospective. International Journal of Computer Vision, 111: 98–136.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
Hendrycks, D.; Zhao, K.; Basart, S.; Steinhardt, J.; and Song, D. 2021. Natural adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 15262–15271.
Huang, Y.; Cai, M.; Li, Z.; Lu, F.; and Sato, Y. 2020. Mutual context network for jointly estimating egocentric gaze and action. IEEE Transactions on Image Processing, 29: 7795–7806.
Jiang, M.; Huang, S.; Duan, J.; and Zhao, Q. 2015. SALICON: Saliency in Context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1072–1080.
Kim, J.; Choo, W.; Jeong, H.; and Song, H. O. 2020. Co-Mixup: Saliency Guided Joint Mixup with Supermodular Diversity. In International Conference on Learning Representations.
Kim, J.-H.; Choo, W.; and Song, H. O. 2020. Puzzle Mix: Exploiting saliency and local statistics for optimal mixup. In International Conference on Machine Learning, 5275–5285. PMLR.
Koch, C.; and Ullman, S. 1985. Shifts in selective visual attention: towards the underlying neural circuitry. Human Neurobiology, 4(4): 219–227.
Kumar, A.; Liang, P. S.; and Ma, T. 2019. Verified uncertainty calibration. Advances in Neural Information Processing Systems, 32.
Lee, J.-H.; Zaheer, M. Z.; Astrid, M.; and Lee, S.-I. 2020.
SmoothMix: a simple yet effective data augmentation to train robust classifiers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 756–757.
Li, S.; Wang, Z.; Liu, Z.; Wu, D.; Tan, C.; Jin, W.; and Li, S. Z. 2023. OpenMixup: A Comprehensive Mixup Benchmark for Visual Classification. arXiv:2209.04851.
Liu, D.; Cui, Y.; Tan, W.; and Chen, Y. 2021a. SG-Net: Spatial Granularity Network for One-Stage Video Instance Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 9816–9825.
Liu, D.; Cui, Y.; Yan, L.; Mousas, C.; Yang, B.; and Chen, Y. 2021b. DenserNet: Weakly supervised visual localization using multi-scale feature aggregation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 6101–6109.
Liu, J.; Liu, B.; Zhou, H.; Li, H.; and Liu, Y. 2022a. TokenMix: Rethinking Image Mixing for Data Augmentation in Vision Transformers. In European Conference on Computer Vision, 455–471.
Liu, Y.; Zhou, L.; Zhang, P.; Bai, X.; Gu, L.; Yu, X.; Zhou, J.; and Hancock, E. R. 2022b. Where to focus: Investigating hierarchical attention relationship for fine-grained visual classification. In European Conference on Computer Vision, 57–73. Springer.
Liu, Z.; Li, S.; Wu, D.; Liu, Z.; Chen, Z.; Wu, L.; and Li, S. Z. 2022c. AutoMix: Unveiling the Power of Mixup for Stronger Classifiers. In European Conference on Computer Vision, 441–458.
Montabone, S.; and Soto, A. 2010. Human detection using a mobile platform and novel features derived from a visual saliency mechanism. Image and Vision Computing, 28(3): 391–402.
Naseer, M. M.; Ranasinghe, K.; Khan, S. H.; Hayat, M.; Shahbaz Khan, F.; and Yang, M.-H. 2021. Intriguing properties of vision transformers. Advances in Neural Information Processing Systems, 34: 23296–23308.
Riche, N.; Duvinage, M.; Mancas, M.; Gosselin, B.; and Dutoit, T. 2013. Saliency and human fixations: State-of-the-art and study of comparison metrics.
In Proceedings of the IEEE International Conference on Computer Vision, 1153–1160.
Schwinn, L.; Precup, D.; Eskofier, B.; and Zanca, D. 2022. Behind the Machine's Gaze: Neural Networks with Biologically-inspired Constraints Exhibit Human-like Visual Attention. Transactions on Machine Learning Research.
Selvaraju, R. R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; and Batra, D. 2017. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, 618–626.
Shi, B.; Darrell, T.; and Wang, X. 2023. Top-Down Visual Attention from Analysis by Synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2102–2112.
Tan, C.; Gao, Z.; Wu, L.; Li, S.; and Li, S. Z. 2022. Hyperspherical Consistency Regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 7244–7255.
Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; and Jégou, H. 2021a. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, 10347–10357. PMLR.
Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; and Jégou, H. 2021b. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, 10347–10357. PMLR.
Treisman, A. M.; and Gelade, G. 1980. A feature-integration theory of attention. Cognitive Psychology, 12(1): 97–136.
Uddin, A. F. M. S.; Monira, M. S.; Shin, W.; Chung, T.; and Bae, S.-H. 2021. SaliencyMix: A Saliency Guided Data Augmentation Strategy for Better Regularization. In International Conference on Learning Representations.
Verma, V.; Lamb, A.; Beckham, C.; Najafi, A.; Mitliagkas, I.; Lopez-Paz, D.; and Bengio, Y. 2019. Manifold mixup: Better representations by interpolating hidden states. In International Conference on Machine Learning, 6438–6447. PMLR.
Vuyyuru, M. R.; Banburski, A.; Pant, N.; and Poggio, T. 2020. Biologically inspired mechanisms for adversarial robustness. Advances in Neural Information Processing Systems, 33: 2135–2146.
Wah, C.; Branson, S.; Welinder, P.; Perona, P.; and Belongie, S. 2011. The Caltech-UCSD Birds-200-2011 Dataset.
Walawalkar, D.; Shen, Z.; Liu, Z.; and Savvides, M. 2020. Attentive CutMix: An Enhanced Data Augmentation Approach for Deep Learning Based Image Classification. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing – Proceedings.
Wu, L.; Lin, H.; Tan, C.; Gao, Z.; and Li, S. Z. 2023. Self-Supervised Learning on Graphs: Contrastive, Generative, or Predictive. IEEE Transactions on Knowledge and Data Engineering, 35(4): 4216–4235.
Xia, J.; Zhu, Y.; Du, Y.; and Li, S. Z. 2022. Pre-training Graph Neural Networks for Molecular Representations: Retrospect and Prospect. In ICML 2022 2nd AI for Science Workshop.
Xiao, T.; Liu, Y.; Zhou, B.; Jiang, Y.; and Sun, J. 2018. Unified perceptual parsing for scene understanding. In Proceedings of the European Conference on Computer Vision (ECCV), 418–434.
Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; and He, K. 2017. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1492–1500.
Yarbus, A. L. 2013. Eye Movements and Vision. Springer.
Yun, S.; Han, D.; Oh, S. J.; Chun, S.; Choe, J.; and Yoo, Y. 2019. CutMix: Regularization Strategy to Train Strong Classifiers With Localizable Features. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
Zang, Z.; Li, S.; Wu, D.; Wang, G.; Wang, K.; Shang, L.; Sun, B.; Li, H.; and Li, S. Z. 2022. DLME: Deep Local Flatness Manifold Embedding. In Avidan, S.; Brostow, G.; Cissé, M.; Farinella, G. M.; and Hassner, T., eds., Computer Vision – ECCV 2022, 576–592. Cham: Springer Nature Switzerland.
Zhang, H.; Cisse, M.; Dauphin, Y. N.; and Lopez-Paz, D. 2018.
mixup: Beyond Empirical Risk Minimization. In International Conference on Learning Representations.
Zhao, Z.; Wu, Z.; Zhuang, Y.; Li, B.; and Jia, J. 2022. Tracking objects as pixel-wise distributions. In European Conference on Computer Vision, 76–94. Springer.
Zhou, B.; Zhao, H.; Puig, X.; Fidler, S.; Barriuso, A.; and Torralba, A. 2017. Scene parsing through ADE20K dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 633–641.