# Sequencer: Deep LSTM for Image Classification

Yuki Tatsunami¹﹐² Masato Taki¹
¹Rikkyo University, Tokyo, Japan ²AnyTech Co., Ltd., Tokyo, Japan
{y.tatsunami, taki_m}@rikkyo.ac.jp

Abstract

In recent computer vision research, the advent of the Vision Transformer (ViT) has rapidly revolutionized various architectural design efforts: ViT achieved state-of-the-art image classification performance using self-attention found in natural language processing, and MLP-Mixer achieved competitive performance using simple multi-layer perceptrons. In contrast, several studies have also suggested that carefully redesigned convolutional neural networks (CNNs) can achieve advanced performance comparable to ViT without resorting to these new ideas. Against this background, there is growing interest in what inductive bias is suitable for computer vision. Here we propose Sequencer, a novel and competitive architecture alternative to ViT that provides a new perspective on these issues. Unlike ViTs, Sequencer models long-range dependencies using LSTMs rather than self-attention layers. We also propose a two-dimensional version of the Sequencer module, in which an LSTM is decomposed into vertical and horizontal LSTMs to enhance performance. Despite its simplicity, several experiments demonstrate that Sequencer performs impressively well: Sequencer2D-L, with 54M parameters, realizes 84.6% top-1 accuracy using only ImageNet-1K. We also show that it has good transferability and robust resolution adaptability over a doubled-resolution band. Our source code is available at https://github.com/okojoalg/sequencer.

1 Introduction

Figure 1: IN-1K top-1 accuracy vs. model parameters. All models are trained on IN-1K at resolution 224² from scratch.

The de facto standard for computer vision has been convolutional neural networks (CNNs) [39, 64, 22, 65, 66, 9, 29, 67]. However, inspired by the many breakthroughs in natural language processing (NLP) achieved by Transformers [75, 35, 57], applications of Transformers to computer vision are now being actively studied. In particular, the Vision Transformer (ViT) [16] is a pure Transformer applied to image recognition that achieves performance competitive with CNNs. Various studies triggered by ViT have shown that state-of-the-art (SOTA) performance can be achieved for a wide range of vision tasks using self-attention alone [79, 48, 73, 47, 15], without convolution. The reason for this success is thought to be the ability of self-attention to model long-range dependencies. However, it is still unclear how essential self-attention is to the effectiveness of Transformers for vision tasks. Indeed, MLP-Mixer [70], based only on multi-layer perceptrons (MLPs), has been proposed as an appealing alternative to Vision Transformers (ViTs). In addition, some studies [49, 14] have shown that carefully designed CNNs are still competitive with Transformers in computer vision. Therefore, identifying which architectural designs are inherently effective for computer vision tasks is of great interest for current research [83]. This paper provides a new perspective on this issue by proposing a novel and competitive alternative to these vision architectures. We propose the Sequencer architecture, which uses long short-term memory (LSTM) [27] rather than self-attention for sequence modeling.
The macro-architecture design of Sequencer follows ViTs, which iteratively apply token mixing and channel mixing, but the self-attention layer is replaced by one based on LSTMs. In particular, Sequencer uses bidirectional LSTM (BiLSTM) [63] as a building block. While a simple BiLSTM shows a certain level of performance, Sequencer can be further improved by using ideas similar to Vision Permutator (ViP) [28]. The key idea in ViP is to process the vertical and horizontal axes in parallel. We likewise introduce two BiLSTMs, one for the top/bottom direction and one for the left/right direction, in parallel. This modification improves the efficiency and accuracy of Sequencer because this structure shortens the sequences and yields a spatially meaningful receptive field.

When pre-trained on the ImageNet-1K (IN-1K) dataset, our new attention-free architecture outperforms advanced architectures such as Swin [48] and ConvNeXt [49] of comparable size; see Figure 1. It also outperforms other attention-free and CNN-free architectures such as MLP-Mixer [70] and GFNet [61], making Sequencer an attractive new alternative to the self-attention mechanism in vision tasks. This study also aims to propose a practical novel architecture by employing LSTM for spatial pattern processing. Notably, Sequencer exhibits robust resolution adaptability, which strongly prevents accuracy degradation even when the input resolution is doubled at inference time. Moreover, fine-tuning Sequencer on high-resolution data can achieve higher accuracy than Swin-B [48], and Sequencer is also useful for semantic segmentation. Regarding peak memory, Sequencer tends to be more economical than ViTs and recent CNNs for high-resolution input. Although Sequencer requires more FLOPs than other models due to recursion, higher resolution improves its relative peak-memory efficiency, enhancing the accuracy/cost trade-off in the high-resolution regime. Therefore, Sequencer also has attractive properties as a practical image recognition model.

2 Related works

Inspired by the success of Transformers in NLP [75, 35, 57, 58, 3, 60], various applications of self-attention have been studied in computer vision. For example, iGPT [6] attempted to apply autoregressive pre-training with causal self-attention [57] to image classification. However, due to the computational cost of pixel-wise attention, it could only be applied to low-resolution images, and its ImageNet classification performance was significantly inferior to the SOTA. ViT [16], on the other hand, quickly brought the image classification performance of Transformers closer to the SOTA with its idea of applying bidirectional self-attention [35] to image patches rather than pixels. Various architectural and training improvements [72, 84, 79, 90, 48, 73, 5] have been attempted for ViT [16]. In this paper, we do not improve self-attention itself but propose a completely new module for image classification to replace it.

The extent to which attention-based cross-token communication inherently contributes to ViT's success is not yet well understood. Starting with MLP-Mixer [70], which completely replaced ViT's self-attention with MLPs, various MLP-based architectures [71, 46, 28, 69, 68, 13] have achieved competitive performance on the ImageNet dataset. We refer to these architectures as global MLPs (GMLPs) because they have global receptive fields. This series of studies casts doubt on the need for self-attention.
From a practical standpoint, however, these MLP-based models have a drawback: they need to be fine-tuned to cope with flexible input sizes during inference by modifying the shape of their token-mixing MLP blocks. This resolution-adaptability problem has been improved in CycleMLP [7], for example, through the idea of realizing a local kernel with a cyclic MLP. There are similar ideas such as [82, 81, 42, 21], which are collectively referred to as local MLPs (LMLPs). Besides the MLP-based idea, several other interesting self-attention alternatives have been found. GFNet [61] uses a Fourier transformation of the tokens and mixes them by global filtering in the frequency domain. PoolFormer [83], on the other hand, achieved competitive performance with only local pooling of tokens, demonstrating that simple local operations are also a suitable alternative. Our proposed Sequencer is a new alternative to self-attention that differs from both of the above; it is an attempt to realize token mixing in vision architectures using only LSTMs. It achieves performance competitive with the SOTA on the IN-1K benchmark, with an architecture that can flexibly adapt to higher resolution.

The idea of spatial-axis decomposition has been used several times in neural architectures for computer vision. For example, SqueezeNext [17] decomposes a 3x3 convolution layer into 1x3 and 3x1 convolution layers, resulting in a lightweight model. Criss-cross attention [31] reduces memory usage and computational complexity by restricting attention to only vertical and horizontal portions. Recent architectures such as CSWin [15], Couplformer [40], ViP [28], RaftMLP [69], SparseMLP [68], and MorphMLP [86] include similar ideas to improve efficiency and performance.

In the early days of deep learning, there were attempts to use RNNs for image recognition. The earliest study that applied RNNs to image recognition is [19]. The primary difference between our study and [19] is that we utilize a usual RNN in place of a 2-multi-dimensional RNN (2MDRNN). The 2MDRNN requires H + W sequential operations, whereas the LSTM requires H sequential operations, where H and W are the height and width, respectively. For subsequent work on image recognition using 2MDRNNs, see [20, 32, 4, 43]. [4] proposed an architecture in which information is collected from four directions (upper left, lower left, upper right, and lower right) by RNNs for understanding natural scene images. [43] proposed a novel 2MDRNN for semantic object parsing that integrates global and local context information, called LG-LSTM. Its overall architecture feeds deep ConvNet features into the LG-LSTM, unlike Sequencer, which stacks LSTMs. ReNet [77] is most relevant to our work; ReNet [77] uses a 4-way LSTM and non-overlapping patches as input, and in this respect it is similar to Sequencer. Meanwhile, there are three differences. First, Sequencer is the first MetaFormer [83] realized by adopting LSTM as the token-mixing block. Sequencer also adopts a larger patch size than ReNet [77]. The benefit of adopting these designs is that we can modernize LSTM-based vision architectures and fairly compare LSTM-based models with ViT. As a result, our results provide further evidence for the extremely interesting MetaFormer hypothesis [83]. Second, the way the vertical BiLSTMs and horizontal BiLSTMs are connected is different. Our work connects them in parallel, allowing us to gather vertical and horizontal information simultaneously.
On the other hand, in ReNet [77], the output of the horizontal BiLSTM is used as input to the vertical BiLSTM. Finally, we trained Sequencer on large datasets such as ImageNet, whereas ReNet [77] is limited to small datasets such as MNIST [41], CIFAR-10 [38], and SVHN [54], and has not shown the effectiveness of LSTM for larger datasets. ReSeg [76] applied ReNet to semantic segmentation. RNNs have been applied not only to image recognition but also to generative models: PixelRNN [74] is a pixel-channel autoregressive generative model of images using a Row RNN, which consists of a 1D convolution and a usual RNN, and a Diagonal BiLSTM, which is computationally expensive. In NLP, attempts have been made to avoid the computational cost of attention by approximating causal self-attention with a recurrent neural network (RNN) [34] or replacing it with an RNN after training [33]. In particular, [34] experiments with an autoregressive pixel-wise image generation task using an architecture in which the attentions in iGPT are approximated by RNNs. These studies are specific to unidirectional Transformers, in contrast to our token-based Sequencer, which is their bidirectional analog.

3 Method

In this section, we briefly recap the preliminary background on LSTM and then describe the details of the proposed architectures.

3.1 Preliminaries: Long short-term memory

LSTM [27] is a specialized recurrent neural network (RNN) for modeling long-term dependencies of sequences. A plain LSTM has an input gate $i_t$ that controls the storage of inputs, a forget gate $f_t$ that controls the forgetting of the former cell state $c_{t-1}$, and an output gate $o_t$ that controls the cell output $h_t$ from the current cell state $c_t$. The plain LSTM is formulated as follows:

$$i_t = \sigma\left(W_{xi} x_t + W_{hi} h_{t-1} + b_i\right), \qquad f_t = \sigma\left(W_{xf} x_t + W_{hf} h_{t-1} + b_f\right), \tag{1}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh\left(W_{xc} x_t + W_{hc} h_{t-1} + b_c\right), \qquad o_t = \sigma\left(W_{xo} x_t + W_{ho} h_{t-1} + b_o\right), \tag{2}$$
$$h_t = o_t \odot \tanh(c_t), \tag{3}$$

where $\sigma$ is the logistic sigmoid function and $\odot$ is the Hadamard product.

Figure 2: (a) The architecture of Sequencers; (b) the BiLSTM2D layer, which is the main component of Sequencer2D; (c) the Transformer block, which consists of multi-head attention. In contrast, (d) the Vanilla Sequencer block and (e) the Sequencer2D block, used in our architecture, are composed of a BiLSTM or a BiLSTM2D instead of multi-head attention.

BiLSTM [63] is profitable for sequences in which mutual dependencies are expected. A BiLSTM consists of two plain LSTMs. Let $\overrightarrow{x}$ be the input sequence and $\overleftarrow{x}$ be the rearrangement of $\overrightarrow{x}$ in reverse order. $\overrightarrow{h}_{\mathrm{for}}$ and $\overleftarrow{h}_{\mathrm{back}}$ are the outputs obtained by processing $\overrightarrow{x}$ and $\overleftarrow{x}$ with the corresponding LSTMs, respectively. Let $\overrightarrow{h}_{\mathrm{back}}$ be the output $\overleftarrow{h}_{\mathrm{back}}$ rearranged back into the original order; the output of the BiLSTM is then obtained as follows:

$$\overrightarrow{h}_{\mathrm{for}}, \overleftarrow{h}_{\mathrm{back}} = \mathrm{LSTM}_{\mathrm{for}}(\overrightarrow{x}),\ \mathrm{LSTM}_{\mathrm{back}}(\overleftarrow{x}), \qquad h = \mathrm{concatenate}(\overrightarrow{h}_{\mathrm{for}}, \overrightarrow{h}_{\mathrm{back}}). \tag{4}$$

Assume that both $\overrightarrow{h}_{\mathrm{for}}$ and $\overrightarrow{h}_{\mathrm{back}}$ have the same hidden dimension $D$, which is a hyperparameter of the BiLSTM. Accordingly, the vector $h$ has dimension $2D$.
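For concreteness, Eqs. (1)-(4) translate almost line by line into code. The following is only an illustrative sketch of the textbook recurrence (all function and variable names are ours); in practice, `torch.nn.LSTM` with `bidirectional=True` implements the same computation far more efficiently and is what Sequencer relies on.

```python
import torch

def lstm_step(x_t, h_prev, c_prev, W_x, W_h, b):
    """One LSTM step, Eqs. (1)-(3); the four gates are stacked as [i, f, o, g]."""
    z = x_t @ W_x.T + h_prev @ W_h.T + b            # (B, 4*D)
    i, f, o, g = z.chunk(4, dim=-1)
    i, f, o = i.sigmoid(), f.sigmoid(), o.sigmoid()
    c_t = f * c_prev + i * torch.tanh(g)            # Eq. (2)
    h_t = o * torch.tanh(c_t)                       # Eq. (3)
    return h_t, c_t

def bilstm(x, params_fwd, params_bwd, D):
    """Bidirectional read-out, Eq. (4): a forward pass and a reversed pass that is
    re-aligned to the original order, concatenated along the channel axis."""
    B, L, _ = x.shape
    directions = []
    for params, seq in ((params_fwd, x), (params_bwd, x.flip(1))):
        h, c, hs = x.new_zeros(B, D), x.new_zeros(B, D), []
        for t in range(L):
            h, c = lstm_step(seq[:, t], h, c, *params)
            hs.append(h)
        directions.append(torch.stack(hs, dim=1))    # (B, L, D)
    h_for, h_back = directions[0], directions[1].flip(1)
    return torch.cat([h_for, h_back], dim=-1)        # (B, L, 2*D)
```

With hidden dimension $D$ per direction, the concatenated output has $2D$ channels, matching Eq. (4).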
3.2 Sequencer architecture

Overall architecture. In the last few years, ViT and its many self-attention-based variants [16, 72, 48, 91] have attracted much attention in computer vision. Following these, several works [70, 71, 46, 28] have been proposed to replace self-attention with MLPs. There have also been studies replacing self-attention with a module having a hard local inductive bias [7, 83] and with a global filter [61] based on the fast Fourier transform (FFT) algorithm [10]. This paper continues this trend and attempts to replace the self-attention layer with LSTM [27]: we propose a new architecture that mixes spatial information with LSTM, which is memory-economical compared to ViT, parameter-saving, and able to learn long-range dependencies.

Figure 2a shows the overall structure of the Sequencer architecture. The Sequencer architecture takes non-overlapping patches as input and projects them onto a feature map. The Sequencer block, the core component of Sequencer, consists of the following sub-components: (1) a BiLSTM layer, which can mix spatial information more memory-economically than a Transformer layer for high-resolution images and more globally than a CNN; (2) an MLP for channel mixing, as in [16, 70]. The Sequencer block is called a Vanilla Sequencer block when plain BiLSTM layers are used as the BiLSTM layers (Figure 2d) and a Sequencer2D block when BiLSTM2D layers are used (Figure 2e); we define the BiLSTM2D layer below. The output of the last block is sent to the linear classifier via a global average pooling layer, as in most other architectures.

BiLSTM2D layer. We propose the BiLSTM2D layer as a technique to mix 2D spatial information effectively. It has two plain BiLSTMs: a vertical BiLSTM and a horizontal one. For an input $X \in \mathbb{R}^{H \times W \times C}$, $\{X_{:,w,:} \in \mathbb{R}^{H \times C}\}_{w=1}^{W}$ is viewed as a set of sequences, where $H$ is the number of tokens in the vertical direction, $W$ is the number of sequences in the horizontal direction, and $C$ is the channel dimension. All sequences $X_{:,w,:}$ are input into the vertical BiLSTM with shared weights and hidden dimension $D$:

$$H^{\mathrm{ver}}_{:,w,:} = \mathrm{BiLSTM}(X_{:,w,:}). \tag{5}$$

In a very similar manner, $\{X_{h,:,:} \in \mathbb{R}^{W \times C}\}_{h=1}^{H}$ is viewed as a set of sequences, and all sequences $X_{h,:,:}$ are input into the horizontal BiLSTM with shared weights and hidden dimension $D$ as well:

$$H^{\mathrm{hor}}_{h,:,:} = \mathrm{BiLSTM}(X_{h,:,:}). \tag{6}$$

We combine $\{H^{\mathrm{ver}}_{:,w,:} \in \mathbb{R}^{H \times 2D}\}_{w=1}^{W}$ into $H^{\mathrm{ver}} \in \mathbb{R}^{H \times W \times 2D}$ and $\{H^{\mathrm{hor}}_{h,:,:} \in \mathbb{R}^{W \times 2D}\}_{h=1}^{H}$ into $H^{\mathrm{hor}} \in \mathbb{R}^{H \times W \times 2D}$. They are then concatenated and processed point-wise by a fully-connected layer. These processes are formulated as follows:

$$H = \mathrm{concatenate}(H^{\mathrm{ver}}, H^{\mathrm{hor}}), \qquad \hat{X} = \mathrm{FC}(H), \tag{7}$$

where $\mathrm{FC}(\cdot)$ denotes the fully-connected layer with weight $W \in \mathbb{R}^{C \times 4D}$. PyTorch-like pseudocode is shown in Appendix B.1.

BiLSTM2D is more memory-economical and throughput-efficient than the multi-head attention of ViT for high-resolution input. BiLSTM2D involves $(WC + HC)/2$-dimensional cell states, while multi-head attention involves $h(HW)^2$-dimensional attention maps, where $h$ is the number of heads. Thus, as $H$ and $W$ increase, the memory cost of an attention map grows more rapidly than the cost of a cell state. Regarding throughput, the computational complexity of self-attention is $O(W^4 C)$, whereas the computational complexity of a BiLSTM is $O(WC^2)$, where we assume $W = H$ for simplicity. There are $O(W)$ sequential operations in BiLSTM2D. Therefore, assuming a sufficiently efficient LSTM-cell implementation, such as the official PyTorch LSTMs we use, the complexity of self-attention increases much more rapidly than that of BiLSTM2D. This implies a lower throughput for attention compared to BiLSTM2D; see the experiment in Section 4.5.
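The reference PyTorch-like pseudocode is deferred to Appendix B.1 and is not reproduced here; the sketch below is our own minimal rendering of Eqs. (5)-(7) on top of `torch.nn.LSTM` (class and attribute names are ours, not necessarily those of the official implementation).

```python
import torch
from torch import nn

class BiLSTM2D(nn.Module):
    """Mixes tokens with a vertical and a horizontal BiLSTM, then fuses channels."""
    def __init__(self, channels: int, hidden: int):
        super().__init__()
        self.lstm_v = nn.LSTM(channels, hidden, batch_first=True, bidirectional=True)
        self.lstm_h = nn.LSTM(channels, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(4 * hidden, channels)    # point-wise fusion, W in R^{C x 4D}

    def forward(self, x):                            # x: (B, H, W, C)
        B, H, W, C = x.shape
        # Vertical: each of the W columns is a length-H sequence (Eq. 5).
        v = x.permute(0, 2, 1, 3).reshape(B * W, H, C)
        v, _ = self.lstm_v(v)                        # (B*W, H, 2D)
        v = v.reshape(B, W, H, -1).permute(0, 2, 1, 3)
        # Horizontal: each of the H rows is a length-W sequence (Eq. 6).
        h = x.reshape(B * H, W, C)
        h, _ = self.lstm_h(h)                        # (B*H, W, 2D)
        h = h.reshape(B, H, W, -1)
        # Concatenate and fuse point-wise (Eq. 7).
        return self.fc(torch.cat([v, h], dim=-1))    # (B, H, W, C)
```

In a Sequencer2D block, this layer plays the role of the token-mixing sub-block of Figure 2e and is paired with a channel-mixing MLP.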
Architecture variants. To compare models of different depths consisting of Sequencer2D blocks, we prepare three models with depths 18, 24, and 36, named Sequencer2D-S, Sequencer2D-M, and Sequencer2D-L, respectively. The hidden dimension is set to D = C/4. Details of these models are provided in Appendix B.2. As shown in Subsection 4.1, these architectures outperform typical models. Interestingly, however, Subsection 4.3 shows that replacing the Sequencer2D block with the simpler Vanilla Sequencer block maintains moderate accuracy. We denote such a model as Vanilla Sequencer. Note that some explicit positional information is lost in the Vanilla Sequencer because the model treats patches as a 1D sequence.

4 Experiments

In this section, we compare Sequencers with previous studies on the IN-1K benchmark [39]. We also carry out ablation studies, transfer learning studies, and analyses of the results to demonstrate the effectiveness of Sequencers. We use PyTorch [56] and the timm library [80] to implement the models in all experiments. See Appendix B for more setup details.

4.1 Scratch training on IN-1K

We use IN-1K [39], which has 1000 classes and contains 1,281,167 training images and 50,000 validation images. We adopt the AdamW optimizer [50]. Following previous work [72], we adopt a base learning rate of $\frac{\text{batch size}}{512} \times 5 \times 10^{-4}$. The batch sizes for Sequencer2D-S, Sequencer2D-M, and Sequencer2D-L are 2048, 1536, and 1024, respectively. As regularization methods, stochastic depth [30] and label smoothing [66] are employed. As data augmentation methods, mixup [87], cutout [12], CutMix [85], random erasing [88], and RandAugment [11] are applied.

Table 1: Top-1 accuracy when trained on IN-1K, comparing our models with other representative models of similar scale. Training and inference throughput and their peak memory were measured with 16 images per batch on a single V100 GPU; the left side of each slash is the value during training, the right side the value during inference. Fine-tuned models are marked with ↑. Note that Sequencer2D-L↑ is compared with Swin-B↑ and ConvNeXt-B↑, which have more parameters, since the original papers do not provide fine-tuned Swin and ConvNeXt models with a parameter count similar to Sequencer2D-L↑.

| Model | Family | Res. | #Param. | FLOPs | Throughput (image/s) | Peak Mem. (MB) | Top-1 Acc. (%) | Pre-FT Top-1 Acc. (%) |
|---|---|---|---|---|---|---|---|---|
| **Training from scratch** | | | | | | | | |
| RegNetY-4GF [59] | CNN | 224² | 21M | 4.0G | 228/823 | 1136/225 | 80.0 | – |
| ConvNeXt-T [49] | CNN | 224² | 29M | 4.5G | 337/1124 | 1418/248 | 82.1 | – |
| DeiT-S [72] | Trans. | 224² | 22M | 4.6G | 480/1569 | 1195/180 | 79.9 | – |
| Swin-T [48] | Trans. | 224² | 28M | 4.5G | 268/894 | 1613/308 | 81.2 | – |
| ViP-S/7 [28] | GMLP | 224² | 25M | 6.9G | 214/702 | 1587/195 | 81.5 | – |
| CycleMLP-B2 [7] | LMLP | 224² | 27M | 3.9G | 158/586 | 1357/234 | 81.6 | – |
| PoolFormer-S24 [83] | LMLP | 224² | 21M | 3.6G | 313/988 | 1461/183 | 80.3 | – |
| Sequencer2D-S | Seq. | 224² | 28M | 8.4G | 110/347 | 1799/196 | 82.3 | – |
| RegNetY-8GF [59] | CNN | 224² | 39M | 8.0G | 211/751 | 1776/333 | 81.7 | – |
| T2T-ViTt-19 [84] | Trans. | 224² | 39M | 9.8G | 197/654 | 3520/1140 | 82.2 | – |
| CycleMLP-B3 [7] | LMLP | 224² | 38M | 6.9G | 100/367 | 2326/287 | 82.6 | – |
| PoolFormer-S36 [83] | LMLP | 224² | 31M | 5.2G | 213/673 | 2187/220 | 81.4 | – |
| GFNet-H-S [61] | FFT | 224² | 32M | 4.5G | 227/755 | 1740/282 | 81.5 | – |
| Sequencer2D-M | Seq. | 224² | 38M | 11.1G | 83/270 | 2311/244 | 82.8 | – |
| RegNetY-12GF [59] | CNN | 224² | 46M | 12.0G | 199/695 | 2181/440 | 82.4 | – |
| ConvNeXt-S [49] | CNN | 224² | 50M | 8.7G | 212/717 | 2265/341 | 83.1 | – |
| Swin-S [48] | Trans. | 224² | 50M | 8.7G | 165/566 | 2635/390 | 83.2 | – |
| Mixer-B/16 [70] | GMLP | 224² | 59M | 12.7G | 338/1011 | 1864/407 | 76.4 | – |
| ViP-M/7 [28] | GMLP | 224² | 55M | 16.3G | 130/395 | 3095/396 | 82.7 | – |
| CycleMLP-B4 [7] | LMLP | 224² | 52M | 10.1G | 70/259 | 3272/338 | 83.0 | – |
| PoolFormer-M36 [83] | LMLP | 224² | 56M | 9.1G | 171/496 | 3191/368 | 82.1 | – |
| GFNet-H-B [61] | FFT | 224² | 54M | 8.4G | 144/482 | 2776/367 | 82.9 | – |
| Sequencer2D-L | Seq. | 224² | 54M | 16.6G | 54/173 | 3516/322 | 83.4 | – |
| **Fine-tuning** | | | | | | | | |
| ConvNeXt-B↑ [49] | CNN | 384² | 89M | 45.1G | 78/234 | 7329/870 | 85.1 (+1.3) | 83.8 |
| Swin-B↑ [48] | Trans. | 384² | 88M | 47.1G | 54/156 | 12933/1532 | 84.5 (+1.0) | 83.5 |
| GFNet-B↑ [61] | FFT | 384² | 47M | 23.2G | 137/390 | 3710/416 | 82.1 (+0.8) | 82.9 |
| Sequencer2D-L↑ | Seq. | 392² | 54M | 50.7G | 26/84 | 9062/481 | 84.6 (+1.2) | 83.4 |

Table 1 compares the proposed models with others having a comparable number of parameters, including models with local and global receptive fields such as CNNs, ViTs, and MLP-based and FFT-based models. Sequencer has the disadvantage that its throughput is slower than that of other models because it uses RNNs. In scratch training on IN-1K, however, Sequencers outperform these recent comparative models in accuracy across their parameter bands. In particular, Sequencer2D-L is competitive with recently discussed models of comparable size such as ConvNeXt-S [49] and Swin-S [48], outperforming them by 0.3% and 0.2% in accuracy, respectively. Table 1 also shows that Sequencer's throughput is not good: the inference throughput is about three times the training throughput for all these models, and compared to other models, both the measured inference and training speeds are low.

4.2 Fine-tuning on IN-1K

In this fine-tuning study, Sequencer2D-L pre-trained on IN-1K at 224² resolution is fine-tuned on IN-1K at 392² resolution. We compare it with the other models fine-tuned on IN-1K at 384² resolution. Since 392 is divisible by 14, the input at this resolution can be split into patches without padding; note that this is not the case at a resolution of 384². As Table 1 indicates, even when Sequencer is fine-tuned at higher resolution, it is competitive with the latest models such as ConvNeXt [49], Swin [48], and GFNet [61].

Table 2: Sequencer ablation experiments. We adopt the Sequencer2D-S variant for these ablation studies. C1 denotes the vertical BiLSTM, C2 the horizontal BiLSTM, and C3 the channel-fusion component. When only the vertical BiLSTM, only the horizontal BiLSTM, or the unidirectional LSTM2D variant is used, its hidden dimension is doubled from the original setting to compensate for the output dimension of the excluded LSTM and match dimensions.

(a) Components

| C1 | C2 | C3 | Acc. |
|---|---|---|---|
|  | ✓ |  | 75.6 |
| ✓ |  |  | 75.0 |
| ✓ | ✓ |  | 81.6 |
| ✓ | ✓ | ✓ | 82.3 |

(b) LSTM direction

| Bidirectional | Acc. |
|---|---|
|  | 79.7 |
| ✓ | 82.3 |

(c) Vanilla Sequencer

| Model | #Params | FLOPs | Acc. |
|---|---|---|---|
| VSequencer-S | 33M | 8.4G | 78.0 |
| VSequencer(H)-S | 28M | 8.4G | 78.8 |
| VSequencer(PE)-S | 33M | 8.4G | 78.1 |
| Sequencer2D-S | 28M | 8.4G | 82.3 |

(d) Hidden dimension

| Hidden dim. ratio | #Params | FLOPs | Acc. |
|---|---|---|---|
| 1x | 28M | 8.4G | 82.3 |
| 2x | 45M | 13.9G | 82.6 |

(e) Various RNNs

| Model | #Params | FLOPs | Acc. |
|---|---|---|---|
| RNN-Sequencer2D | 19M | 5.8G | 80.6 |
| GRU-Sequencer2D | 25M | 7.5G | 82.3 |
| Sequencer2D-S | 28M | 8.4G | 82.3 |
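As a small worked illustration of the recipe above (our own arithmetic, not code from the paper), the linear learning-rate rule of Section 4.1 and the divisibility argument behind the 392² fine-tuning resolution work out as follows:

```python
# Linear scaling rule from Section 4.1: base_lr = batch_size / 512 * 5e-4.
for name, bs in [("Sequencer2D-S", 2048), ("Sequencer2D-M", 1536), ("Sequencer2D-L", 1024)]:
    print(f"{name}: batch size {bs} -> base lr {bs / 512 * 5e-4:.1e}")
# Sequencer2D-S: batch size 2048 -> base lr 2.0e-03
# Sequencer2D-M: batch size 1536 -> base lr 1.5e-03
# Sequencer2D-L: batch size 1024 -> base lr 1.0e-03

# Section 4.2: the input must split evenly into the token grid, i.e. be a multiple of 14.
print(392 % 14, 384 % 14)   # -> 0 6: 392^2 needs no padding, 384^2 would.
```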
4.3 Ablation studies

This subsection presents ablation studies based on Sequencer2D-S for a further understanding of Sequencer. We seek to clarify the effectiveness and validity of the Sequencer architecture in terms of the importance of each component, the necessity of bidirectionality, the setting of the hidden dimension, and the comparison with a simple BiLSTM.

We first show where and how relevant the components of BiLSTM2D are. The BiLSTM2D layer is composed of a vertical BiLSTM, a horizontal BiLSTM, and a channel-fusion element, and we want to verify the validity of each. For this purpose, we examine the removal of channel fusion and of the vertical or horizontal BiLSTM. Table 2a shows the results. Removing channel fusion degrades the performance from 82.3% to 81.6%. Furthermore, the additional removal of the vertical or horizontal BiLSTM causes a 6.0% or 6.6% performance drop, respectively. Hence, each component discussed here is necessary for Sequencer2D.

We next show that bidirectionality in BiLSTM2D is important for Sequencer. We compare Sequencer2D-S with a version that replaces the vertical and horizontal BiLSTMs with vertical and horizontal unidirectional LSTMs. Table 2b shows that the unidirectional model is 2.6% less accurate than the bidirectional one. This result attests to the significance of using BiLSTM rather than unidirectional LSTM.

It is important to set the hidden dimension of the LSTM to a reasonable size. As described in Subsection 3.2, Sequencer2D sets the hidden dimension of the BiLSTM to D = C/4, but this is not strictly necessary if the model has channel fusion. Table 2d compares Sequencer2D-S with a model with increased D. Although accuracy improves by 0.3%, FLOPs increase by 65% and the number of parameters by 60%; in other words, the accuracy gain does not justify the increase in FLOPs. Moreover, the increased dimension causes overfitting, which is discussed in Appendix C.3.

Vanilla Sequencer can also achieve accuracy that outperforms MLP-Mixer [70], but it is not as accurate as Sequencer2D. The following experimental results support this claim. We experiment with Sequencer2D-S variants in which Vanilla Sequencer blocks replace the Sequencer2D blocks, called VSequencer(H)-S, which have incomplete positional information. In addition, we experiment with a variant of VSequencer(H)-S without the hierarchical structure, which we call VSequencer-S. VSequencer(PE)-S is VSequencer-S with the ViT-style learned positional embedding (PE) [16]. Table 2c indicates the effectiveness of combining LSTM with a ViT-like architecture. Surprisingly, even for Vanilla Sequencer and Vanilla Sequencer(H) without PE, the performance reduction from Sequencer2D-S is only 4.3% and 3.5%, respectively. According to these results, there is no doubt that a Vanilla Sequencer using BiLSTMs is already significant, although not as accurate as Sequencer2D.

All LSTMs in the BiLSTM2D layer can be replaced with other recurrent networks such as gated recurrent units (GRUs) [8] or tanh-RNNs to define a BiGRU2D layer or a BiRNN2D layer. We also trained these models on IN-1K; see Table 2e for the results. The table suggests that all of these variants, including the RNN cell, work well. The tanh-RNN performs slightly worse than the others, probably due to its lower ability to model long-range dependencies.

Table 3: Left: results on transfer learning. We transfer models trained on IN-1K to datasets from different domains. Sequencers use 224² images, while ViT-B/16 and EfficientNet-B7 work at higher resolution; see the Res. column. Right: semantic segmentation results on ADE20K [89]. All models are Semantic FPN [36] based; we report mIoU on the ADE20K validation set.

| Model | Res. | #Param. | FLOPs | CIFAR-10 | CIFAR-100 | Flowers | Cars |
|---|---|---|---|---|---|---|---|
| ResNet50 [22] | 224² | 26M | 4.1G | – | – | 96.2 | 90.0 |
| EN-B7 [67] | 600² | 66M | 37.0G | 98.9 | 91.7 | 98.8 | 94.7 |
| ViT-B/16 [16] | 384² | 86M | 55.4G | 98.1 | 87.1 | 89.5 | – |
| DeiT-B [72] | 224² | 86M | 17.5G | 99.1 | 90.8 | 98.4 | 92.1 |
| CaiT-S-36 [73] | 224² | 68M | 13.9G | 99.2 | 92.2 | 98.8 | 93.5 |
| ResMLP-24 [71] | 224² | 30M | 6.0G | 98.7 | 89.5 | 97.9 | 89.5 |
| GFNet-H-B [61] | 224² | 54M | 8.6G | 99.0 | 90.3 | 98.8 | 93.2 |
| Sequencer2D-S | 224² | 28M | 8.4G | 99.0 | 90.6 | 98.2 | 93.1 |
| Sequencer2D-M | 224² | 38M | 11.1G | 99.1 | 90.8 | 98.2 | 93.3 |
| Sequencer2D-L | 224² | 54M | 16.6G | 99.1 | 91.2 | 98.6 | 93.1 |

| Model | #Param. | mIoU |
|---|---|---|
| PVT-Small [79] | 28M | 39.8 |
| PoolFormer-S24 [83] | 23M | 40.3 |
| Sequencer2D-S | 32M | 46.1 |
| PVT-Medium [79] | 48M | 41.6 |
| PoolFormer-S36 [83] | 35M | 42.0 |
| Sequencer2D-M | 42M | 47.3 |
| PVT-Large [79] | 65M | 42.1 |
| PoolFormer-M36 [83] | 60M | 42.4 |
| Sequencer2D-L | 58M | 48.6 |

4.4 Transfer learning and semantic segmentation

Sequencers perform well on IN-1K, and they also have good transferability; in other words, they generalize well to new domains, as shown below. We use the commonly used CIFAR-10 [38], CIFAR-100 [38], Flowers-102 [55], and Stanford Cars [37] datasets for this experiment; see the references and Appendix B.4 for details on the datasets. The results of the proposed models and the results reported in previous studies for models with comparable capacity are presented in Table 3. In particular, Sequencer2D-L achieves results that are competitive with CaiT-S-36 [73] and EfficientNet-B7 [67].

We also experiment with semantic segmentation on the ADE20K [89] dataset; see Appendix C.4 for details on the setup. Sequencer outperforms PVT [79] and PoolFormer [83] with similar parameters; compared to PoolFormer, the mIoU is about 6 points higher. In addition, we have investigated a common object detection model with Sequencer as the backbone. Its performance is not much different from that with a ResNet [22] backbone; improving it is left for future work. See Appendix C.5.

4.5 Analysis and visualization

In this subsection, we investigate the properties of Sequencer in terms of resolution adaptability and efficiency. Furthermore, the effective receptive field (ERF) [51] and visualization of the hidden states provide insight into how Sequencer recognizes images.

One of the attractive properties of Sequencer is its flexible adaptability to resolution, with minimal impact on accuracy even when the resolution of the input image is varied from one half to twice the training resolution. In comparison, architectures like MLP-Mixer [70] have a fixed input resolution, and GFNet [61] requires interpolation of weights in the Fourier domain when the input resolution differs from that of the training images. We evaluate the resolution adaptability of the models by feeding images of different resolutions to each model, without fine-tuning, using weights pre-trained on IN-1K at 224² resolution; a minimal sketch of this evaluation protocol is given below. Figure 3a compares absolute top-1 accuracy on IN-1K, and Figure 3b compares accuracy relative to that at 224² resolution. By increasing the resolution in steps of 28 for Sequencer2D-S and of 32 for the other models, we avoid padding and prevent its effect on accuracy. Compared to DeiT-S [72], GFNet-S [61], CycleMLP-B2 [7], and ConvNeXt-T [49], Sequencer2D-S's performance is more sustainable. The relative accuracy is consistently better than that of ConvNeXt [49], which is strong in the lower-resolution band, and, at 448² resolution, 0.6% higher than that of CycleMLP [7], which is strong in the double-resolution band. It is noteworthy that Sequencer continues to maintain high accuracy at double resolution.
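As a rough illustration of this protocol (not the authors' evaluation code), the sketch below feeds one pre-trained checkpoint images at several resolutions without any fine-tuning. The timm model name is an assumption, and the dummy image stands in for IN-1K validation data.

```python
import torch, timm
from PIL import Image
from timm.data import resolve_data_config, create_transform

# Assumed model name for timm-distributed Sequencer weights; any 224^2-pretrained
# classifier whose forward pass accepts variable input sizes can be substituted.
model = timm.create_model("sequencer2d_s", pretrained=True).eval()
cfg = resolve_data_config({}, model=model)
img = Image.new("RGB", (512, 512))        # stand-in for a validation image

for res in (112, 224, 392, 448):          # multiples of 14, so no padding is needed
    cfg["input_size"] = (3, res, res)
    x = create_transform(**cfg)(img).unsqueeze(0)
    with torch.no_grad():
        pred = model(x).softmax(-1).argmax(-1).item()
    print(f"{res}x{res}: predicted class {pred}")
```

Top-1 accuracy is then computed over the whole validation set at each resolution, as plotted in Figures 3a and 3b.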
The higher the input resolution, the better the memory efficiency and throughput of Sequencer compared to DeiT [72]. Figures 3c and 3d show the efficiency of Sequencer2D-S compared to DeiT-S and ConvNeXt-T [49]. Memory consumption increases rapidly in DeiT-S and ConvNeXt-T with increasing input resolution, but increases more gradually in Sequencer2D-S. The result strongly implies that Sequencer has more practical potential than the ViTs as the resolution increases. At a resolution of 224², it is behind DeiT in throughput, but it pulls ahead of DeiT when images at a resolution of 896² are input.

Figure 3: Top: resolution adaptability — (a) absolute top-1 accuracy and (b) top-1 accuracy relative to 224² resolution. Every model is trained at 224² resolution and evaluated at various resolutions with no fine-tuning. Bottom: comparisons among Sequencer2D-S, DeiT-S [72], and ConvNeXt-T [49] in (c) GPU peak memory and (d) throughput for different input image resolutions, measured at resolution increments of 224; points are not plotted when GPU memory is exhausted. The measurements use a batch size of 16 and a single V100.

Figure 4: Part of the states of the last BiLSTM2D layer in the Sequencer block of stage 1. From top to bottom: outputs of the vertical LSTM, the horizontal LSTM, and the channel fusion, and the original images.

In general, CNNs have localized receptive fields that expand layer by layer, and ViTs without shifted windows capture global dependencies through the self-attention mechanism. In contrast, it is not obvious how information is processed in the Sequencer block. We therefore calculated the ERF [51] for ResNet-50 [22], DeiT-S [72], and Sequencer2D-S, as shown in Figure 5; a sketch of this ERF estimation is given at the end of this subsection. The ERFs of Sequencer2D-S form a cruciform shape in all layers. This trend distinguishes it from well-known models such as DeiT-S and ResNet-50. More remarkably, in shallow layers, Sequencer2D-S has a wider ERF than ResNet-50, although not as wide as DeiT. This observation confirms that the LSTMs in Sequencer can model long-term dependencies as expected and that Sequencer recognizes sufficiently long vertical or horizontal regions. Thus, it can be argued that Sequencer recognizes an image in a very different way than CNNs or ViTs. For more details on ERFs and additional visualizations, see Appendix D.

Moreover, we also visualized a hidden state of the vertical and horizontal BiLSTMs and a feature map after channel fusion; the results are shown in Figure 4. They demonstrate that the hidden states of Sequencer interact with each other along the vertical and horizontal directions. The closer tokens are in position, the stronger their interaction tends to be; the farther apart they are, the weaker it tends to be.

Figure 5: ERFs of Sequencer2D-S and comparative models (ResNet-50 and DeiT-S): (a) RN50/first, (b) DeiT-S/first, (c) Seq-S/first, (d) RN50/last, (e) DeiT-S/last, (f) Seq-S/last. The left of each slash denotes the model, and the right denotes the location of the block whose output is used to generate the ERF. The ERFs are rescaled to [0, 1]: the brighter and closer to 1 a region is, the more influential it is; the darker and closer to 0, the less influential.
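The ERFs above follow Luo et al. [51]. A common way to approximate an ERF (our own sketch, not necessarily the authors' exact procedure) is to backpropagate a unit gradient from the central location of an intermediate feature map and average the absolute input gradients:

```python
import torch

def effective_receptive_field(feature_extractor, images):
    """Approximate the ERF as in [51]: gradient of the central activation of a
    spatial feature map w.r.t. the input, |.|-averaged over batch and channels.
    `feature_extractor` must return a (B, C, h, w) map, e.g. a truncated backbone."""
    x = images.clone().requires_grad_(True)
    feat = feature_extractor(x)                      # (B, C, h, w)
    _, _, h, w = feat.shape
    grad_seed = torch.zeros_like(feat)
    grad_seed[:, :, h // 2, w // 2] = 1.0            # unit gradient at the center only
    feat.backward(grad_seed)
    erf = x.grad.abs().mean(dim=(0, 1))              # (H, W) input saliency
    return erf / erf.max()                           # rescaled to [0, 1] as in Figure 5
```

Running this on the outputs of the first and last blocks of each backbone (with the classifier removed, and token sequences reshaped to a spatial map where necessary) yields per-block visualizations of the kind shown in Figure 5.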
5 Conclusions

We have proposed a novel and simple architecture that leverages LSTM for computer vision. We demonstrated that modeling with LSTM instead of the self-attention layer can achieve performance competitive with current state-of-the-art models. Our experiments show that Sequencer offers good memory/accuracy and parameter/accuracy trade-offs, comparable to the main existing methods. Despite the impact of recursion on throughput, we have demonstrated benefits that offset it. We believe that these results raise a number of interesting issues. Improving Sequencer's poor throughput is one example. Moreover, we expect that investigating the internal mechanisms of our model with methods other than ERF will further our understanding of how this architecture works. In addition, it would be important to analyze in more detail the features learned by Sequencer in comparison to other architectures. We hope this will lead to a better understanding of the role of various inductive biases in computer vision. Furthermore, we expect that our results will trigger further studies beyond this domain and research area. In particular, it would be a very interesting open question whether such a design works for time-series data in vision, such as video, or in a multi-modal setting combined with another modality, such as video with audio.

Acknowledgments and Disclosure of Funding

Our colleagues at AnyTech Co., Ltd. provided valuable comments on early versions and encouragement, and we thank them for their cooperation. In particular, we thank Atsushi Fukuda for organizing discussion opportunities. We also thank the people who support us at the Graduate School of Artificial Intelligence and Science, Rikkyo University.

References

[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. In NeurIPS, 2016. [2] Lucas Beyer, Olivier J. Hénaff, Alexander Kolesnikov, Xiaohua Zhai, and Aäron van den Oord. Are we done with ImageNet? arXiv preprint arXiv:2006.07159, 2020. [3] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In NeurIPS, volume 33, pages 1877-1901, 2020. [4] Wonmin Byeon, Thomas M Breuel, Federico Raue, and Marcus Liwicki. Scene labeling with LSTM recurrent neural networks. In CVPR, pages 3547-3555, 2015. [5] Chun-Fu Richard Chen, Quanfu Fan, and Rameswar Panda. CrossViT: Cross-attention multi-scale vision transformer for image classification. In ICCV, pages 357-366, 2021. [6] Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In ICML, pages 1691-1703. PMLR, 2020. [7] Shoufa Chen, Enze Xie, Chongjian Ge, Runjian Chen, Ding Liang, and Ping Luo. CycleMLP: A MLP-like architecture for dense prediction. In ICLR, 2022. [8] Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103-111, 2014. [9] François Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, pages 1251-1258, 2017. [10] James W Cooley and John W Tukey. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19(90):297-301, 1965.
[11] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Rand Augment: Practical automated data augmentation with a reduced search space. In CVPRW, pages 702 703, 2020. [12] Terrance De Vries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. ar Xiv preprint ar Xiv:1708.04552, 2017. [13] Xiaohan Ding, Honghao Chen, Xiangyu Zhang, Jungong Han, and Guiguang Ding. Repmlpnet: Hierarchi- cal vision mlp with re-parameterized locality. In CVPR, 2022. [14] Xiaohan Ding, Xiangyu Zhang, Yizhuang Zhou, Jungong Han, Guiguang Ding, and Jian Sun. Scaling up your kernels to 31x31: Revisiting large kernel design in cnns. In CVPR, 2022. [15] Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, and Baining Guo. Cswin transformer: A general vision transformer backbone with cross-shaped windows. In CVPR, 2022. [16] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021. [17] Amir Gholami, Kiseok Kwon, Bichen Wu, Zizheng Tai, Xiangyu Yue, Peter Jin, Sicheng Zhao, and Kurt Keutzer. Squeezenext: Hardware-aware neural network design. In CVPRW, pages 1638 1647, 2018. [18] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2015. [19] Alex Graves, Santiago Fernández, and Jürgen Schmidhuber. Multi-dimensional recurrent neural networks. In International conference on artificial neural networks, pages 549 558. Springer, 2007. [20] Alex Graves and Jürgen Schmidhuber. Offline handwriting recognition with multidimensional recurrent neural networks. In Neur IPS, volume 21, 2008. [21] Jianyuan Guo, Yehui Tang, Kai Han, Xinghao Chen, Han Wu, Chao Xu, Chang Xu, and Yunhe Wang. Hire-mlp: Vision mlp via hierarchical rearrangement. ar Xiv preprint ar Xiv:2108.13341, 2021. [22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770 778, 2016. [23] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In ICCV, pages 8340 8349, 2021. [24] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In ICLR, 2018. [25] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). ar Xiv preprint ar Xiv:1606.08415, [26] Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In CVPR, pages 15262 15271, 2021. [27] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735 1780, [28] Qibin Hou, Zihang Jiang, Li Yuan, Ming-Ming Cheng, Shuicheng Yan, and Jiashi Feng. Vision permutator: A permutable mlp-like architecture for visual recognition. IEEE TPAMI, 2022. [29] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, pages 4700 4708, 2017. [30] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In ECCV, pages 646 661, 2016. [31] Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. Ccnet: Criss-cross attention for semantic segmentation. 
In ICCV, pages 603 612, 2019. [32] Nal Kalchbrenner, Ivo Danihelka, and Alex Graves. Grid long short-term memory. In In Proceedings of the IEEE Workshop on Spoken Language Technology, 2015. [33] Jungo Kasai, Hao Peng, Yizhe Zhang, Dani Yogatama, Gabriel Ilharco, Nikolaos Pappas, Yi Mao, Weizhu Chen, and Noah A Smith. Finetuning pretrained transformers into rnns. In EMNLP, pages 10630 10643, 2021. [34] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In ICML, pages 5156 5165. PMLR, 2020. [35] Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, pages 4171 4186, 2019. [36] Alexander Kirillov, Ross Girshick, Kaiming He, and Piotr Dollár. Panoptic feature pyramid networks. In CVPR, pages 6399 6408, 2019. [37] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3D object representations for fine-grained categorization. In ICCVW, pages 554 561, 2013. [38] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. [39] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Neur IPS, volume 25, pages 1097 1105, 2012. [40] Hai Lan, Xihao Wang, and Xian Wei. Couplformer: Rethinking vision transformer with coupling attention map. ar Xiv preprint ar Xiv:2112.05425, 2021. [41] Yann Le Cun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998. [42] Dongze Lian, Zehao Yu, Xing Sun, and Shenghua Gao. As-mlp: An axial shifted mlp architecture for vision. In ICLR, 2022. [43] Xiaodan Liang, Xiaohui Shen, Donglai Xiang, Jiashi Feng, Liang Lin, and Shuicheng Yan. Semantic object parsing with local-global long short-term memory. In CVPR, pages 3185 3193, 2016. [44] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, pages 2980 2988, 2017. [45] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, pages 740 755, 2014. [46] Hanxiao Liu, Zihang Dai, David So, and Quoc Le. Pay attention to mlps. In Neur IPS, volume 34, 2021. [47] Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. Swin transformer v2: Scaling up capacity and resolution. In CVPR, 2022. [48] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021. [49] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In CVPR, 2022. [50] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019. [51] Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard Zemel. Understanding the effective receptive field in deep convolutional neural networks. In Neur IPS, volume 29, 2016. [52] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In ICLR, 2018. [53] Xiaofeng Mao, Gege Qi, Yuefeng Chen, Xiaodan Li, Ranjie Duan, Shaokai Ye, Yuan He, and Hui Xue. 
Towards robust vision transformer. ar Xiv preprint ar Xiv:2105.07926, 2021. [54] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In Neur IPS Workshop, 2011. [55] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In ICVGIP, pages 722 729, 2008. [56] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. In Neur IPS, volume 32, 2019. [57] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. Technical report, Open AI, 2018. [58] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. Open AI blog, 1(8):9, 2019. [59] Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár. Designing network design spaces. In CVPR, pages 10428 10436, 2020. [60] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 21:1 67, 2020. [61] Yongming Rao, Wenliang Zhao, Zheng Zhu, Jiwen Lu, and Jie Zhou. Global filter networks for image classification. In Neur IPS, volume 34, 2021. [62] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to imagenet? In ICML, pages 5389 5400. PMLR, 2019. [63] Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. IEEE TSP, 45(11):2673 2681, 1997. [64] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recogni- tion. In ICLR, 2015. [65] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, pages 1 9, 2015. [66] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, pages 2818 2826, 2016. [67] Mingxing Tan and Quoc Le. Efficient Net: Rethinking model scaling for convolutional neural networks. In ICML, pages 6105 6114, 2019. [68] Chuanxin Tang, Yucheng Zhao, Guangting Wang, Chong Luo, Wenxuan Xie, and Wenjun Zeng. Sparse mlp for image recognition: Is self-attention really necessary? ar Xiv preprint ar Xiv:2109.05422, 2021. [69] Yuki Tatsunami and Masato Taki. Raftmlp: How much can be done without attention and with less spatial locality? ar Xiv preprint ar Xiv:2108.04384, 2021. [70] Ilya O Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, et al. Mlp-mixer: An all-mlp architecture for vision. In Neur IPS, volume 34, 2021. [71] Hugo Touvron, Piotr Bojanowski, Mathilde Caron, Matthieu Cord, Alaaeldin El-Nouby, Edouard Grave, Armand Joulin, Gabriel Synnaeve, Jakob Verbeek, and Hervé Jégou. Resmlp: Feedforward networks for image classification with data-efficient training. ar Xiv preprint ar Xiv:2105.03404, 2021. [72] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. 
Training data-efficient image transformers & distillation through attention. In ICML, 2021. [73] Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going deeper with image transformers. In ICCV, pages 32 42, 2021. [74] Aaron Van Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In International conference on machine learning, pages 1747 1756. PMLR, 2016. [75] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Neur IPS, volume 30, 2017. [76] Francesco Visin, Marco Ciccone, Adriana Romero, Kyle Kastner, Kyunghyun Cho, Yoshua Bengio, Matteo Matteucci, and Aaron Courville. Reseg: A recurrent neural network-based model for semantic segmentation. In CVPRW, pages 41 48, 2016. [77] Francesco Visin, Kyle Kastner, Kyunghyun Cho, Matteo Matteucci, Aaron Courville, and Yoshua Ben- gio. Renet: A recurrent neural network based alternative to convolutional networks. ar Xiv preprint ar Xiv:1505.00393, 2015. [78] Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power. In Neur IPS, volume 32, 2019. [79] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In ICCV, 2021. [80] Ross Wightman. Pytorch image models. https://github.com/rwightman/pytorch-image-models, [81] Tan Yu, Xu Li, Yunfeng Cai, Mingming Sun, and Ping Li. S2-mlpv2: Improved spatial-shift mlp architecture for vision. ar Xiv preprint ar Xiv:2108.01072, 2021. [82] Tan Yu, Xu Li, Yunfeng Cai, Mingming Sun, and Ping Li. S2-mlp: Spatial-shift mlp architecture for vision. In WACV, pages 297 306, 2022. [83] Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, and Shuicheng Yan. Metaformer is actually what you need for vision. In CVPR, 2022. [84] Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zi-Hang Jiang, Francis EH Tay, Jiashi Feng, and Shuicheng Yan. Tokens-to-token vit: Training vision transformers from scratch on imagenet. In ICCV, pages 558 567, 2021. [85] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In ICCV, pages 6023 6032, 2019. [86] David Junhao Zhang, Kunchang Li, Yunpeng Chen, Yali Wang, Shashwat Chandra, Yu Qiao, Luoqi Liu, and Mike Zheng Shou. Morphmlp: A self-attention free, mlp-like backbone for image and video. ar Xiv preprint ar Xiv:2111.12527, 2021. [87] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR, 2018. [88] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. In AAAI, volume 34, pages 13001 13008, 2020. [89] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In CVPR, pages 633 641, 2017. [90] Daquan Zhou, Bingyi Kang, Xiaojie Jin, Linjie Yang, Xiaochen Lian, Zihang Jiang, Qibin Hou, and Jiashi Feng. Deepvit: Towards deeper vision transformer. ar Xiv preprint ar Xiv:2103.11886, 2021. [91] Daquan Zhou, Yujun Shi, Bingyi Kang, Weihao Yu, Zihang Jiang, Yuan Li, Xiaojie Jin, Qibin Hou, and Jiashi Feng. Refiner: Refining self-attention for vision transformers. 
ar Xiv preprint ar Xiv:2106.03714, 2021.