Published as a conference paper at ICLR 2022

AUTO-SCALING VISION TRANSFORMERS WITHOUT TRAINING

Wuyang Chen1, Wei Huang2, Xianzhi Du3, Xiaodan Song3, Zhangyang Wang1, Denny Zhou3
1University of Texas at Austin  2University of Technology Sydney  3Google
{wuyang.chen,atlaswang}@utexas.edu, weihuang.uts@gmail.com, {xianzhi,xiaodansong,dennyzhou}@google.com

ABSTRACT

This work targets the automated design and scaling of Vision Transformers (ViTs). The motivation comes from two pain points: 1) the lack of efficient and principled methods for designing and scaling ViTs; 2) the tremendous computational cost of training ViTs, which is much heavier than that of their convolutional counterparts. To tackle these issues, we propose As-ViT, an auto-scaling framework for ViTs without training, which automatically discovers and scales up ViTs in an efficient and principled manner. Specifically, we first design a seed ViT topology by leveraging a training-free search process. This extremely fast search is enabled by a comprehensive study of ViT's network complexity, yielding a strong Kendall-tau correlation with ground-truth accuracies. Second, starting from the seed topology, we automate the scaling rule for ViTs by growing the widths/depths of different ViT layers, which produces a series of architectures with different numbers of parameters in a single run. Finally, based on the observation that ViTs can tolerate coarse tokenization in early training stages, we propose a progressive tokenization strategy to train ViTs faster and more cheaply. As a unified framework, As-ViT achieves strong performance on classification (83.5% top-1 on ImageNet-1k) and detection (52.7% mAP on COCO) without any manual crafting or scaling of ViT architectures: the end-to-end model design and scaling process costs only 12 hours on one V100 GPU. Our code is available at https://github.com/VITA-Group/AsViT.

1 INTRODUCTION

Transformer (Vaswani et al., 2017), a family of architectures based on the self-attention mechanism, is notable for modeling long-range dependencies in data. The success of transformers has extended from natural language processing to computer vision. Recently, the Vision Transformer (ViT) (Dosovitskiy et al., 2020), a transformer architecture consisting of self-attention encoder blocks, was proposed and achieves performance competitive with convolutional neural networks (CNNs) (Simonyan & Zisserman, 2014; He et al., 2016) on ImageNet (Deng et al., 2009).

However, it remains elusive how to effectively design, scale up, and train ViTs, with three important gaps remaining. First, Dosovitskiy et al. (2020) directly hard-split the 2D image into a series of local patches and learn the representation with a pre-defined number of attention heads and channel expansion ratios. This ad-hoc tokenization and embedding scheme is mainly inherited from language tasks (Vaswani et al., 2017) but is not customized for vision, which calls for more flexible and principled designs. Second, the learning behaviors of ViTs, including (loss of) feature diversity (Zhou et al., 2021), receptive fields (Raghu et al., 2021), and augmentations (Touvron et al., 2020; Jiang et al., 2021), differ vastly from those of CNNs. Benefiting from self-attention, a ViT can capture global information even with shallow layers, yet its performance quickly plateaus as the network goes deeper. Strong augmentations are also vital to keep ViTs from overfitting.
These observations indicate that ViT architectures may require uniquely customized scaling-up laws to learn a more meaningful representation hierarchy. Third, training ViTs is both data- and computation-consuming. To achieve state-of-the-art performance, ViT requires up to 300 million images and thousands of TPU-days. Although recent works attempt to improve ViT's data and resource efficiency (Touvron et al., 2020; Hassani et al., 2021; Pan et al., 2021; Chen et al., 2021d), the heavy computation cost (e.g., quadratic in the number of tokens) remains overwhelming compared with training CNNs.

We point out that the above gaps are inherently connected by the core architecture problem: how to design and scale up ViTs? Unlike a convolutional layer that directly digests raw pixels, ViTs embed coarse-level local patches as input tokens. Shall we divide an image into non-overlapping tokens of smaller size, or larger but overlapping tokens? The former could embed more visual details in each token but ignores spatial coherency, while the latter sacrifices local details but may benefit from more spatial correlations among tokens. A further question concerns ViT's depth/width trade-off: shall we prefer a wider and shallower ViT, or a narrower but deeper one? A similar dilemma also persists for ViT training: reducing the number of tokens would effectively speed up ViT training, but might sacrifice the final performance if sticking to coarse tokens from end to end.

In this work, we aim to reform the discovery of novel ViT architectures. Our framework, called As-ViT (Auto-scaling ViT), allows for extremely fast, efficient, and principled ViT design and scaling. In short, As-ViT first finds a promising seed topology for a ViT of small depth and width, then progressively grows it into different sizes (numbers of parameters) to meet different needs. Specifically, our seed ViT topology is discovered from a search space relaxed from recent manual ViT designs. To compare different topologies, we automate this process with a training-free architecture search approach and a measurement of ViT's complexity, which are extremely fast and efficient. This training-free search is supported by our comprehensive study of various network complexity metrics, where we find that the expected length distortion has the best trade-off between time cost and Kendall-tau correlation. Our seed ViT topology is then progressively scaled up from a small network to a large one, generating a series of ViT variants in a single run. At each step, the increases of depth and width are automatically and efficiently balanced by comparing network complexities. Furthermore, to address the data-hungry and heavy computation costs of ViTs, we make our ViT tokens elastic and propose a progressive re-tokenization method for efficient ViT training. We summarize our contributions below:

1. We for the first time automate both the backbone design and the scaling of ViTs. A seed ViT topology is first discovered (in only seven V100 GPU-hours), and then its depths and widths are grown with a principled scaling rule in a single run (five more V100 GPU-hours).
2. To estimate ViT's performance at initialization without any training, we conduct the first comprehensive study of ViT's network complexity measurements.
We empirically find that the expected length distortion has the best trade-off between its computation cost and its Kendall-tau correlation with ViT's ground-truth accuracy.
3. During training, we propose a progressive re-tokenization scheme via changes of dilation and stride, which proves to be a highly efficient ViT training strategy that saves up to 56.2% training FLOPs and 41.1% training time while preserving a competitive accuracy.
4. Our As-ViT achieves strong performance on classification (83.5% top-1 on ImageNet-1k) and detection (52.7% mAP on COCO).

2 WHY DO WE NEED AUTOMATED DESIGN AND SCALING PRINCIPLES FOR VIT?

Background and recent development of ViT.¹ To transform a 2D image into a sequence, ViT (Dosovitskiy et al., 2020) splits each image into 14×14 or 16×16 patches and embeds them into a fixed number of tokens; then, following the practice of transformers for language modeling, ViT applies self-attention to learn reweighting masks as relationship modeling over tokens, and leverages FFN (feed-forward network) layers to learn feature embeddings. To better facilitate visual representation learning, recent works try to train deeper ViTs (Touvron et al., 2021; Zhou et al., 2021), incorporate convolutions (Wu et al., 2021; d'Ascoli et al., 2021; Yuan et al., 2021a), and design multi-scale feature extraction (Chen et al., 2021b; Zhang et al., 2021; Wang et al., 2021).

Why manual design and scaling may be suboptimal. As the ViT architecture is still in its infancy, there is no established principle for its design and scaling. Early designs adopt large token sizes, a constant sequence length, and a constant hidden size (Dosovitskiy et al., 2020; Touvron et al., 2020), while recent trends include small patches, spatial reduction, and channel doubling (Zhou et al., 2021; Liu et al., 2021). They all achieve comparably good performance, leaving the optimal choices unclear. Moreover, the different learning behaviors of transformers relative to CNNs make the scaling law of ViTs highly unclear. Recent works (Zhou et al., 2021) demonstrated that attention maps of ViTs gradually become similar in deeper layers, leading to identical feature maps and saturated performance. ViT also generates more uniform representations across layers, enabling early aggregation of global context (Raghu et al., 2021). This is contrary to CNNs, in which deeper layers help the learning of visual global information (Chen et al., 2018). These observations all indicate that previously studied scaling laws (depth/width allocations) for CNNs (Tan & Le, 2019) may not be appropriate for ViTs.

What principle do we want? We aim to automatically design and scale up ViTs in a principled way, avoiding manual efforts and potential biases. We also want to answer two questions: 1) Does ViT have any preference in its topology (patch sizes, expansion ratios, number of attention heads, etc.)? 2) Does ViT necessarily follow the same scaling rule as CNNs?

¹We generally use the term "ViT" to indicate deep networks of self-attention blocks for vision problems. We always include a clear citation when we specifically discuss the ViT proposed by Dosovitskiy et al. (2020).
3 AUTO-DESIGN & SCALING OF VITS WITH NETWORK COMPLEXITY Image: 3 𝐻 𝑊 Patch Embedding: Kernel = 𝐾&, Stride = 4 Attention: #splits = 𝑆& FFN: expansion ratio = 𝐸& 4 L& layers Patch Re-embedding: Kernel = 𝐾,, Stride = 2 Attention: #splits = 𝑆, FFN: expansion ratio = 𝐸, 8 L, layers Patch Re-embedding: Kernel = 𝐾/, Stride = 2 Attention: #splits = 𝑆/ FFN: expansion ratio = 𝐸/ 16 L/ layers Patch Re-embedding: Kernel = 𝐾2, Stride = 2 Attention: #splits = 1 FFN: expansion ratio = 𝐸2 32 L2 layers topology search output Figure 1: Overall architecture of our As-Vi T. Blue italics indicates topology configurations to be searched (Table 1). Red indicates depth/width to be scaled-up. To accelerate in Vi T designing and avoid tedious manual efforts, we target efficient, automated, and principled search and scaling of Vi Ts. Specifically, we have two problems to solve: 1) with zero training cost (Section 3.2), how to efficiently find the optimal Vi T architecture topology (Section 3.3)? 2) how to scale-up depths and widths of the Vi T topology to meet different needs of model sizes (Section 3.4)? 3.1 EXPANDED TOPOLOGY SPACE FOR VITS Before designing and scaling, we first briefly introduce our expanded topology search space for our As-Vi T (blue italics in Figure 1). We first embed the input image into patches of a 1 4-scale resolution, and adopt a stage-wise spatial reduction and channel doubling strategy. This is for the convenience of dense prediction tasks like detection that require multi-scale features. Table 1 summarizes details of our topology space, and will be explained below. Elastic kernels. Instead of generating nonoverlapped image patches, we propose to search for the kernel size. This will enable patches to be overlapped with their neighbors, introducing more spatial correlations among tokens. Moreover, each time we downsample the spatial resolution, we also introduce overlaps when re-embedding local tokens (implemented by either a linear or a convolutional layer). Table 1: Topology Search Space for our As-Vi T. Stage Sub-space Choices #1 Kernel K1 4, 5, 6, 7, 8 Attention Splits S1 2, 4, 8 FFN Expansion E1 2, 3, 4, 5, 6 #2 Kernel K2 2, 3, 4 Attention Splits S2 1, 2, 4 FFN Expansion E2 2, 3, 4, 5, 6 #3 Kernel K3 2, 3, 4 Attention Splits S3 1, 2 FFN Expansion E3 2, 3, 4, 5, 6 #4 Kernel K4 2, 3, 4 FFN Expansion E4 2, 3, 4, 5, 6 - Num. Heads 16, 32, 64 Elastic attention splits. Splitting the attention into local windows is an important design to reduce the computation cost of self-attention without sacrificing much performance (Zaheer et al., 2020; Liu et al., 2021). Instead of using a fixed number of splits, we propose to search for elastic attention splits for each stage2. Note that we try to make our design general and do not use shifted windows (Liu et al., 2021). More search dimensions. Vi T (Dosovitskiy et al., 2020) by default leveraged an FFN layer with 4 expanded hidden dimension for each attention block. To enable a more flexible design of Vi T architectures, for each stage we further search over the FFN expansion ratio. We also search for the final number of heads for the self-attention module. 2Due to spatial reduction, the 4th stage may already reach a resolution at 7 7 on Image Net, and we set its splitting as 1. Published as a conference paper at ICLR 2022 3.2 ASSESSING VIT COMPLEXITY AT INITIALIZATION VIA MANIFOLD PROPAGATION Training Vi Ts is slow: hence an architecture search guided by evaluating trained models accuracies will be dauntingly expensive. 
3.2 ASSESSING VIT COMPLEXITY AT INITIALIZATION VIA MANIFOLD PROPAGATION

Training ViTs is slow; hence an architecture search guided by evaluating trained models' accuracies would be dauntingly expensive. We note a recent surge of training-free neural architecture search methods for ReLU-based CNNs, leveraging local linear maps (Mellor et al., 2020), gradient sensitivity (Abdelfattah et al., 2021), the number of linear regions (Chen et al., 2021e;f), or network topology (Bhardwaj et al., 2021). However, ViTs are equipped with more complex non-linear functions: self-attention, softmax, and GeLU. We therefore need to measure their learning capacity in a more general way. In our work, we measure the complexity of manifold propagation through the ViT, to estimate how complex a function the ViT can approximate. Intuitively, a complex network can propagate a simple input into a complex manifold at its output layer, and is thus likely to possess a strong learning capacity. Concretely, we study the complexity of mapping a simple circle input through the ViT: $h(\theta) = \sqrt{N}\,\big(u_0 \cos(\theta) + u_1 \sin(\theta)\big)$. Here, $N$ is the dimension of the ViT's input (e.g., $N = 3 \times 224 \times 224$ for ImageNet images), and $u_0$, $u_1$ form an orthonormal basis of a 2-dimensional subspace of $\mathbb{R}^N$ in which the circle lives. We denote the ViT network by $\mathcal{N}$, its input-output Jacobian by $v(\theta) = \partial_\theta \mathcal{N}(h(\theta))$, and $a(\theta) = \partial_\theta v(\theta)$. We compute expected complexities over a number of $\theta$s uniformly sampled from $[0, 2\pi)$. We study three types of manifold complexity:

1. Curvature is defined as the reciprocal of the radius of the osculating circle of the ViT's output manifold. Intuitively, a larger curvature indicates that $\mathcal{N}(h(\theta))$ changes fast at a certain $\theta$. Following Riemannian geometry (Lee, 2006; Poole et al., 2016), the curvature can be explicitly calculated as
$\kappa = \int \big(v(\theta) \cdot v(\theta)\big)^{-3/2} \sqrt{\big(v(\theta) \cdot v(\theta)\big)\big(a(\theta) \cdot a(\theta)\big) - \big(v(\theta) \cdot a(\theta)\big)^2}\, d\theta$.

2. Length Distortion in Euclidean space is defined as
$L^E = \frac{\mathrm{length}(\mathcal{N}(h(\theta)))}{\mathrm{length}(h(\theta))} = \int \sqrt{\|v(\theta)\|^2}\, d\theta$.
It measures, when the network takes a (unit-length) curve as input, how long the output curve is. Since the ground-truth function we want to estimate (using $\mathcal{N}$) is usually very complex, one may expect networks with better performance to also generate longer outputs.

3. The problem with $L^E$ is that a stretched output does not necessarily translate to a complex output. A simple example: even an appropriately initialized linear network can grow a straight line into a long output (i.e., a large norm of the input-output Jacobian). Therefore, one can instead use a curvature-aware length distortion that measures how quickly the normalized Jacobian $\hat{v}(\theta) = v(\theta) / \sqrt{v(\theta) \cdot v(\theta)}$ changes with respect to $\theta$:
$L^E_\kappa = \int \sqrt{\|\partial_\theta \hat{v}(\theta)\|^2}\, d\theta$.

[Figure 2: Correlations between κ, L^E, L^E_κ and the trained accuracies of ViT topologies from our search space.]

Table 2: Complexity study. τ: Kendall-tau correlation. Time: per ViT topology on average on one V100 GPU.
  Complexity | τ | Time
  κ | -0.49 | 38.3 s
  L^E | 0.49 | 12.8 s
  L^E_κ | -0.01 | 48.2 s

In our study, we aim to compare the potential of these three complexity metrics to guide ViT architecture selection. As the core of neural architecture search is to rank the performance of different architectures, we measure the Kendall-tau correlation (τ) between each metric and the models' ground-truth accuracies. We randomly sample 87 ViT topologies from Table 1 (with L1 = L2 = L3 = L4 = 1, C = 32), fully train them on ImageNet-1k for 300 epochs (following the same training recipe as DeiT (Touvron et al., 2020)), and also measure their κ, L^E, and L^E_κ at initialization.
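To make the L^E measurement concrete, the sketch below is our own minimal NumPy illustration of the definition above (not the authors' implementation): it samples θ on the input circle, pushes the points through an arbitrary network function, and finite-differences the outputs to estimate the length distortion; `net_fn` is a placeholder for any callable mapping a flattened N-dimensional input to an output vector, and averaging over random 2D subspaces stands in for averaging over network re-initializations.

```python
import numpy as np

def expected_length_distortion(net_fn, input_dim, num_theta=10, num_trials=5, seed=0):
    """Estimate L^E: length of the output curve when the input traces the circle
    h(theta) = sqrt(N) * (u0*cos(theta) + u1*sin(theta)), normalized by the input length."""
    rng = np.random.default_rng(seed)
    thetas = np.linspace(0.0, 2.0 * np.pi, num_theta, endpoint=False)
    distortions = []
    for _ in range(num_trials):
        # Random orthonormal basis (u0, u1) of a 2D subspace of R^N.
        basis, _ = np.linalg.qr(rng.standard_normal((input_dim, 2)))
        u0, u1 = basis[:, 0], basis[:, 1]
        inputs = np.sqrt(input_dim) * (np.outer(np.cos(thetas), u0) + np.outer(np.sin(thetas), u1))
        outputs = np.stack([np.asarray(net_fn(x)).ravel() for x in inputs])
        # Arc lengths of the closed output and input curves (wrap around to the first point).
        out_len = np.linalg.norm(np.diff(outputs, axis=0, append=outputs[:1]), axis=1).sum()
        in_len = np.linalg.norm(np.diff(inputs, axis=0, append=inputs[:1]), axis=1).sum()
        distortions.append(out_len / in_len)
    return float(np.mean(distortions))

if __name__ == "__main__":
    # Toy stand-in for a ViT: a random linear map followed by tanh.
    N, D = 48, 16
    W = np.random.default_rng(1).standard_normal((D, N)) / np.sqrt(N)
    print(expected_length_distortion(lambda x: np.tanh(W @ x), input_dim=N))
```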
As shown in Figure 2, both κ and L^E exhibit high Kendall-tau correlations. κ has a negative correlation, which may indicate that changes of the output manifold along the tangent direction matter more to ViT training than changes along the perpendicular direction. Meanwhile, κ costs much more computation time due to the second derivatives. We therefore choose L^E as our complexity measure for fast ViT topology search and scaling.

3.3 L^E AS REWARD FOR SEARCHING VIT TOPOLOGIES

We now propose our training-free search based on L^E (Algorithm 1). Most NAS (neural architecture search) methods evaluate the accuracies or loss values of single-path or super networks as proxies for final performance. Such training-based search suffers from even heavier computation costs when applied to ViTs. Instead of training ViTs, for each sampled architecture we calculate L^E and treat it as the reward to guide the search process. In addition to L^E, we also include the NTK condition number $\kappa_\Theta = \lambda_{\max} / \lambda_{\min}$ to indicate the trainability of ViTs (Chen et al., 2021e; Xiao et al., 2019; Yang, 2020; Hron et al., 2020), where $\lambda_{\max}$ and $\lambda_{\min}$ are the largest and smallest eigenvalues of the NTK matrix Θ.

Algorithm 1: Training-free ViT topology search.
  1: Input: RL policy π, step t = 0, total steps T.
  2: while t < T do
  3:   Sample topology a_t from π.
  4:   Calculate L^E_t and κ_{Θ,t} for a_t.
  5:   Normalize: $\tilde{L}^E_t = \frac{L^E_t - L^E_{t-1}}{\max_{t'} L^E_{t'} - \min_{t'} L^E_{t'}}$, $\tilde{\kappa}_{\Theta,t} = \frac{\kappa_{\Theta,t} - \kappa_{\Theta,t-1}}{\max_{t'} \kappa_{\Theta,t'} - \min_{t'} \kappa_{\Theta,t'}}$, $t' = 1, \dots, t$.
  6:   Update policy π using the reward $r_t = \tilde{L}^E_t - \tilde{\kappa}_{\Theta,t}$ by policy gradient (Williams, 1992).
  7:   t = t + 1.
  8: return Topology a* of highest probability under π.

Table 3: Statistics of the topology search. *Standard deviation is normalized by the mean due to different value ranges.
  Search space | Mean | Std*
  K1 | 7.3 | 0.1
  K2 | 4 | 0
  K3 | 4 | 0
  K4 | 4 | 0
  E1 | 3.3 | 0.4
  E2 | 3.9 | 0.4
  E3 | 4.2 | 0.3
  E4 | 5.2 | 0.2
  S1 | 4 | 0.6
  S2 | 2.7 | 0.5
  S3 | 1.5 | 0.3
  Head | 42.7 | 0.5

We use reinforcement learning (RL) for the search. The RL policy is formulated as a joint categorical distribution over the choices in Table 1 and is updated by policy gradient (Williams, 1992). We update our policy for 500 steps, which we observe to be enough for the policy to converge (its entropy drops from 15.3 to 5.7). The search process is extremely fast: only seven GPU-hours (V100) on ImageNet-1k, thanks to the fast calculation of L^E that bypasses ViT training. To address the different magnitudes of L^E and κ_Θ, we normalize them by their relative value ranges (line 5 in Algorithm 1).

We summarize the statistics of our ViT topology search in Table 3. We can see that L^E and κ_Θ highly prefer: (1) tokens with overlaps (K1-K4 are all larger than their strides), and (2) larger FFN expansion ratios in deeper layers (E1 < E2 < E3 < E4). No clear preference of L^E and κ_Θ is found on attention splits or the number of heads.
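Both the topology search above and the scaling procedure in the next subsection rank candidates by L^E together with the NTK condition number κ_Θ. As a rough illustration, the sketch below is our own reconstruction of how κ_Θ can be estimated empirically on a small batch (assuming a PyTorch-style model; the paper does not detail its NTK implementation): it builds the finite-width NTK from per-sample parameter gradients and returns λ_max/λ_min.

```python
import torch

def ntk_condition_number(model, inputs):
    """Empirical NTK condition number on a small batch.

    Theta[i, j] = <grad_w f(x_i), grad_w f(x_j)>, where f(x) is reduced to a
    scalar (sum of logits) for simplicity; returns lambda_max / lambda_min."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = []
    for x in inputs:                            # one sample at a time
        model.zero_grad(set_to_none=True)
        out = model(x.unsqueeze(0)).sum()       # scalar output per sample
        g = torch.autograd.grad(out, params)
        grads.append(torch.cat([gi.reshape(-1) for gi in g]))
    jac = torch.stack(grads)                    # (batch, num_params)
    ntk = jac @ jac.t()                         # (batch, batch) NTK matrix
    eigvals = torch.linalg.eigvalsh(ntk)        # ascending eigenvalues
    return (eigvals[-1] / eigvals[0]).item()

if __name__ == "__main__":
    # Toy example with a small MLP standing in for a ViT.
    toy = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.GELU(), torch.nn.Linear(64, 10))
    print(ntk_condition_number(toy, torch.randn(8, 32)))
```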
3.4 AUTOMATIC AND PRINCIPLED SCALING OF VITS

After obtaining an optimal topology, the next question is: how to balance the network's depth and width? Currently, there is no rule of thumb for ViT scaling. Recent works try to scale up or grow convolutional networks to different sizes to meet various resource constraints (Liu et al., 2019a; Tan & Le, 2019). However, automatically finding a principled scaling rule by training ViTs would incur enormous computation costs. It is also possible to search for different ViT variants directly (as in Section 3.3), but that requires multiple runs. Instead, scaling up is a more natural way to generate multiple model variants in one experiment.

We are therefore motivated to scale up our searched seed ViT to larger models in an efficient, training-free, and principled manner. Our auto-scaling method is described in Algorithm 2. The starting-point architecture has one attention block per stage and an initial hidden dimension C = 32. In each iteration, we greedily find the optimal depth and width to scale up next. For depth, we decide which stage to deepen (i.e., which stage receives one more attention block); for width, we discover the best expansion ratio (i.e., to what extent the channel number is widened). The rule for choosing how to scale up is to compare the propagation complexity across a set of scaling choices. For example, with four backbone stages (Table 1) and four channel expansion choices (widening by 5%, 10%, 15%, or 20%), we have 4×4 = 16 scaling choices in total at each step. We calculate L^E and κ_Θ after applying each choice, and the one with the best L^E/κ_Θ trade-off (minimal sum of rankings by L^E and κ_Θ) is selected for scaling up. The scaling stops once a target number of parameters is reached. In our work, we stop the scaling process once the number of parameters reaches 100 million, and the scaling takes only five GPU-hours (V100) on ImageNet-1k.

[Figure 3: Left: Comparing scaling rules from As-ViT, random scaling, Swin (Liu et al., 2021), ViT (Zhai et al., 2021), and ResNet (He et al., 2016). "Total depth" is the number of blocks (bottleneck blocks for ResNet, attention blocks for ViTs); "total width" is the sum of output channel numbers over all blocks. Grey areas indicate standard deviations over 10 runs with different random seeds. Right: During auto-scaling, both the network's complexity and its trainability improve (numbers indicate scaling-up steps; L^E: higher is better, κ_Θ: lower is better).]

The scaling trajectory is visualized in Figure 3. Comparing our automated scaling against random scaling, we find that our scaling principle prefers to sacrifice depth to win more width, keeping a shallower but wider network. Our scaling is more similar to the rule developed by Zhai et al. (2021). In contrast, ResNet and the Swin Transformer (Liu et al., 2021) choose to be narrower and deeper.

Algorithm 2: Training-free auto-scaling of ViTs.
  1: Input: seed As-ViT topology a_0, stop criterion (number of parameters) P, t = 0, channel expansion choices C = {1.05×, 1.1×, 1.15×, 1.2×} (increase the width by 5%, 10%, 15%, or 20%), depth choices D = {(+1, 0, 0, 0), (0, +1, 0, 0), (0, 0, +1, 0), (0, 0, 0, +1)} (add one more layer to one of the four stages in Table 1).
  2: while P > number of parameters of a_t do
  3:   for each scaling choice g_i ∈ C × D do
  4:     Scale up: a_{t,i} = a_t ∘ g_i (grow both the channel width and the depth).
  5:     Calculate L^E_i and κ_{Θ,i} for a_{t,i}.
  6:   Rank the scaling choices by sorting L^E_i in descending order, giving r_{L^E,i}, i = 1, ..., |C × D|.
  7:   Rank the scaling choices by sorting κ_{Θ,i} in ascending order, giving r_{κ_Θ,i}, i = 1, ..., |C × D|.
  8:   Sort the scaling choices g_i by r_{L^E,i} + r_{κ_Θ,i} in ascending order.
  9:   Select the scaling choice g*_t with the top (smallest) ranking.
  10:  a_{t+1} = a_t ∘ g*_t.
  11:  t = t + 1.
  12: return Grown ViT architectures a_1, a_2, ..., a_t.
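A compact sketch of the greedy loop in Algorithm 2 is given below (our own illustration, not the released implementation; `score_le`, `score_ntk`, and `count_params` are placeholder callables standing in for the L^E, κ_Θ, and parameter-count computations on an instantiated ViT):

```python
from itertools import product

WIDTH_CHOICES = [1.05, 1.10, 1.15, 1.20]   # widen channels by 5-20%
DEPTH_CHOICES = [(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)]  # +1 block to one stage

def auto_scale(seed_arch, score_le, score_ntk, count_params, param_budget=100e6):
    """Greedy training-free scaling: at each step try all width x depth choices,
    rank them by L^E (descending) plus NTK condition number (ascending),
    and apply the best one until the parameter budget is reached."""
    arch = dict(seed_arch)          # e.g. {"width": 32, "depths": (1, 1, 1, 1)}
    trajectory = []
    while count_params(arch) < param_budget:
        candidates = []
        for ratio, delta in product(WIDTH_CHOICES, DEPTH_CHOICES):
            cand = {
                "width": int(round(arch["width"] * ratio)),
                "depths": tuple(d + e for d, e in zip(arch["depths"], delta)),
            }
            candidates.append((cand, score_le(cand), score_ntk(cand)))
        # Larger L^E is better, smaller kappa_Theta is better.
        by_le = sorted(candidates, key=lambda c: -c[1])
        by_ntk = sorted(candidates, key=lambda c: c[2])
        rank = {id(c[0]): by_le.index(c) + by_ntk.index(c) for c in candidates}
        arch = min(candidates, key=lambda c: rank[id(c[0])])[0]
        trajectory.append(arch)
    return trajectory               # a series of grown ViT variants

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end-to-end.
    params = lambda a: a["width"] * sum(a["depths"]) * 1e4
    toy_le = lambda a: 0.1 * a["width"] + sum(a["depths"])
    toy_ntk = lambda a: sum(a["depths"]) / a["width"]
    print(auto_scale({"width": 32, "depths": (1, 1, 1, 1)}, toy_le, toy_ntk, params, 5e6)[-1])
```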
4 EFFICIENT VIT TRAINING VIA PROGRESSIVE ELASTIC RE-TOKENIZATION

[Figure 4: Progressive elastic re-tokenization. With the first projection kernel K1 = 4, the (stride, dilation) pairs (16, 5), (8, 2), and (4, 1) move from coarse sampling (few tokens, 13.2% of full FLOPs) through an intermediate granularity (28.5% FLOPs) to fine-grained sampling (full number of tokens, 100% FLOPs). By progressively changing the sampling granularity (stride and dilation) of the first linear projection layer, we reduce the spatial resolution of tokens and save training FLOPs (37.4% here), while still maintaining a competitive final performance (ImageNet-1k, 224×224). See Table 6 for more studies.]

Recent works (Jia et al., 2018; Zhou et al., 2019; Fu et al., 2020) show that one can use mixed or progressive precision for efficient training. The rationale behind this strategy is that there exist shortcuts on the network's loss landscape that can be manually created to bypass perhaps less important gradient descent steps, especially during early training phases. In a ViT, both self-attention and the FFN have computation costs that are quadratic in the number of tokens. It is therefore natural to ask: do we need full-resolution tokens during the whole training process? We provide an affirmative answer by proposing a progressive elastic re-tokenization training strategy.

To update the number of tokens during training without affecting the shape of the weights in the linear projections, we adopt different sampling granularities in the first linear projection layer. Taking the first projection kernel K1 = 4 with stride 4 as an example: during training we gradually change the (stride, dilation) pair³ of the first projection kernel to (16, 5), (8, 2), and (4, 1), keeping the shape of the weights and the architecture unchanged.

³dilation = round((stride/S₁ − 1) · K₁/(K₁ − 1)) + 1, where S₁ = 4 is the stride at the full token resolution.

Table 4: As-ViT topology and scaling rule.
  Seed topology (blue italics in Fig. 1):
    Stage #1: K = 8, S = 2, E = 3, Head = 4
    Stage #2: K = 4, S = 1, E = 2, Head = 8
    Stage #3: K = 4, S = 1, E = 4, Head = 16
    Stage #4: K = 4, S = 1, E = 6, Head = 32
  Scaling (red in Fig. 1), stage-wise depths (L1, L2, L3, L4) and width C:
    As-ViT-Small: (3, 1, 4, 2), C = 88
    As-ViT-Base: (3, 1, 5, 2), C = 116
    As-ViT-Large: (5, 2, 5, 2), C = 180

This re-tokenization strategy emulates curriculum learning for ViTs: when training begins, we introduce coarse sampling to significantly reduce the number of tokens. In other words, our As-ViT quickly learns coarse information from images in early training stages at extremely low computation cost (only 13.2% of the FLOPs of full-resolution training). Towards the late phase of training, we progressively switch to fine-grained sampling, restore the full token resolution, and maintain a competitive accuracy. As shown in Figure 4, when the ViT is trained with coarse sampling in early training phases, it can still obtain high accuracy while requiring extremely low computation cost. The transition between different sampling granularities introduces a jump in performance, and eventually the network recovers its competitive final performance.
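To illustrate these mechanics, the sketch below is our own PyTorch-style illustration (not the paper's TensorFlow implementation): it applies one and the same projection weight at the different (stride, dilation) pairs given by footnote 3, so the token grid shrinks or grows while the weight shape stays fixed.

```python
import torch
import torch.nn.functional as F

def dilation_for(stride, base_stride=4, kernel=4):
    """Footnote 3: dilation = round((stride/S1 - 1) * K1/(K1 - 1)) + 1."""
    return round((stride / base_stride - 1) * kernel / (kernel - 1)) + 1

def tokenize(x, weight, bias, stride):
    """First projection applied at a coarser or finer granularity.
    weight: (embed_dim, 3, K1, K1) -- identical at every granularity."""
    d = dilation_for(stride, base_stride=4, kernel=weight.shape[-1])
    return F.conv2d(x, weight, bias, stride=stride, dilation=d)

if __name__ == "__main__":
    x = torch.randn(1, 3, 224, 224)
    weight = torch.randn(96, 3, 4, 4)   # K1 = 4 projection to a 96-dim embedding (toy width)
    bias = torch.zeros(96)
    for s in (16, 8, 4):                # coarse -> fine schedule used during training
        tokens = tokenize(x, weight, bias, s)
        print(f"stride={s:2d} dilation={dilation_for(s)} token grid={tokens.shape[-2]}x{tokens.shape[-1]}")
```

Running the sketch reproduces the pairs (16, 5), (8, 2), (4, 1) of Figure 4, with the token grid growing from 14×14 to 28×28 to the full 56×56 on a 224×224 input.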
5 EXPERIMENTS

5.1 AS-VIT: AUTO-SCALING VIT

Table 5: Image classification on ImageNet-1k (224×224).
  Method | Params | FLOPs | Top-1
  RegNetY-4GF (Radosavovic et al., 2020) | 21.0 M | 4.0 B | 80.0%
  ViT-S (Dosovitskiy et al., 2020) | 22.1 M | 9.2 B | 81.2%
  DeiT-S (Touvron et al., 2020) | 22.0 M | 4.6 B | 79.8%
  T2T-ViT-14 (Yuan et al., 2021b) | 21.5 M | 6.1 B | 81.7%
  TNT-S (Han et al., 2021) | 23.8 M | 5.2 B | 81.5%
  PVT-Small (Wang et al., 2021) | 24.5 M | 3.8 B | 79.8%
  CaiT XS-24 (Touvron et al., 2021) | 26.6 M | 5.4 B | 81.8%
  DeepViT-S (Zhou et al., 2021) | 27 M | 6.2 B | 82.3%
  ConViT-S (d'Ascoli et al., 2021) | 27 M | 5.4 B | 81.3%
  CvT-13 (Wu et al., 2021) | 20 M | 4.5 B | 81.6%
  CvT-21 (Wu et al., 2021) | 32 M | 7.1 B | 82.5%
  Swin-T (Liu et al., 2021) | 29.0 M | 4.5 B | 81.3%
  BossNet-T0 (Li et al., 2021) | - | 3.4 B | 80.8%
  AutoFormer-s (Chen et al., 2021c) | 22.9 M | 5.1 B | 81.7%
  GLiT-Small (Chen et al., 2021a) | 24.6 M | 4.4 B | 80.5%
  As-ViT Small (ours) | 29.0 M | 5.3 B | 81.2%

  RegNetY-8GF (Radosavovic et al., 2020) | 39.0 M | 8.0 B | 81.7%
  T2T-ViT-19 (Yuan et al., 2021b) | 39.2 M | 9.8 B | 82.2%
  CaiT S-24 (Touvron et al., 2021) | 46.9 M | 9.4 B | 82.7%
  ConViT-S+ (d'Ascoli et al., 2021) | 48 M | 10 B | 82.2%
  ViT-S/16 (Dosovitskiy et al., 2020) | 48.6 M | 20.2 B | 78.1%
  Swin-S (Liu et al., 2021) | 50.0 M | 8.7 B | 83.0%
  DeepViT-L (Zhou et al., 2021) | 55 M | 12.5 B | 82.2%
  PVT-Medium (Wang et al., 2021) | 44.2 M | 6.7 B | 81.2%
  PVT-Large (Wang et al., 2021) | 61.4 M | 9.8 B | 81.7%
  T2T-ViT-24 (Yuan et al., 2021b) | 64.1 M | 15.0 B | 82.6%
  TNT-B (Han et al., 2021) | 65.6 M | 14.1 B | 82.8%
  BossNet-T1 (Li et al., 2021) | - | 7.9 B | 82.2%
  AutoFormer-b (Chen et al., 2021c) | 54 M | 11 B | 82.4%
  ViT-ResNAS-t (Liao et al., 2021) | 41 M | 1.8 B | 80.8%
  ViT-ResNAS-s (Liao et al., 2021) | 65 M | 2.8 B | 81.4%
  As-ViT Base (ours) | 52.6 M | 8.9 B | 82.5%

  RegNetY-16GF (Radosavovic et al., 2020) | 84.0 M | 16.0 B | 82.9%
  ViT-B/16 (Dosovitskiy et al., 2020) | 86.0 M | 55.4 B | 77.9%
  DeiT-B (Touvron et al., 2020) | 86.0 M | 17.5 B | 81.8%
  ConViT-B (d'Ascoli et al., 2021) | 86 M | 17 B | 82.4%
  Swin-B (Liu et al., 2021) | 88.0 M | 15.4 B | 83.3%
  GLiT-Base (Chen et al., 2021a) | 96.1 M | 17.0 B | 82.3%
  ViT-ResNAS-m (Liao et al., 2021) | 97 M | 4.5 B | 82.4%
  CaiT S-48 (Touvron et al., 2021) | 89.5 M | 18.6 B | 83.5%
  As-ViT Large (ours) | 88.1 M | 22.6 B | 83.5%
  † Under 384×384 resolution.

We show our searched As-ViT topology in Table 4. This architecture facilitates strong overlaps among tokens during both the first projection ("tokenization") step and the three re-embedding steps. FFN expansion ratios are narrow at first and become wider in deeper layers. A small number of attention splits is leveraged for better aggregation of global information. The seed topology is automatically scaled up, and three As-ViT variants of sizes comparable to previous works are benchmarked. Our scaling rule prefers shallower and wider networks, with layers more evenly balanced across the different resolution stages.

5.2 IMAGE CLASSIFICATION

Settings. We benchmark our As-ViT on ImageNet-1k (Deng et al., 2009). We use TensorFlow and Keras for the training implementation and conduct all training on TPUs. We set the default image size to 224×224, and use AdamW (Loshchilov & Hutter, 2017) as the optimizer with cosine learning rate decay (Loshchilov & Hutter, 2016). A batch size of 1024, an initial learning rate of 0.001, and a weight decay of 0.05 are adopted.

Table 5 compares our As-ViT with other models. Compared with previous Transformer-based and CNN-based architectures, As-ViT achieves state-of-the-art performance with a comparable number of parameters and FLOPs.
More importantly, our As-ViT framework achieves competitive or stronger performance than concurrent NAS works for ViTs, with much better search efficiency. As-ViTs are designed with highly reduced human and NAS effort: all three As-ViT variants are generated in only 12 GPU-hours (on a single V100 GPU). In contrast, BossNAS (Li et al., 2021) requires 10 GPU-days to search a single architecture. For each variant of ViT-ResNAS (Liao et al., 2021), the super-network training takes 16.7-21 hours, followed by another 5.5-6 hours of evolutionary search.

Table 6: Efficient training on ImageNet-1k (224×224) via the progressive elastic re-tokenization strategy (Section 4). "4×" (resp. "2×") indicates that we reduce the number of tokens by 4 (resp. 2) times, and "N/A" indicates no token reduction.
  Epochs at 4× | Epochs at 2× | Epochs at N/A | FLOPs saving | Training time (TPU-days) | Top-1 acc.
  1-40 | 41-70 | 71-300 | 18.7% | 36.9 | 83.1%
  1-80 | 81-140 | 141-300 | 37.4% | 31.0 | 82.9%
  1-120 | 121-210 | 211-300 | 56.2% | 25.2 | 82.5%
  Baseline (no reduction, 100% FLOPs) | 42.8 | 83.5%

Efficient Training. We leverage the progressive elastic re-tokenization strategy proposed in Section 4 to reduce both the FLOPs and the training time of large ViT models. As illustrated in Figure 4, we progressively apply 4× and 2× reductions of the number of tokens during training by changing both the dilation and the stride of the first linear projection layer. We tune the epochs allocated to each token-reduction stage and show the results in Table 6. Standard training takes 42.8 TPU-days, whereas our efficient training saves up to 56.2% training FLOPs and 41.1% training TPU-days, while still achieving a strong accuracy.

Table 7: Decoupling the contributions of the seed topology and the scaling, on ImageNet-1k.
  Model | Params | FLOPs | Top-1
  As-ViT Topology | 2.4 M | 0.5 B | 61.7%
  Random Topology | 2.2 M | 0.4 B | 61.4%
  As-ViT Small | 29.0 M | 5.3 B | 81.2%
  Random Scaling | 24.2 M | 8.7 B | 80.5%
  As-ViT Base | 52.6 M | 8.9 B | 82.5%
  Random Scaling | 42.4 M | 15.5 B | 82.2%
  As-ViT Large | 88.1 M | 22.6 B | 83.5%
  Random Scaling | 81.1 M | 28.7 B | 83.2%

Disentangled Contributions from Topology and Scaling. To better verify the contributions of our searched topology and of our scaling rule, we conduct further ablation studies (Table 7). First, we directly train the searched topology before scaling: our searched seed topology is better than the best of the 87 random topologies in Figure 2. Second, we compare our complexity-based scaling rule against "random scaling + As-ViT topology": at different scales, our automated scaling is again better than random scaling.

5.3 OBJECT DETECTION ON COCO

Settings. Beyond image classification, we further evaluate our designed As-ViT on the detection task. Object detection is conducted on COCO 2017, which contains 118,000 training and 5,000 validation images. We adopt the popular Cascade Mask R-CNN as the object detection framework for our As-ViT. We use an input size of 1024×1024, the AdamW optimizer (initial learning rate 0.001), a weight decay of 0.0001, and a batch size of 256. The efficiently pretrained ImageNet-1k checkpoint (82.9% in Table 6) is leveraged as the initialization. We compare our As-ViT with a standard CNN (ResNet) and a previous Transformer network (Swin (Liu et al., 2021)). The comparisons are conducted by changing only the backbone, with all other settings unchanged. In Table 8 we can see that our As-ViT can also capture multi-scale features and achieves state-of-the-art detection performance, although it was designed on ImageNet and its complexity was measured for classification.

Table 8: Two-stage object detection and instance segmentation results. We compare different backbones with Cascade Mask R-CNN, using a single model without test-time augmentation.
  Backbone | Resolution | FLOPs | Params | AP^box (val) | AP^mask (val)
  ResNet-152 | 480-800 × 1333 | 527.7 B | 96.7 M | 49.1 | 42.1
  Swin-B (Liu et al., 2021) | 480-800 × 1333 | 982 B | 145 M | 51.9 | 45.0
  As-ViT Large (ours) | 1024 × 1024 | 1094.2 B | 138.8 M | 52.7 | 45.2
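The epoch allocations of Table 6 can be expressed as a small schedule mapping the current epoch to the first-projection stride used by the re-tokenization sketch in Section 4 (the 4× and 2× token reductions correspond to the (16, 5) and (8, 2) (stride, dilation) pairs of Figure 4). The numbers below follow the 37.4%-saving row of Table 6; the function name is our own.

```python
def retokenization_stride(epoch, schedule=((80, 16), (140, 8), (300, 4))):
    """Map a training epoch to the stride of the first projection layer.
    Default schedule follows the second row of Table 6:
    epochs 1-80 -> stride 16 (coarse), 81-140 -> stride 8, 141-300 -> stride 4 (full tokens)."""
    for last_epoch, stride in schedule:
        if epoch <= last_epoch:
            return stride
    return schedule[-1][1]

if __name__ == "__main__":
    for e in (1, 80, 81, 140, 141, 300):
        print(f"epoch {e:3d} -> stride {retokenization_stride(e)}")
```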
6 RELATED WORKS

6.1 VISION TRANSFORMER

Transformers (Vaswani et al., 2017) leverage self-attention to extract global correlations and have become the dominant models for natural language processing (NLP) (Devlin et al., 2018; Radford et al., 2018; Brown et al., 2020; Liu et al., 2019b). Recent works extended transformers to vision problems: image classification (Dosovitskiy et al., 2020), object detection (Carion et al., 2020; Zhu et al., 2020; Zheng et al., 2020; Dai et al., 2020; Sun et al., 2020), segmentation (Chen et al., 2020; Wang et al., 2020), etc. The Vision Transformer (ViT) (Dosovitskiy et al., 2020) designed a pure transformer architecture and achieved state-of-the-art performance on image classification. However, ViT heavily relies on large-scale datasets (ImageNet-21k (Deng et al., 2009), JFT-300M (Sun et al., 2017)) for pretraining, requiring huge computation resources. DeiT (Touvron et al., 2020) proposed Knowledge Distillation (KD) (Hinton et al., 2015; Yuan et al., 2020) via a special KD token to improve both performance and training efficiency. In contrast, our proposed As-ViT introduces more flexible tokenization, attention-splitting, and FFN-expansion strategies, discovered automatically.

6.2 NEURAL ARCHITECTURE DESIGN AND SCALE

Manual design of network architectures relies heavily on human priors, which is difficult to scale up. Recent works leverage AutoML to find optimal combinations of operators/topologies in a given search space (Zoph & Le, 2016; Real et al., 2019; Liu et al., 2018; Dong & Yang, 2019). However, the searched models are small due to the fixed and hand-crafted search spaces, far from being scaled up to modern networks. For example, models from the NASNet space (Zoph et al., 2018) only have 5M parameters, much smaller than real-world ones (20 to over 100M). One main reason for this lack of scalability is that NAS is a computation-consuming task, typically costing 1-2 GPU-days to search even small architectures. Meanwhile, many works try to grow a seed architecture into different variants. EfficientNet (Tan & Le, 2019) manually designed a scaling rule for width and depth. Given a template backbone with fixed depth, Liu et al. (2019a) grow the width by gradient descent. For ViT, we for the first time bring both architecture design and scaling together in one framework. To overcome the heavy computation of training transformers, we directly use the complexity of manifold propagation as a surrogate measure for training-free search and scaling.

6.3 EFFICIENT TRAINING

A number of methods have been developed to accelerate the training of deep neural networks, including mixed precision (Jia et al., 2018), distributed optimization (Cho et al., 2017), large-batch training (Goyal et al., 2017; Akiba et al., 2017; You et al., 2018), etc. Jia et al. (2018) combined distributed training with a mixed-precision framework. Wang et al. (2019) proposed to save the energy cost of deep CNN training via stochastic mini-batch dropping and selective layer updates.
In our work, customized progressive tokenization via changes of stride/dilation effectively reduces the number of tokens during ViT training, thus largely saving the training cost.

7 CONCLUSIONS

To automate the principled design of vision transformers without tedious human effort, we propose As-ViT, a unified framework that searches and scales ViTs without any training. Compared with hand-crafted ViT architectures, our As-ViT leverages more token overlaps and larger FFN expansion ratios, and is wider and shallower. Our As-ViT achieves state-of-the-art accuracies on both ImageNet-1k classification and COCO detection, which verifies the strong performance of our framework. Moreover, with progressive tokenization, we can train heavy ViT models with largely reduced training FLOPs and time. We hope our methodology encourages the efficient design and training of ViTs in both the transformer and the NAS communities.

ACKNOWLEDGEMENT

Z.W. is in part supported by the NSF AI Institute for Foundations of Machine Learning (IFML) and a Google TensorFlow Model Garden Award.

REFERENCES

Mohamed S Abdelfattah, Abhinav Mehrotra, Łukasz Dudziak, and Nicholas D Lane. Zero-cost proxies for lightweight NAS. arXiv preprint arXiv:2101.08134, 2021.

Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. Extremely large minibatch SGD: Training ResNet-50 on ImageNet in 15 minutes. CoRR, abs/1711.04325, 2017.

Kartikeya Bhardwaj, Guihong Li, and Radu Marculescu. How does topology influence gradient propagation and model performance of deep networks with densenet-type skip connections? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13498-13507, 2021.

Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S Davis. Soft-NMS: improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5561-5569, 2017.

Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. arXiv preprint arXiv:2005.12872, 2020.

Boyu Chen, Peixia Li, Chuming Li, Baopu Li, Lei Bai, Chen Lin, Ming Sun, Wanli Ouyang, et al. GLiT: Neural architecture search for global and local image transformer. arXiv preprint arXiv:2107.02960, 2021a.

Chun-Fu Chen, Quanfu Fan, and Rameswar Panda. CrossViT: Cross-attention multi-scale vision transformer for image classification. arXiv preprint arXiv:2103.14899, 2021b.

Hanting Chen, Yunhe Wang, Tianyu Guo, Chang Xu, Yiping Deng, Zhenhua Liu, Siwei Ma, Chunjing Xu, Chao Xu, and Wen Gao. Pre-trained image processing transformer. arXiv preprint arXiv:2012.00364, 2020.

Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, et al. Hybrid task cascade for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4974-4983, 2019.

Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 801-818, 2018.
Minghao Chen, Houwen Peng, Jianlong Fu, and Haibin Ling. AutoFormer: Searching transformers for visual recognition. arXiv preprint arXiv:2107.00651, 2021c.

Tianlong Chen, Yu Cheng, Zhe Gan, Lu Yuan, Lei Zhang, and Zhangyang Wang. Chasing sparsity in vision transformers: An end-to-end exploration. Advances in Neural Information Processing Systems, 34, 2021d.

Wuyang Chen, Xinyu Gong, and Zhangyang Wang. Neural architecture search on ImageNet in four GPU hours: A theoretically inspired perspective. International Conference on Learning Representations (ICLR), 2021e.

Wuyang Chen, Xinyu Gong, Yunchao Wei, Humphrey Shi, Zhicheng Yan, Yi Yang, and Zhangyang Wang. Understanding and accelerating neural architecture search with training-free and theory-grounded metrics. arXiv preprint arXiv:2108.11939, 2021f.

Minsik Cho, Ulrich Finkler, Sameer Kumar, David Kung, Vaibhav Saxena, and Dheeraj Sreedhar. PowerAI DDL. arXiv preprint arXiv:1708.02188, 2017.

Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. RandAugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 702-703, 2020.

Zhigang Dai, Bolun Cai, Yugeng Lin, and Junying Chen. UP-DETR: Unsupervised pre-training for object detection with transformers. arXiv preprint arXiv:2011.09094, 2020.

Stéphane d'Ascoli, Hugo Touvron, Matthew Leavitt, Ari Morcos, Giulio Biroli, and Levent Sagun. ConViT: Improving vision transformers with soft convolutional inductive biases. arXiv preprint arXiv:2103.10697, 2021.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255. IEEE, 2009.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Xuanyi Dong and Yi Yang. Searching for a robust neural architecture in four GPU hours. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1761-1770, 2019.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

Yonggan Fu, Haoran You, Yang Zhao, Yue Wang, Chaojian Li, Kailash Gopalakrishnan, Zhangyang Wang, and Yingyan Lin. FracTrain: Fractionally squeezing bit savings both temporally and spatially for efficient DNN training. Advances in Neural Information Processing Systems, 33:12127-12139, 2020.

Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

Kai Han, An Xiao, Enhua Wu, Jianyuan Guo, Chunjing Xu, and Yunhe Wang. Transformer in transformer. arXiv preprint arXiv:2103.00112, 2021.

Ali Hassani, Steven Walton, Nikhil Shah, Abulikemu Abuduweili, Jiachen Li, and Humphrey Shi. Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704, 2021.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

Jiri Hron, Yasaman Bahri, Jascha Sohl-Dickstein, and Roman Novak. Infinite attention: NNGP and NTK for deep attention networks. In International Conference on Machine Learning, pp. 4376-4386. PMLR, 2020.

Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In European Conference on Computer Vision, pp. 646-661. Springer, 2016.

Xianyan Jia, Shutao Song, Wei He, Yangzihao Wang, Haidong Rong, Feihu Zhou, Liqiang Xie, Zhenyu Guo, Yuanzhou Yang, Liwei Yu, et al. Highly scalable deep learning training system with mixed-precision: Training ImageNet in four minutes. arXiv preprint arXiv:1807.11205, 2018.

Yifan Jiang, Shiyu Chang, and Zhangyang Wang. TransGAN: Two pure transformers can make one strong GAN, and that can scale up. Advances in Neural Information Processing Systems, 34, 2021.

John M Lee. Riemannian manifolds: an introduction to curvature, volume 176. Springer Science & Business Media, 2006.

Changlin Li, Tao Tang, Guangrun Wang, Jiefeng Peng, Bing Wang, Xiaodan Liang, and Xiaojun Chang. BossNAS: Exploring hybrid CNN-transformers with block-wisely self-supervised neural architecture search. arXiv preprint arXiv:2103.12424, 2021.

Yi-Lun Liao, Sertac Karaman, and Vivienne Sze. Searching for efficient multi-stage vision transformers. arXiv preprint arXiv:2109.00642, 2021.

Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.

Qiang Liu, Lemeng Wu, and Dilin Wang. Splitting steepest descent for growing neural architectures. arXiv preprint arXiv:1910.02366, 2019a.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019b.

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030, 2021.

Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

Joseph Mellor, Jack Turner, Amos Storkey, and Elliot J Crowley. Neural architecture search without training. arXiv preprint arXiv:2006.04647, 2020.

Bowen Pan, Rameswar Panda, Yifan Jiang, Zhangyang Wang, Rogerio Feris, and Aude Oliva. IA-RED2: Interpretability-aware redundancy reduction for vision transformers. Advances in Neural Information Processing Systems, 34, 2021.

Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. arXiv preprint arXiv:1606.05340, 2016.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training, 2018.

Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár. Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10428-10436, 2020.
Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, and Alexey Dosovitskiy. Do vision transformers see like convolutional neural networks? arXiv preprint arXiv:2108.08810, 2021.

Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 4780-4789, 2019.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE International Conference on Computer Vision, pp. 843-852, 2017.

Zhiqing Sun, Shengcao Cao, Yiming Yang, and Kris Kitani. Rethinking transformer-based set prediction for object detection. arXiv preprint arXiv:2011.10881, 2020.

Mingxing Tan and Quoc V Le. EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.

Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. arXiv preprint arXiv:2012.12877, 2020.

Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going deeper with image transformers. arXiv preprint arXiv:2103.17239, 2021.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30:5998-6008, 2017.

Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. arXiv preprint arXiv:2102.12122, 2021.

Yue Wang, Ziyu Jiang, Xiaohan Chen, Pengfei Xu, Yang Zhao, Yingyan Lin, and Zhangyang Wang. E2-Train: Training state-of-the-art CNNs with over 80% energy savings. arXiv preprint arXiv:1910.13349, 2019.

Yuqing Wang, Zhaoliang Xu, Xinlong Wang, Chunhua Shen, Baoshan Cheng, Hao Shen, and Huaxia Xia. End-to-end video instance segmentation with transformers. arXiv preprint arXiv:2011.14503, 2020.

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229-256, 1992.

Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, and Lei Zhang. CvT: Introducing convolutions to vision transformers. arXiv preprint arXiv:2103.15808, 2021.

Lechao Xiao, Jeffrey Pennington, and Samuel S Schoenholz. Disentangling trainability and generalization in deep learning. arXiv preprint arXiv:1912.13053, 2019.

Greg Yang. Tensor programs II: Neural tangent kernel for any architecture. arXiv preprint arXiv:2006.14548, 2020.

Yang You, Zhao Zhang, Cho-Jui Hsieh, James Demmel, and Kurt Keutzer. ImageNet training in minutes. In Proceedings of the 47th International Conference on Parallel Processing (ICPP 2018), 2018. doi: 10.1145/3225058.3225069. URL http://dx.doi.org/10.1145/3225058.3225069.

Kun Yuan, Shaopeng Guo, Ziwei Liu, Aojun Zhou, Fengwei Yu, and Wei Wu. Incorporating convolution designs into visual transformers. arXiv preprint arXiv:2103.11816, 2021a.

Li Yuan, Francis EH Tay, Guilin Li, Tao Wang, and Jiashi Feng. Revisiting knowledge distillation via label smoothing regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3903-3911, 2020.
Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Francis EH Tay, Jiashi Feng, and Shuicheng Yan. Tokens-to-Token ViT: Training vision transformers from scratch on ImageNet. arXiv preprint arXiv:2101.11986, 2021b.

Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. CutMix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6023-6032, 2019.

Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big Bird: Transformers for longer sequences. In NeurIPS, 2020.

Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. arXiv preprint arXiv:2106.04560, 2021.

Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.

Pengchuan Zhang, Xiyang Dai, Jianwei Yang, Bin Xiao, Lu Yuan, Lei Zhang, and Jianfeng Gao. Multi-scale vision longformer: A new vision transformer for high-resolution image encoding. arXiv preprint arXiv:2103.15358, 2021.

Minghang Zheng, Peng Gao, Xiaogang Wang, Hongsheng Li, and Hao Dong. End-to-end object detection with adaptive clustering transformer. arXiv preprint arXiv:2011.09315, 2020.

Daquan Zhou, Bingyi Kang, Xiaojie Jin, Linjie Yang, Xiaochen Lian, Zihang Jiang, Qibin Hou, and Jiashi Feng. DeepViT: Towards deeper vision transformer. arXiv preprint arXiv:2103.11886, 2021.

Zhengguang Zhou, Wengang Zhou, Xutao Lv, Xuan Huang, Xiaoyu Wang, and Houqiang Li. Progressive learning of low-precision networks. arXiv preprint arXiv:1905.11781, 2019.

Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020.

Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.

Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8697-8710, 2018.

A IMPLEMENTATIONS

Training-free Topology Search and Scaling. We calculate L^E by uniformly sampling 10 θs in [0, 2π). For each architecture, the calculation of L^E is repeated five times with different (random) network initializations, and L^E is set to their mean.

Image Classification. We use 20 epochs of linear warm-up, a batch size of 1,024, an initial learning rate of 0.001, and a weight decay of 0.05. Augmentations including stochastic depth (Huang et al., 2016), Mixup (Zhang et al., 2017), CutMix (Yun et al., 2019), RandAug (Cubuk et al., 2020), and Exponential Moving Average (EMA) are also applied.

Object Detection. Our training adopts a batch size of 256 for 36 epochs, also with stochastic depth. We do not use any stronger techniques such as HTC (Chen et al., 2019), multi-scale testing, or Soft-NMS (Bodla et al., 2017).

B CONVERGENCE OF TRAINING-FREE SEARCH

[Figure 5: Entropy of the policy during our search (Section 3.3).]
To demonstrate the convergence of the policy learned by our RL search, we plot its entropy during training in Figure 5. We can see that 500 steps are enough for the policy to converge to low entropy (high confidence).
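For completeness, the sketch below is our own minimal illustration of the search loop of Algorithm 1 with the policy entropy logged as in Figure 5. The paper states that the policy is a joint categorical distribution updated by the policy gradient of Williams (1992); everything else (the plain-Python softmax policy, learning rate, and toy reward) is our assumption, with `reward_fn` standing in for the normalized L^E/κ_Θ reward.

```python
import math
import random

def reinforce_search(space, reward_fn, steps=500, lr=0.05, seed=0):
    """Training-free topology search with a joint categorical policy.

    space: dict mapping each choice name to its list of options (e.g. Table 1);
    reward_fn(topology) returns the training-free reward."""
    rng = random.Random(seed)
    logits = {k: [0.0] * len(v) for k, v in space.items()}  # uniform policy at start

    def probs(k):
        m = max(logits[k])
        e = [math.exp(l - m) for l in logits[k]]
        z = sum(e)
        return [x / z for x in e]

    for t in range(steps):
        # Sample one topology and compute its reward.
        idx = {k: rng.choices(range(len(space[k])), weights=probs(k))[0] for k in space}
        r = reward_fn({k: space[k][i] for k, i in idx.items()})
        # REINFORCE: nudge log-probabilities of the sampled choices in proportion to the reward.
        for k, i in idx.items():
            p = probs(k)
            for j in range(len(p)):
                logits[k][j] += lr * r * ((1.0 if j == i else 0.0) - p[j])
        entropy = -sum(p * math.log(p) for k in space for p in probs(k) if p > 0)
        if t % 100 == 0:
            print(f"step {t:3d}  reward {r:+.3f}  policy entropy {entropy:.2f}")
    # Return the most probable topology under the learned policy.
    return {k: space[k][max(range(len(space[k])), key=lambda j: probs(k)[j])] for k in space}

if __name__ == "__main__":
    toy_space = {"K1": [4, 5, 6, 7, 8], "E1": [2, 3, 4, 5, 6], "S1": [2, 4, 8]}
    # Toy reward favoring large kernels and expansions, standing in for the L^E / NTK scores.
    print(reinforce_search(toy_space, lambda t: 0.1 * (t["K1"] + t["E1"])))
```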