# LambdaNetworks: Modeling Long-Range Interactions Without Attention

Published as a conference paper at ICLR 2021

Irwan Bello, Google Research, Brain team. ibello@google.com

**Abstract.** We present lambda layers, an alternative framework to self-attention for capturing long-range interactions between an input and structured contextual information (e.g. a pixel surrounded by other pixels). Lambda layers capture such interactions by transforming available contexts into linear functions, termed lambdas, and applying these linear functions to each input separately. Similar to linear attention, lambda layers bypass expensive attention maps, but in contrast, they model both content and position-based interactions, which enables their application to large structured inputs such as images. The resulting neural network architectures, LambdaNetworks, significantly outperform their convolutional and attentional counterparts on ImageNet classification, COCO object detection and instance segmentation, while being more computationally efficient. Additionally, we design LambdaResNets, a family of hybrid architectures across different scales, that considerably improves the speed-accuracy tradeoff of image classification models. LambdaResNets reach excellent accuracies on ImageNet while being 3.2-4.4x faster than the popular EfficientNets on modern machine learning accelerators. In large-scale semi-supervised training with an additional 130M pseudo-labeled images, LambdaResNets achieve up to 86.7% ImageNet accuracy while being 9.5x faster than EfficientNet NoisyStudent and 9x faster than a Vision Transformer with comparable accuracies [1].

## 1 INTRODUCTION

Modeling long-range dependencies in data is a central problem in machine learning. Self-attention (Bahdanau et al., 2015; Vaswani et al., 2017) has emerged as a popular approach to do so, but the costly memory requirement of self-attention hinders its application to long sequences and multidimensional data such as images [2]. Linear (or efficient) attention mechanisms (Katharopoulos et al., 2020; Choromanski et al., 2020) offer a scalable remedy for high memory usage but fail to model internal data structure, such as relative distances between pixels or edge relations between nodes in a graph.

This work addresses both issues. We propose lambda layers, which model long-range interactions between a query and a structured set of context elements at a reduced memory cost. Lambda layers transform each available context into a linear function, termed a lambda, which is then directly applied to the corresponding query. Whereas self-attention defines a similarity kernel between the query and the context elements, a lambda layer instead summarizes contextual information into a fixed-size linear function (i.e. a matrix), thus bypassing the need for memory-intensive attention maps. This difference is illustrated in Figure 1.

Lambda layers are versatile and can be implemented to model both content-based and position-based interactions in global, local or masked contexts. The resulting neural networks, LambdaNetworks, are computationally efficient, model long-range dependencies at a small memory cost and can therefore be applied to large structured inputs such as high-resolution images.

[1] An updated version of this paper can be found on arXiv.
[2] For example, applying a single multi-head attention layer to a batch of 128 64x64 input images with 8 heads requires 64GB of memory, which is prohibitive in practice.
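As a quick sanity check of the figure in footnote [2], the attention maps alone for that configuration occupy 128 x 8 x 4096 x 4096 entries (a back-of-the-envelope illustration; float32 storage is an assumption):

```python
# Attention-map storage for footnote [2]: batch of 128 images at 64x64 resolution,
# 8 heads, one (n x n) attention map per head and per example, with n = 64 * 64.
batch, heads, n = 128, 8, 64 * 64
bytes_per_float = 4  # assuming float32
attention_map_bytes = batch * heads * n * n * bytes_per_float
print(f"{attention_map_bytes / 2**30:.0f} GiB")  # 64 GiB
```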
[Figure 1 diagram; panel labels: global context, queries, local contexts, attention maps, lambdas.]

Figure 1: Comparison between self-attention and lambda layers. (Left) An example of 3 queries and their local contexts within a global context. (Middle) Self-attention associates each query with an attention distribution over its context. (Right) The lambda layer transforms each context into a linear function lambda that is applied to the corresponding query.

We evaluate LambdaNetworks on computer vision tasks where works using self-attention are hindered by large memory costs (Wang et al., 2018; Bello et al., 2019), suffer impractical implementations (Ramachandran et al., 2019), or require vast amounts of data (Dosovitskiy et al., 2020). In our experiments spanning ImageNet classification, COCO object detection and instance segmentation, LambdaNetworks significantly outperform their convolutional and attentional counterparts, while being more computationally efficient and faster than the latter. We summarize our contributions:

- **Lambda layers:** a class of layers that model content-based and position-based interactions without materializing attention maps. Lambda layers offer a unifying view of channel, spatial and linear attention (Appendix D.4). Some of our observations, such as the computational benefits of a multi-query formulation, extend to linear attention. Lambda layers are easily implemented with einsum operations and convolution kernels, operations with efficient implementations on modern machine learning accelerators.
- Lambda layers significantly outperform their convolution and attention counterparts on the ImageNet classification task while being more computationally efficient. For example, simply replacing the 3x3 convolutions in the bottleneck blocks of the ResNet-50 architecture (He et al., 2016) with lambda layers yields a +1.5% top-1 ImageNet accuracy improvement while reducing parameters by 40% (Section 5.1).
- Lambda layers achieve considerable computational benefits, both in latency and memory requirements, over multiple self-attention alternatives, including local and axial attention (Ramachandran et al., 2019; Wang et al., 2020a). When used in a ResNet-50 architecture at image resolution 224, lambda layers reduce memory consumption by 200x compared to global attention (7x compared to axial attention) while being 3.7x faster than local attention (Section 5.2).
- A study of hybrid convolution-lambda models as a means to maximize the speed-accuracy tradeoff (Section 5.3). Hybrid designs that first employ convolutions at the highest resolution and lambda layers in intermediate to low resolutions achieve the best speed-accuracy tradeoff.
- **LambdaResNets:** a family of hybrids based on the training and scaling strategies recommended in Bello et al. (2021). LambdaResNets achieve up to a 4.4x speed-up over EfficientNets on ImageNet, while being more memory-efficient. LambdaResNets can also be designed for parameter or flops efficiency. For example, a LambdaResNet with 42M parameters achieves 84.3% top-1 ImageNet accuracy at image resolution 320 (Section E.4).
- In large-scale semi-supervised training with an additional 130M pseudo-labeled images, LambdaResNets achieve up to 86.7% top-1 ImageNet accuracy while being 9.5x faster than EfficientNet NoisyStudent (Xie et al., 2020) and 9x faster than a Vision Transformer (Dosovitskiy et al., 2020) with comparable accuracies (Section 5.3).
- An evaluation of LambdaResNets on COCO object detection and instance segmentation using Mask-RCNN (He et al., 2017). LambdaResNet backbones yield consistent gains across all metrics on both tasks (e.g. +1.8% mAP improvement for detecting small objects).

| Interaction | Definition |
|---|---|
| Content-based | Considers the content of the context but ignores the relation between the query position and the context (e.g. relative distance between two pixels). |
| Position-based | Considers the relation between the query position and the context position. |

Table 1: Definition of content-based vs position-based interactions.

## 2 MODELING LONG-RANGE INTERACTIONS

In this section, we formally define queries, contexts and interactions. Starting from first principles, we motivate keys and relative position embeddings as a requirement for capturing structured interactions between queries and their contexts. We then show that lambda layers arise as an alternative to attention mechanisms for capturing long-range interactions.

**Notation.** We denote scalars, vectors and tensors using lower-case, bold lower-case and bold upper-case letters, e.g., $n$, $\mathbf{x}$ and $\mathbf{X}$. We denote $|n|$ the cardinality of a set whose elements are indexed by $n$. We denote $\mathbf{x}_n$ the $n$-th row of $\mathbf{X}$ and $x_{ij}$ the $(i, j)$ elements of $\mathbf{X}$. When possible, we adopt the terminology of self-attention to ease readability and highlight differences.

### 2.1 MOTIVATING QUERIES, KEYS, POSITION EMBEDDINGS AND VALUES

**Defining queries and contexts.** Let $Q = \{(\mathbf{q}_n, n)\}$ and $C = \{(\mathbf{c}_m, m)\}$ denote structured collections of vectors, respectively referred to as the queries and the context. Each query $(\mathbf{q}_n, n)$ is characterized by its content $\mathbf{q}_n \in \mathbb{R}^{|k|}$ and position $n$. Similarly, each context element $(\mathbf{c}_m, m)$ is characterized by its content $\mathbf{c}_m$ and its position $m$ in the context. The $(n, m)$ pair may refer to any pairwise relation between structured elements, e.g. relative distances between pixels or edges between nodes in a graph.

**Defining interactions.** We consider the general problem of mapping a query $(\mathbf{q}_n, n)$ to an output vector $\mathbf{y}_n \in \mathbb{R}^{|v|}$ given the context $C$ with a function $F : ((\mathbf{q}_n, n), C) \mapsto \mathbf{y}_n$. Such a function may act as a layer in a neural network when processing structured inputs. We refer to $(\mathbf{q}_n, \mathbf{c}_m)$ interactions as content-based and $(\mathbf{q}_n, (n, m))$ interactions as position-based. We note that while absolute positional information is sometimes directly added to the query (or context element) content [3], we consider this type of interaction to be content-based as it ignores the relation $(n, m)$ between the query and context element positions.

**Introducing keys and relative position embeddings to capture long-range interactions.** In the context of deep learning, we prioritize fast batched linear operations and use dot-product operations as our interactions. This motivates introducing vectors that can interact with the queries via a dot-product operation and therefore have the same dimension as the queries. In particular, content-based interactions $(\mathbf{q}_n, \mathbf{c}_m)$ require a $|k|$-dimensional vector that depends on $\mathbf{c}_m$, commonly referred to as the key $\mathbf{k}_m$.
Conversely, position-based interactions $(\mathbf{q}_n, (n, m))$ require a relative position embedding $\mathbf{e}_{nm} \in \mathbb{R}^{|k|}$ (Shaw et al., 2018). As the query/key depth $|k|$ and context spatial dimension $|m|$ are not in the output $\mathbf{y}_n \in \mathbb{R}^{|v|}$, these dimensions need to be contracted as part of the layer computations. Therefore, every layer capturing long-range interactions can be characterized by whether it contracts (1) the query depth or (2) the context positions first.

### 2.2 ATTENTION VS LAMBDA LAYERS

**(1) Attention layers.** Contracting the query depth first creates a similarity kernel (the attention map) between the query and context elements and is known as the attention operation. As the number of context positions $|m|$ grows larger while the input and output dimensions $|k|$ and $|v|$ remain fixed, one may hypothesize that computing attention maps becomes wasteful, given that the layer output is a vector of comparatively small dimension $|v| \ll |m|$.

**(2) Lambda layers.** Instead, it may be more efficient to simply map each query to its output as $\mathbf{y}_n = F((\mathbf{q}_n, n), C) = \lambda(C, n)(\mathbf{q}_n)$ for some linear function $\lambda(C, n) : \mathbb{R}^{|k|} \to \mathbb{R}^{|v|}$. In this scenario, the context is aggregated into a fixed-size linear function $\lambda_n = \lambda(C, n)$. Each $\lambda_n$ acts as a small linear function [4] that exists independently of the context (once computed) and is discarded after being applied to its associated query $\mathbf{q}_n$.

[3] This approach is often used in natural language processing tasks (Vaswani et al., 2017) but has had limited success in the visual domain, where relative position information between pixels is crucial (Bello et al., 2019).
[4] This mechanism is reminiscent of functional programming and λ-calculus, which motivates the lambda terminology.
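The contraction-order distinction above can be made concrete with a small content-only NumPy example (a sketch with arbitrary toy sizes, not taken from the paper; softmax normalization and position embeddings are omitted so that the two orders compute the same product by associativity). The attention-style order materializes an |n| x |m| map, while the lambda-style order materializes a single |k| x |v| matrix:

```python
import numpy as np

# Toy sizes (illustration only): |n| queries, |m| context elements, depths |k| and |v|.
n, m, k, v = 16, 256, 8, 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(n, k))   # queries
K = rng.normal(size=(m, k))   # keys
V = rng.normal(size=(m, v))   # values

# (1) Attention-style: contract the query depth |k| first.
# This materializes an |n| x |m| similarity kernel (the attention map).
scores = np.einsum('nk,mk->nm', Q, K)
out_attention = np.einsum('nm,mv->nv', scores, V)

# (2) Lambda-style: contract the context positions |m| first.
# This materializes a single fixed-size |k| x |v| linear function instead.
lam = np.einsum('mk,mv->kv', K, V)
out_lambda = np.einsum('nk,kv->nv', Q, lam)

# With the normalization removed, both orders compute the same bilinear product
# (matrix-product associativity); only the intermediate quantities differ in size.
assert np.allclose(out_attention, out_lambda)
print(scores.size, lam.size)   # 4096 vs 64 intermediate entries
```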
## 3 LAMBDA LAYERS

### 3.1 LAMBDA LAYER: TRANSFORMING CONTEXTS INTO LINEAR FUNCTIONS

A lambda layer takes the inputs $\mathbf{X} \in \mathbb{R}^{|n| \times d_{in}}$ and the context $\mathbf{C} \in \mathbb{R}^{|m| \times d_c}$ as input and generates linear function lambdas that are then applied to the queries, yielding outputs $\mathbf{Y} \in \mathbb{R}^{|n| \times d_{out}}$. Without loss of generality, we assume $d_{in} = d_c = d_{out} = d$. As is the case with self-attention, we may have $\mathbf{C} = \mathbf{X}$. In the rest of this paper, we focus on a specific instance of a lambda layer and show that it captures long-range content and position-based interactions without materializing attention maps. Figure 2 presents the computational graph of the lambda layer. We first describe the lambda layer when applied to a single query $(\mathbf{q}_n, n)$.

[Figure 2 diagram: lambdas are generated from the context via linear projections and matrix multiplications, then applied to the queries. The figure's notation:]

- $|k|$, $|v|$: query and value depth
- $\mathbf{X} \in \mathbb{R}^{|n| \times d}$: inputs
- $\mathbf{C} \in \mathbb{R}^{|m| \times d}$: context
- $\mathbf{Q} = \mathbf{X}\mathbf{W}_Q \in \mathbb{R}^{|n| \times |k|}$: queries
- $\mathbf{K} = \mathbf{C}\mathbf{W}_K \in \mathbb{R}^{|m| \times |k|}$: keys
- $\mathbf{V} = \mathbf{C}\mathbf{W}_V \in \mathbb{R}^{|m| \times |v|}$: values
- $\bar{\mathbf{K}} = \sigma(\mathbf{K}) = \mathrm{softmax}(\mathbf{K}, \text{axis}=m)$: normalized keys
- $\mathbf{E}_n \in \mathbb{R}^{|m| \times |k|}$: relative position embeddings
- $\lambda^c = \bar{\mathbf{K}}^T\mathbf{V} \in \mathbb{R}^{|k| \times |v|}$: content lambda
- $\lambda^p_n = \mathbf{E}_n^T\mathbf{V} \in \mathbb{R}^{|k| \times |v|}$: position lambdas
- $\lambda_n = \lambda^c + \lambda^p_n \in \mathbb{R}^{|k| \times |v|}$: lambdas

Figure 2: Computational graph of the lambda layer. Contextual information for query position $n$ is summarized into a lambda $\lambda_n \in \mathbb{R}^{|k| \times |v|}$. Applying the lambda dynamically distributes contextual features to produce the output as $\mathbf{y}_n = \lambda_n^T \mathbf{q}_n$. This process captures content-based and position-based interactions without producing attention maps.

**Generating the contextual lambda function.** We wish to generate a linear function $\mathbb{R}^{|k|} \to \mathbb{R}^{|v|}$, i.e. a matrix $\lambda_n \in \mathbb{R}^{|k| \times |v|}$. The lambda layer first computes keys $\mathbf{K}$ and values $\mathbf{V}$ by linearly projecting the context, and keys are normalized across context positions via a softmax operation, yielding normalized keys $\bar{\mathbf{K}}$. The $\lambda_n$ matrix is obtained by using the normalized keys $\bar{\mathbf{K}}$ and position embeddings $\mathbf{E}_n$ to aggregate the values $\mathbf{V}$ as

$$\lambda_n = \sum_m (\bar{\mathbf{k}}_m + \mathbf{e}_{nm})\, \mathbf{v}_m^T = \underbrace{\bar{\mathbf{K}}^T \mathbf{V}}_{\text{content lambda}} + \underbrace{\mathbf{E}_n^T \mathbf{V}}_{\text{position lambda}} \;\in \mathbb{R}^{|k| \times |v|} \qquad (1)$$

where we also define the content lambda $\lambda^c$ and position lambda $\lambda^p_n$.

- The content lambda $\lambda^c$ is shared across all query positions $n$ and is invariant to permutation of the context elements. It encodes how to transform the query $\mathbf{q}_n$ solely based on the context content.
- The position lambda $\lambda^p_n$ depends on the query position $n$ via the position embedding $\mathbf{E}_n$. It encodes how to transform the query $\mathbf{q}_n$ based on the context elements $\mathbf{c}_m$ and their relative positions to the query $(n, m)$.

**Applying lambda to its query.** The query $\mathbf{q}_n \in \mathbb{R}^{|k|}$ is obtained from the input $\mathbf{x}_n$ via a learned linear projection, and the output of the lambda layer is obtained as

$$\mathbf{y}_n = \lambda_n^T \mathbf{q}_n = (\lambda^c + \lambda^p_n)^T \mathbf{q}_n \;\in \mathbb{R}^{|v|}. \qquad (2)$$

**Interpretation of lambda layers.** The columns of the $\lambda_n \in \mathbb{R}^{|k| \times |v|}$ matrix can be viewed as a fixed-size set of $|k|$ contextual features. These contextual features are aggregated based on the context's content (content-based interactions) and structure (position-based interactions). Applying the lambda then dynamically distributes these contextual features based on the query to produce the output as $\mathbf{y}_n = \sum_k q_{nk} \lambda_{nk}$. This process captures content and position-based interactions without producing attention maps and can be viewed as an efficient relative attention mechanism.

**Normalization.** One may modify Equations 1 and 2 to include non-linearities or normalization operations. Our experiments indicate that applying batch normalization (Ioffe & Szegedy, 2015) after computing the queries and the values is helpful.

### 3.2 A MULTI-QUERY FORMULATION TO REDUCE COMPLEXITY

**Complexity analysis.** For a batch of $|b|$ examples, each containing $|n|$ inputs, the number of arithmetic operations and the memory footprint required to apply our lambda layer are respectively Θ(bnmkv) and Θ(knm + bnkv). We still have a quadratic memory footprint with respect to the input length due to the $\mathbf{e}_{nm}$ relative position embeddings. However, this quadratic term does not scale with the batch size, as is the case with the attention operation which produces per-example attention maps. In practice, the hyperparameter $|k|$ is set to a small value (such as $|k|$=16) and we can process large batches of large inputs in cases where attention cannot (see Table 4). Additionally, position embeddings can be shared across lambda layers to keep their Θ(knm) memory footprint constant, whereas the memory footprint of attention maps scales with the number of layers [5].

[5] Attention maps typically need to be stored for back-propagation (Kitaev et al., 2020).

```python
def lambda_layer(queries, keys, embeddings, values):
  """Multi-query lambda layer."""
  # b: batch, n: input length, m: context length,
  # k: query/key depth, v: value depth,
  # h: number of heads, d: output dimension.
  content_lambda = einsum(softmax(keys), values, 'bmk,bmv->bkv')
  position_lambdas = einsum(embeddings, values, 'nmk,bmv->bnkv')
  content_output = einsum(queries, content_lambda, 'bhnk,bkv->bnhv')
  position_output = einsum(queries, position_lambdas, 'bhnk,bnkv->bnhv')
  output = reshape(content_output + position_output, [b, n, d])
  return output
```

Figure 3: Pseudo-code for the multi-query lambda layer. The position embeddings can be made to satisfy various conditions, such as translation equivariance, when computing positional lambdas (not shown). The lambda layer can be adapted to other tasks/modalities by adjusting the choice of embeddings (Section A.2).
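Below is a runnable NumPy rendering of the pseudo-code in Figure 3 (a sketch under the shape conventions listed in the comments; toy sizes and random inputs are illustrative only), followed by a check that the batched einsum formulation matches the per-query definition $y_n = (\lambda^c + \lambda^p_n)^T q_n$ from Equations 1 and 2:

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def lambda_layer(queries, keys, embeddings, values):
    """Multi-query lambda layer (NumPy sketch of Figure 3).

    queries:    [b, h, n, k]   h queries per position, all sharing the same lambda
    keys:       [b, m, k]
    embeddings: [n, m, k]      relative position embeddings
    values:     [b, m, v]
    returns:    [b, n, h * v]
    """
    b, h, n, k = queries.shape
    v = values.shape[-1]
    norm_keys = softmax(keys, axis=1)                                  # normalize over context positions m
    content_lambda = np.einsum('bmk,bmv->bkv', norm_keys, values)      # shared across positions n
    position_lambdas = np.einsum('nmk,bmv->bnkv', embeddings, values)  # one lambda per position n
    content_output = np.einsum('bhnk,bkv->bnhv', queries, content_lambda)
    position_output = np.einsum('bhnk,bnkv->bnhv', queries, position_lambdas)
    return (content_output + position_output).reshape(b, n, h * v)

# Toy sizes (assumptions for the example only).
b, n, m, k, v, h = 2, 6, 6, 4, 5, 3
rng = np.random.default_rng(0)
q = rng.normal(size=(b, h, n, k))
K = rng.normal(size=(b, m, k))
E = rng.normal(size=(n, m, k))
V = rng.normal(size=(b, m, v))
y = lambda_layer(q, K, E, V)                       # [b, n, h*v]

# Check against the per-query definition: y_n = (lambda_c + lambda_p_n)^T q_n.
lam_c = softmax(K[0], axis=0).T @ V[0]             # [k, v] content lambda for example 0
lam_p0 = E[0].T @ V[0]                             # [k, v] position lambda for n = 0
y_head0 = (lam_c + lam_p0).T @ q[0, 0, 0]          # [v] output of head 0 at position 0
assert np.allclose(y[0, 0, :v], y_head0)
```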
**Multi-query lambda layers reduce time and space complexities.** Recall that the lambda layer maps inputs $\mathbf{x}_n \in \mathbb{R}^d$ to outputs $\mathbf{y}_n \in \mathbb{R}^d$. As presented in Equation 2, this implies that $|v|$=d. Small values of $|v|$ may therefore act as a bottleneck on the feature vector $\mathbf{y}_n$, but larger output dimensions $|v|$ can incur an excessively large computational cost given our Θ(bnmkv) and Θ(knm + bnkv) time and space complexities. We propose to decouple the time and space complexities of our lambda layer from the output dimension $d$. Rather than imposing $|v|$=d, we create $|h|$ queries $\{\mathbf{q}^h_n\}$, apply the same lambda $\lambda_n$ to each query $\mathbf{q}^h_n$, and concatenate the outputs as $\mathbf{y}_n = \mathrm{concat}(\lambda_n \mathbf{q}^1_n, \ldots, \lambda_n \mathbf{q}^{|h|}_n)$. We now have $|v|$=d/$|h|$, which reduces complexity by a factor of $|h|$. The number of heads $|h|$ controls the size of the lambdas $\lambda_n \in \mathbb{R}^{|k| \times |d|/|h|}$ relative to the total size of the queries $\mathbf{q}_n \in \mathbb{R}^{|hk|}$. We refer to this operation as a multi-query lambda layer and present an implementation using einsum [6] in Figure 3. The lambda layer is robust to $|k|$ and $|h|$ hyperparameter choices (see Appendix E.1), which enables flexibility in controlling its complexity. We use $|h|$=4 in most experiments. We note that while this resembles the multi-head or multi-query (Shazeer, 2019) [7] attention formulation, the motivation is different. Using multiple queries in the attention operation increases representational power and complexity. In contrast, using multiple queries in the lambda layer decreases complexity and representational power (ignoring the additional queries).

[6] The einsum operation denotes general contractions between tensors of arbitrary dimensions. It is numerically equivalent to broadcasting its inputs to share the union of their dimensions, multiplying element-wise and summing across all dimensions not specified in the output.
[7] Shazeer (2019) proposes a multi-query formulation to speed up attention-based decoding.

**Extending the multi-query formulation to linear attention.** Finally, we point out that our analysis extends to linear attention, which can be viewed as a content-only lambda layer (see Appendix D.3 for a detailed discussion). We anticipate that the multi-query formulation can also bring computational benefits to linear attention mechanisms.

| Operation | Head configuration | Interactions | Time complexity | Space complexity |
|---|---|---|---|---|
| Attention | multi-head | content-only | Θ(bnm(hk + d)) | Θ(bhnm) |
| Relative attention | multi-head | content & position | Θ(bnm(hk + d)) | Θ(bhnm) |
| Linear attention | multi-head | content-only | Θ(bnkd) | Θ(bkd) |
| Lambda layer | multi-query | content & position | Θ(bnmkd/h) | Θ(knm + bnkd/h) |
| Lambda convolution | multi-query | content & position | Θ(bnrkd/h) | Θ(kr + bnkd/h) |

Table 2: Alternatives for capturing long-range interactions. The lambda layer captures content and position-based interactions at a reduced memory cost compared to relative attention (Shaw et al., 2018; Bello et al., 2019). Using a multi-query lambda layer reduces complexities by a factor of |h|. Additionally, position-based interactions can be restricted to a local scope by using the lambda convolution, which has linear complexity. b: batch size, h: number of heads/queries, n: input length, m: context length, r: local scope size, k: query/key depth, d: output dimension.

### 3.3 MAKING LAMBDA LAYERS TRANSLATION EQUIVARIANT

Using relative position embeddings $\mathbf{e}_{nm}$ enables making explicit assumptions about the structure of the context. In particular, translation equivariance (i.e. the property that shifting the inputs results in an equivalent shift of the outputs) is a strong inductive bias in many learning scenarios. We obtain translation equivariance in position interactions by ensuring that the position embeddings satisfy $\mathbf{e}_{nm} = \mathbf{e}_{t(n)t(m)}$ for any translation $t$. In practice, we define a tensor of relative position embeddings $\mathbf{R} \in \mathbb{R}^{|r| \times |k|}$, where $r$ indexes the possible relative positions for all $(n, m)$ pairs, and reindex [8] it into $\mathbf{E} \in \mathbb{R}^{|n| \times |m| \times |k|}$ such that $\mathbf{e}_{nm} = \mathbf{r}_{r(n,m)}$.

[8] We refer the reader to the code for more details.
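As an illustration of the reindexing described above, here is a minimal NumPy sketch for a 1D sequence with context equal to the input (the paper's code handles the 2D image case; this particular indexing scheme is an assumption for illustration) that builds E from a table R of per-offset embeddings so that $e_{nm}$ depends only on the offset $n - m$ and is therefore translation equivariant:

```python
import numpy as np

n = 6        # number of query/context positions (1D sequence, context = input)
k = 4        # query/key depth
num_rel = 2 * n - 1                      # possible offsets n - m lie in [-(n-1), n-1]
rng = np.random.default_rng(0)
R = rng.normal(size=(num_rel, k))        # one learned embedding per relative offset

# Reindex R into E[n, m, k] so that e_{nm} only depends on the offset n - m.
offsets = np.arange(n)[:, None] - np.arange(n)[None, :]   # [n, m] matrix of offsets n - m
E = R[offsets + n - 1]                                     # [n, m, k]

# Translation equivariance: shifting query and context positions by the same t
# leaves the embedding unchanged, i.e. e_{nm} = e_{t(n)t(m)}.
t = 2
assert np.allclose(E[1, 3], E[1 + t, 3 + t])
print(E.shape)   # (6, 6, 4)
```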
### 3.4 LAMBDA CONVOLUTION: LOCAL CONTEXTS ON THE GRID

Despite the benefits of long-range interactions, locality remains a strong inductive bias in many tasks. Using global contexts may prove noisy or computationally excessive. It may therefore be useful to restrict the scope of position interactions to a local neighborhood around the query position $n$, as is the case for local self-attention and convolutions. This can be done by zeroing out the relative embeddings for context positions $m$ outside of the desired scope. However, this strategy remains costly for large values of $|m|$ since the computations still occur; they are only being zeroed out.

**Lambda convolution.** In the case where the context is arranged on a multidimensional grid, we can equivalently compute positional lambdas from local contexts by using a regular convolution. We term this operation the lambda convolution. An n-dimensional lambda convolution can be implemented using an n-d depthwise convolution with channel multiplier or an (n+1)-d convolution that treats the $v$ dimension in $\mathbf{V}$ as an extra spatial dimension. We present both implementations in Appendix C.1. As the computations are now restricted to a local scope, the lambda convolution obtains linear time and memory complexities with respect to the input length [9]. The lambda convolution is readily usable with additional functionalities such as dilation and striding and enjoys optimized implementations on specialized hardware accelerators (Nickolls & Dally, 2010; Jouppi et al., 2017). This is in stark contrast to implementations of local self-attention that require materializing feature patches of overlapping query and context blocks (Parmar et al., 2018; Ramachandran et al., 2019), increasing memory consumption and latency (see Table 4).

[9] FLOPs (time complexity) are not necessarily a good proxy for latency on TPUs/GPUs. Even though the lambda convolution has linear time/space complexities, it can be slower than the global lambda layer in practice, especially for large convolution scope sizes. See Table 4 for an example.
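As a toy NumPy illustration of the zeroing-out strategy described above (sizes are arbitrary and chosen for illustration), the snippet below masks the relative embeddings outside a local scope before computing the position lambdas. By the equivalence stated above, a lambda convolution over the same scope produces the same positional lambdas, but without the wasted Θ(nm) work that the masked global einsum still performs:

```python
import numpy as np

n, k, v, scope = 8, 4, 4, 3      # 1D example; scope |m| = 3 covers offsets {-1, 0, +1}
rng = np.random.default_rng(0)
E = rng.normal(size=(n, n, k))   # relative position embeddings e_{nm} (context = input)
V = rng.normal(size=(n, v))      # values

# Zero out embeddings for context positions outside the local scope of each query.
offsets = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])     # |n - m|
E_local = np.where((offsets <= scope // 2)[..., None], E, 0.0)

# Position lambdas from the masked (but still global) einsum: Theta(n * m * k * v) work.
position_lambdas = np.einsum('nmk,mv->nkv', E_local, V)
print(position_lambdas.shape)    # (8, 4, 4): one |k| x |v| position lambda per query
```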
## 4 RELATED WORK

Table 2 reviews alternatives for capturing long-range interactions and contrasts them with the proposed multi-query lambda layer. We discuss related works in detail in Appendix D.

**Channel and linear attention.** The lambda abstraction, i.e. transforming available contexts into linear functions that are applied to queries, is quite general and therefore encompasses many previous works. Closest to our work are channel and linear attention mechanisms (Hu et al., 2018c; Katharopoulos et al., 2020; Choromanski et al., 2020). Such mechanisms also capture long-range interactions without materializing attention maps and can be viewed as specific instances of a content-only lambda layer. Lambda layers formalize and extend such approaches to consider both content-based and position-based interactions, enabling their use as a stand-alone layer on highly structured data such as images. Rather than attempting to closely approximate an attention kernel as is the case with linear attention, we focus on the efficient design of contextual lambda functions and repurpose a multi-query formulation (Shazeer, 2019) to further reduce computational costs.

**Self-attention in the visual domain.** In contrast to natural language processing tasks, where it is now the de-facto standard, self-attention has enjoyed steady but slower adoption in the visual domain (Wang et al., 2018; Bello et al., 2019; Ramachandran et al., 2019; Carion et al., 2020). Concurrently to this work, Dosovitskiy et al. (2020) achieve a strong 88.6% accuracy on ImageNet by pre-training a Transformer on sequences of image patches on a large-scale dataset of 300M images.

## 5 EXPERIMENTS

In subsequent experiments, we evaluate lambda layers on standard computer vision benchmarks: ImageNet classification (Deng et al., 2009), COCO object detection and instance segmentation (Lin et al., 2014). The visual domain is well suited to showcase the flexibility of lambda layers since (1) the memory footprint of self-attention becomes problematic for high-resolution imagery and (2) images are highly structured, making position-based interactions crucial.

**LambdaResNets.** We construct LambdaResNets by replacing the 3x3 convolutions in the bottleneck blocks of the ResNet architecture (He et al., 2016). When replacing all such convolutions, we simply denote the name of the layer being tested (e.g. conv + channel attention or lambda layer). We denote LambdaResNets the family of hybrid architectures described in Table 19 (Appendix F.2). Unless specified otherwise, all lambda layers use |k|=16, |h|=4 with a scope size of |m|=23x23 and are implemented as in Figure 3. Additional experiments and details can be found in the Appendix.

### 5.1 LAMBDA LAYERS OUTPERFORM CONVOLUTIONS AND ATTENTION LAYERS

We first consider the standard ResNet-50 architecture with input image size 224x224. In Table 3, we compare the lambda layer against (a) the standard convolution (i.e. the baseline ResNet-50), (b) channel attention (squeeze-and-excitation) and (c) multiple self-attention variants. The lambda layer strongly outperforms all baselines at a fraction of the parameter cost and notably obtains a +0.8% improvement over channel attention.

| Layer | Params (M) | top-1 |
|---|---|---|
| Conv (He et al., 2016) | 25.6 | 76.9 |
| Conv + channel attention (Hu et al., 2018c) | 28.1 | 77.6 (+0.7) |
| Conv + linear attention (Chen et al., 2018) | 33.0 | 77.0 |
| Conv + linear attention (Shen et al., 2018) | - | 77.3 (+1.2) |
| Conv + relative self-attention (Bello et al., 2019) | 25.8 | 77.7 (+1.3) |
| Local relative self-attention (Ramachandran et al., 2019) | 18.0 | 77.4 (+0.5) |
| Local relative self-attention (Hu et al., 2019) | 23.3 | 77.3 (+1.0) |
| Local relative self-attention (Zhao et al., 2020) | 20.5 | 78.2 (+1.3) |
| Lambda layer | 15.0 | 78.4 (+1.5) |
| Lambda layer (\|u\|=4) | 16.0 | 78.9 (+2.0) |

Table 3: Comparison of the lambda layer and attention mechanisms on ImageNet classification with a ResNet-50 architecture. The lambda layer strongly outperforms attention alternatives at a fraction of the parameter cost. All models are trained in mostly similar setups (see Appendix F.3) and we include the reported improvements compared to the convolution baseline in parentheses. See Appendix C.4 for a description of the |u| hyperparameter. Our implementation.

### 5.2 COMPUTATIONAL BENEFITS OF LAMBDA LAYERS OVER SELF-ATTENTION

In Table 4, we compare lambda layers against self-attention and present throughputs, memory complexities and ImageNet accuracies. Our results highlight the weaknesses of self-attention: self-attention cannot model global interactions due to large memory costs, axial self-attention is still memory-expensive and local self-attention is prohibitively slow. In contrast, the lambda layer can capture global interactions on high-resolution images and obtains a +1.0% improvement over local self-attention while being almost 3x faster [10].

[10] Latencies for local self-attention were provided privately by Ramachandran et al. (2019) based on an implementation that relies on query blocks and overlapping memory blocks (Parmar et al., 2018). Specialized attention kernels may greatly speed up local self-attention, making it a promising avenue for future research.
Additionally, positional embeddings can be shared across lambda layers to further reduce memory requirements, at a minimal degradation cost. Finally, the lambda convolution has linear memory complexity, which becomes practical for very large images as seen in detection or segmentation. We also find that the lambda layer outperforms local self-attention when controlling for the scope size [11] (78.1% vs 77.4% for |m|=7x7), suggesting that the benefits of the lambda layer go beyond improved speed and scalability.

[11] Note that the content-based lambda still captures global interactions.

| Layer | Space complexity | Memory (GB) | Throughput | top-1 |
|---|---|---|---|---|
| Global self-attention | Θ(blhn²) | 120 | OOM | OOM |
| Axial self-attention | Θ(blhn√n) | 4.8 | 960 ex/s | 77.5 |
| Local self-attention (7x7) | Θ(blhnm) | - | 440 ex/s | 77.4 |
| Lambda layer | Θ(lkn²) | 1.9 | 1160 ex/s | 78.4 |
| Lambda layer (\|k\|=8) | Θ(lkn²) | 0.95 | 1640 ex/s | 77.9 |
| Lambda layer (shared embeddings) | Θ(kn²) | 0.63 | 1210 ex/s | 78.0 |
| Lambda convolution (7x7) | Θ(lknm) | - | 1100 ex/s | 78.1 |

Table 4: The lambda layer reaches higher ImageNet accuracies while being faster and more memory-efficient than self-attention alternatives. Memory is reported assuming full precision for a batch of 128 inputs using default hyperparameters. The memory cost for storing the lambdas matches the memory cost of activations in the rest of the network and is therefore ignored. b: batch size, h: number of heads/queries, n: input length, m: context length, k: query/key depth, l: number of layers.

### 5.3 HYBRIDS IMPROVE THE SPEED-ACCURACY TRADEOFF OF IMAGE CLASSIFICATION

**Studying hybrid architectures.** In spite of the memory savings compared to self-attention, capturing global contexts with the lambda layer still incurs a quadratic time complexity (Table 2), which remains costly at high resolution. In Appendix E.2, we study hybrid designs that use standard convolutions to capture local contexts and lambda layers to capture global contexts. We find that such convolution-lambda hybrids have increased representational power at a negligible decrease in throughput compared to their purely convolutional counterparts.

**LambdaResNets significantly improve the speed-accuracy tradeoff of ImageNet classification.** We design a family of hybrids based on our study of hybrid architectures and the scaling/training strategies from Bello et al. (2021) (Section F.2). Figure 4 presents the speed-accuracy Pareto curve of LambdaResNets compared to EfficientNets (Tan & Le, 2019) on TPUv3 hardware.
In order to isolate the benefits of lambda layers, we additionally compare against the same architectures when replacing lambda layers by (1) standard 3x3 convolutions (denoted ResNet-RS w/o SE) and (2) 3x3 convolutions with squeeze-and-excitation (denoted ResNet-RS w/ SE). All architectures are trained for 350 epochs using the same regularization methods and evaluated at the same resolution they are trained at. LambdaResNets outperform the baselines across all scales on the speed-accuracy trade-off.

[Figure 4 plot: top-1 ImageNet accuracy vs. time per training step (s) for 1024 images, comparing LambdaResNet, EfficientNet, ResNet-RS w/ SE and ResNet-RS w/o SE on the speed-accuracy Pareto curve, with annotated 3.2x and 4.4x speed-ups relative to EfficientNet-B7.]

Figure 4: Speed-accuracy comparison between LambdaResNets and EfficientNets with matching training and regularization setups. LambdaResNets (annotated with (depth, image size)) are 3.2-4.4x faster than EfficientNets and 1.6-2.3x faster than ResNet-RS with squeeze-and-excitation, thus significantly improving the speed-accuracy Pareto curve of image classification. LambdaResNet-420 (image size 320) reaches a strong 84.9% top-1 accuracy, 0.9% over the corresponding architecture with standard 3x3 convolutions and 0.65% over the corresponding architecture with squeeze-and-excitation.

**Scaling to larger datasets with pseudo-labels.** We train LambdaResNets in a semi-supervised learning setting using 130M pseudo-labeled images from the JFT dataset, as done for training the EfficientNet-NoisyStudent checkpoints (Xie et al., 2020). Table 5 compares the throughputs and ImageNet accuracies of a representative set of models with similar accuracies when trained using the JFT dataset. LambdaResNet-152, trained and evaluated at image size 288, achieves a strong 86.7% top-1 ImageNet accuracy while being more parameter-efficient and 9.5x faster than the EfficientNet-NoisyStudent checkpoint with the same accuracy.

| Architecture | Params (M) | Train (ex/s) | Infer (ex/s) | ImageNet top-1 |
|---|---|---|---|---|
| LambdaResNet-152 | 51 | 1620 | 6100 | 86.7 |
| EfficientNet-B7 | 66 | 170 (9.5x) | 980 (6.2x) | 86.7 |
| ViT-L/16 | 307 | 180 (9.0x) | 640 (9.5x) | 87.1 |

Table 5: Comparison of models trained on extra data. ViT-L/16 is pre-trained on JFT and fine-tuned on ImageNet at resolution 384x384, while EfficientNet and LambdaResNet are co-trained on ImageNet and JFT pseudo-labels. Training and inference throughput is shown for 8 TPUv3 cores.

## 6 CONCLUSION

We propose a new class of layers, termed lambda layers, which provide a scalable framework for capturing structured interactions between inputs and their contexts. Lambda layers summarize available contexts into fixed-size linear functions, termed lambdas, that are directly applied to their associated queries. The resulting neural networks, LambdaNetworks, are computationally efficient and capture long-range dependencies at a small memory cost, enabling their application to large structured inputs such as high-resolution images. Extensive experiments on computer vision tasks showcase their versatility and superiority over convolutional and attentional networks. We introduce LambdaResNets, a family of hybrid LambdaNetworks which reach excellent ImageNet accuracies and achieve up to 9.5x speed-ups over the popular EfficientNets and Vision Transformers, significantly improving the speed-accuracy tradeoff of image classification models.
## ACKNOWLEDGMENTS

The author would like to thank Barret Zoph and William Fedus for endless discussions, fruitful suggestions and careful revisions; Jonathon Shlens, Mike Mozer, Prajit Ramachandran, Ashish Vaswani, Quoc Le, Neil Houlsby, Jakob Uszkoreit, Margaret Li and Krzysztof Choromanski for many insightful comments; Hedvig Rausing for the antarctic infographics; Zolan Brinnes for the OST; Andrew Brock and Sheng Li for assistance with profiling EfficientNets; Adam Kraft, Thang Luong and Hieu Pham for assistance with the semi-supervised experiments; and the Google Brain team for useful discussions on the paper.

## REFERENCES

Jimmy Ba, Geoffrey Hinton, Volodymyr Mnih, Joel Z. Leibo, and Catalin Ionescu. Using fast weights to attend to the recent past. 2016.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, 2015.
Irwan Bello, Hieu Pham, Quoc V. Le, Mohammad Norouzi, and Samy Bengio. Neural combinatorial optimization with reinforcement learning. 2016. URL http://arxiv.org/abs/1611.09940.
Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, and Quoc V. Le. Attention augmented convolutional networks. CoRR, abs/1904.09925, 2019. URL http://arxiv.org/abs/1904.09925.
Irwan Bello, William Fedus, Xianzhi Du, Ekin D. Cubuk, Aravind Srinivas, Tsung-Yi Lin, Jonathon Shlens, and Barret Zoph. Revisiting ResNets: Improved training methodologies and scaling rules. 2021.
Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer. 2020.
Denny Britz, Melody Y. Guan, and Minh-Thang Luong. Efficient attention using a fixed-size memory representation. CoRR, abs/1707.00110, 2017. URL http://arxiv.org/abs/1707.00110.
Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. 2019.
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. 2020.
Mark Chen, Alec Radford, Rewon Child, Jeff Wu, Heewoo Jun, Prafulla Dhariwal, David Luan, and Ilya Sutskever. Generative pretraining from pixels. 2020a. URL https://openai.com/blog/image-gpt/.
Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. UNITER: Universal image-text representation learning. 2020b.
Yunpeng Chen, Yannis Kalantidis, Jianshu Li, Shuicheng Yan, and Jiashi Feng. A2-Nets: Double attention networks. CoRR, abs/1810.11579, 2018. URL http://arxiv.org/abs/1810.11579.
Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.
Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, and Adrian Weller. Rethinking attention with performers. 2020.
Jean-Baptiste Cordonnier, Andreas Loukas, and Martin Jaggi. On the relationship between self-attention and convolutional layers. 2019. URL http://arxiv.org/abs/1911.03584.
Ekin D. Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V. Le. RandAugment: Practical automated data augmentation with a reduced search space. 2019.
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2019. doi: 10.18653/v1/P19-1285. URL https://www.aclweb.org/anthology/P19-1285.
Alexandre de Brébisson and Pascal Vincent. A cheap linear attention mechanism with fast lookups and fixed-size representations. 2016.
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009.
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. 2020.
William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. 2021.
David Ha, Andrew M. Dai, and Quoc V. Le. Hypernetworks. CoRR, abs/1609.09106, 2016. URL http://arxiv.org/abs/1609.09106.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2980-2988, 2017.
Tong He, Zhi Zhang, Hang Zhang, Zhongyue Zhang, Junyuan Xie, and Mu Li. Bag of tricks for image classification with convolutional neural networks. 2018.
Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, and Tim Salimans. Axial attention in multidimensional transformers. arXiv preprint arXiv:1912.12180, 2019.
Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, Quoc V. Le, and Adam Hartwig. Searching for MobileNetV3. 2019.
Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng Dai, and Yichen Wei. Relation networks for object detection. 2018a.
Han Hu, Zheng Zhang, Zhenda Xie, and Stephen Lin. Local relation networks for image recognition. arXiv preprint arXiv:1904.11491, 2019.
Jie Hu, Li Shen, Samuel Albanie, Gang Sun, and Andrea Vedaldi. Gather-excite: Exploiting feature context in convolutional neural networks. In Advances in Neural Information Processing Systems, 2018b.
Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018c.
Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger. Deep networks with stochastic depth. 2016.
Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Learning Representations, 2015.
Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. In-datacenter performance analysis of a tensor processing unit. SIGARCH Comput. Archit. News, 45(2):1-12, June 2017. ISSN 0163-5964. doi: 10.1145/3140659.3080246. URL http://doi.acm.org/10.1145/3140659.3080246.
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. 2020.
Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020.
Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big Transfer (BiT): General visual representation learning, 2020.
Jungkyu Lee, Taeryun Won, Tae Kwan Lee, Hyemin Lee, Geonmo Gu, and Kiho Hong. Compounding the performance improvements of assembled techniques in a convolutional neural network, 2020.
Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. VisualBERT: A simple and performant baseline for vision and language. 2019.
Xingyu Liao, Lingxiao He, Zhouwang Yang, and Chi Zhang. Video-based person re-identification via 3d convolutional networks and non-local attention. 2019.
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pp. 740-755. Springer, 2014.
Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, Aravindh Mahendran, Georg Heigold, Jakob Uszkoreit, Alexey Dosovitskiy, and Thomas Kipf. Object-centric learning with slot attention. 2020.
Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2017.
Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. 2019.
John Nickolls and William J. Dally. The GPU computing era. IEEE Micro, 30(2):56-69, 2010.
Jongchan Park, Sanghyun Woo, Joon-Young Lee, and In So Kweon. BAM: Bottleneck attention module. In British Machine Vision Conference, 2018.
Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Łukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. In International Conference on Machine Learning, 2018.
Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron C. Courville. FiLM: Visual reasoning with a general conditioning layer. CoRR, abs/1709.07871, 2017.
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. 2021. URL https://openai.com/blog/clip/.
Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jonathon Shlens. Stand-alone self-attention in vision models. CoRR, abs/1906.05909, 2019. URL http://arxiv.org/abs/1906.05909.
Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510-4520, 2018.
Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155, 2018.
Noam Shazeer. Fast transformer decoding: One write-head is all you need. 2019.
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. CoRR, abs/1701.06538, 2017. URL http://arxiv.org/abs/1701.06538.
Zhuoran Shen, Mingyuan Zhang, Shuai Yi, Junjie Yan, and Haiyu Zhao. Efficient attention: Self-attention with linear complexities. CoRR, abs/1812.01243, 2018. URL http://arxiv.org/abs/1812.01243.
Zhuoran Shen, Irwan Bello, Raviteja Vemulapalli, Xuhui Jia, and Ching-Hui Chen. Global self-attention networks for image recognition, 2020.
Aravind Srinivas, Tsung-Yi Lin, Niki Parmar, Jonathon Shlens, Pieter Abbeel, and Ashish Vaswani. Bottleneck transformers for visual recognition. 2021.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(56):1929-1958, 2014. URL http://jmlr.org/papers/v15/srivastava14a.html.
Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. VideoBERT: A joint model for video and language representation learning. 2019. URL http://arxiv.org/abs/1904.01766.
Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. CoRR, abs/1905.11946, 2019. URL http://arxiv.org/abs/1905.11946.
Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey. 2020.
Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. 2021.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998-6008, 2017.
Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In Advances in Neural Information Processing Systems, 2015.
Huiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. Axial-DeepLab: Stand-alone axial-attention for panoptic segmentation. 2020a.
Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. 2020b.
Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794-7803, 2018.
Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3-19, 2018.
Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Zhicheng Yan, Masayoshi Tomizuka, Joseph Gonzalez, Kurt Keutzer, and Peter Vajda. Visual transformers: Token-based image representation and processing for computer vision. 2020.
Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V. Le. Self-training with Noisy Student improves ImageNet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, and Vikas Singh. Nyströmformer: A Nyström-based algorithm for approximating self-attention. 2021.
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. Proceedings of Machine Learning Research, pp. 2048-2057. PMLR, 2015.
Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. 2019.
Hengshuang Zhao, Jiaya Jia, and Vladlen Koltun. Exploring self-attention for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. In International Conference on Learning Representations, 2017.

## CONTENTS

- 1 Introduction
- 2 Modeling Long-Range Interactions
  - 2.1 Motivating queries, keys, position embeddings and values
  - 2.2 Attention vs lambda layers
- 3 Lambda Layers
  - 3.1 Lambda layer: transforming contexts into linear functions
  - 3.2 A multi-query formulation to reduce complexity
  - 3.3 Making lambda layers translation equivariant
  - 3.4 Lambda convolution: local contexts on the grid
- 4 Related Work
- 5 Experiments
  - 5.1 Lambda layers outperform convolutions and attention layers
  - 5.2 Computational benefits of lambda layers over self-attention
  - 5.3 Hybrids improve the speed-accuracy tradeoff of image classification
- 6 Conclusion
- A Discussion
  - A.1 General discussion
  - A.2 Extending lambda layers to other modalities
- B Practical Modeling Recommendations
- C Additional Variants
  - C.1 Complete code with lambda convolution
  - C.2 Generating lambdas from masked contexts
  - C.3 Multi-head vs multi-query lambda layers
  - C.4 Adding expressivity with an extra dimension
- D Additional Related Work
  - D.1 Softmax attention
  - D.2 Sparse attention
  - D.3 Linear attention: connections and differences
  - D.4 Casting channel and spatial attention as lambda layers
  - D.5 Self-attention in the visual domain
  - D.6 HyperNetworks, expert models and context-dependent weights
- E Additional Experiments
  - E.1 Ablation study
  - E.2 Hybrid models study
  - E.3 Object detection and instance segmentation results
  - E.4 Parameter and FLOPs efficiency results
- F Experimental Details
  - F.1 Detailed LambdaResNets results
  - F.2 Architectural details
  - F.3 Training details
## A DISCUSSION

### A.1 GENERAL DISCUSSION

**How do lambda layers compare to the attention operation?** Lambda layers scale favorably compared to self-attention. Vanilla Transformers using self-attention have a Θ(blhn²) memory footprint, whereas LambdaNetworks have a Θ(lkn²) memory footprint (or Θ(kn²) when sharing positional embeddings across layers). This enables the use of lambda layers at higher resolution and on larger batch sizes. Additionally, the lambda convolution enjoys a simpler and faster implementation than its local self-attention counterpart. Finally, our ImageNet experiments show that lambda layers outperform self-attention, demonstrating that the benefits of lambda layers go beyond improved speed and scalability.

**How are lambda layers different than linear attention mechanisms?** Lambda layers generalize and extend linear attention formulations to capture position-based interactions, which is crucial for modeling highly structured inputs such as images (see Table 9 in Appendix E.1). As the aim is not to approximate an attention kernel, lambda layers allow for more flexible non-linearities and normalizations, which we also find beneficial (see Table 11 in Appendix E.1). Finally, we propose multi-query lambda layers as a means to reduce complexity compared to the multi-head (or single-head) formulation typically used in linear attention works. Appendix D.3 presents a detailed discussion of linear attention.

**How to best use lambda layers in the visual domain?** The improved scalability, speed and ease of implementation of lambda layers compared to global or local attention makes them a strong candidate for use in the visual domain. Our ablations demonstrate that lambda layers are most beneficial in the intermediate and low-resolution stages of vision architectures when optimizing for the speed-accuracy tradeoff. It is also possible to design architectures that rely exclusively on lambda layers, which can be more parameter- and flops-efficient. We discuss practical modeling recommendations in Appendix B.

### A.2 EXTENDING LAMBDA LAYERS TO OTHER MODALITIES

While this work focuses on static image recognition, we note that lambda layers may be instantiated to model structured interactions on structures as diverse as graphs, time series, spatial lattices, etc. We anticipate that lambda layers will be helpful in more modalities, including multimodal tasks.
We discuss masked contexts and auto-regressive tasks in Appendix C.2. Lambda layers can be instantiated on other tasks simply by adapting the choice of structural/position embeddings to the task of interest and following the pseudo-code presented in Figure 3. The choice of embeddings dictates the memory costs of the lambda layer. The assumption underlying the Θ(knm) space complexity of the lambda layer (Section 3.2) is that all examples in the batch share the same structure, i.e. relative position embeddings have shape k x n x m. This assumption does not hold when the data structure differs across examples (e.g. graphs with variable edge relations between nodes), in which case embeddings have shape b x k x n x m. In such cases, the lambda layer has Θ(bknm) space complexity, similar to self-attention.

## B PRACTICAL MODELING RECOMMENDATIONS

**I want to make it faster on TPUs/GPUs...** Hybrid models reach a better speed-accuracy tradeoff. Global contexts can be computationally wasteful, especially in the early high-resolution layers where features lack semantic information, and can be replaced by lambda convolutions with smaller scopes (e.g. |m|=5x5 or 7x7) or the standard 3x3 convolution. Additionally, using a hybrid can require less tuning when starting from a working model/training setup.

**I want to minimize FLOPs (e.g. embedded applications)...** Consider a hybrid with inverted bottlenecks, as done in Section E.4.2. To further reduce FLOPs, prefer lambda convolutions with smaller scopes (e.g. |m|=5x5 or 7x7).

**I encounter memory issues...** Memory footprint can be reduced by sharing position embeddings across layers (especially layers with the highest resolution). Using the lambda convolution is more memory-efficient. Reducing the query depth |k| or increasing the number of heads |h| also decreases memory consumption.

**I'm experiencing instability...** We found it important to initialize the γ parameter in the last batchnorm layer of the ResNet's bottleneck blocks to 0 (this is the default in most codebases); a sketch of this initialization follows this list. Normalizing the keys (i.e. with the softmax) along the context's length is important. Early experiments which employed 2 lambda layers sequentially in the same residual block were unstable, suggesting that using 2 lambda layers in sequence should be avoided.

**Which implementation of the lambda convolution should I use?** In our experiments using Tensorflow 1.x on TPUv3 hardware, we found both the n-d depthwise and (n+1)-d convolution implementations to have similar speed. We point out that this can vary across software/hardware stacks.

**What if my task doesn't require position-based interactions?** Computational costs in the lambda layer are dominated by position-based interactions. If your task doesn't require them, you can try the content-only lambda layer or any other linear attention mechanism. We recommend using the multi-query formulation (as opposed to the usual multi-head) and scaling other dimensions of the model.
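Referenced in the instability item above: a minimal sketch of the zero-γ initialization, assuming a TensorFlow/Keras setup (the framework choice and helper name are illustrative, not the paper's code):

```python
import tensorflow as tf

def final_bottleneck_batchnorm():
    """Last batch normalization layer of a ResNet bottleneck block.

    gamma is initialized to zero so the residual branch initially outputs zero
    and each block starts out close to the identity mapping.
    """
    return tf.keras.layers.BatchNormalization(gamma_initializer='zeros')
```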
C ADDITIONAL VARIANTS

C.1 COMPLETE CODE WITH LAMBDA CONVOLUTION

# b: batch, n: input length, m: context length, r: scope size,
# k: query/key depth, v: value depth, h: number of heads, d: output dimension.

def compute_position_lambdas(embeddings, values, impl='einsum'):
  if impl == 'einsum':
    # embeddings shape: [n, m, k]
    position_lambdas = einsum(embeddings, values, 'nmk,bmv->bnkv')
  else:
    # embeddings shape: [r, k]
    if impl == 'conv':
      embeddings = reshape(embeddings, [r, 1, 1, k])
      values = reshape(values, [b, n, v, 1])
      position_lambdas = conv2d(values, embeddings)
    elif impl == 'depthwise_conv':
      # Reshape and tile embeddings to [r, v, k] shape.
      embeddings = reshape(embeddings, [r, 1, k])
      embeddings = tile(embeddings, [1, v, 1])
      position_lambdas = depthwise_conv1d(values, embeddings)
    # Transpose from shape [b, n, v, k] to shape [b, n, k, v].
    position_lambdas = transpose(position_lambdas, [0, 1, 3, 2])
  return position_lambdas

def lambda_layer(queries, keys, embeddings, values, impl='einsum'):
  """Multi-query lambda layer."""
  # The keys are normalized with a softmax along the context length m.
  content_lambda = einsum(softmax(keys), values, 'bmk,bmv->bkv')
  position_lambdas = compute_position_lambdas(embeddings, values, impl=impl)
  content_output = einsum(queries, content_lambda, 'bhnk,bkv->bnhv')
  position_output = einsum(queries, position_lambdas, 'bhnk,bnkv->bnhv')
  output = reshape(content_output + position_output, [b, n, d])
  return output

Figure 5: Pseudo-code for the multi-query lambda layer and the 1d lambda convolution. An n-d lambda convolution can equivalently be implemented via a regular (n+1)-d convolution or an n-d depthwise convolution with channel multiplier. The embeddings can be made to satisfy various conditions (e.g. translation equivariance and masking) when computing positional lambdas with the einsum implementation.
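For concreteness, the einsum path of Figure 5 can be exercised end-to-end with NumPy as follows; this is a self-contained sketch with illustrative shapes, where the softmax normalizes the keys along the context length m:

import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

b, n, m, k, v, h = 2, 49, 49, 16, 64, 4
d = h * v                                          # output dimension
queries = np.random.randn(b, h, n, k)
keys = np.random.randn(b, m, k)
values = np.random.randn(b, m, v)
embeddings = np.random.randn(n, m, k)              # relative position embeddings

content_lambda = np.einsum('bmk,bmv->bkv', softmax(keys, axis=1), values)
position_lambdas = np.einsum('nmk,bmv->bnkv', embeddings, values)
content_output = np.einsum('bhnk,bkv->bnhv', queries, content_lambda)
position_output = np.einsum('bhnk,bnkv->bnhv', queries, position_lambdas)
output = (content_output + position_output).reshape(b, n, d)
print(output.shape)                                # (2, 49, 256)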
C.2 GENERATING LAMBDAS FROM MASKED CONTEXTS

In some applications, such as denoising tasks or auto-regressive training, it is necessary to restrict interactions to a sub-context C_n ⊂ C when generating λ_n for query position n. For example, parallel auto-regressive training requires masking the future to ensure that the output y_n only depends on past context positions m < n. Self-attention achieves this by zeroing out the irrelevant attention weights, i.e. setting a_nm = 0 for m ∉ C_n, thus guaranteeing that y_n = Σ_m a_nm v_m only depends on C_n. Similarly, one can block interactions between queries and masked context positions when generating lambdas by applying a mask before summing the contributions of context positions. As long as the mask is shared across all elements in the batch, computing masked lambdas does not require materializing per-example attention maps and the complexities are the same as in the global context case. See Figure 6 for an implementation.

C.3 MULTI-HEAD VS MULTI-QUERY LAMBDA LAYERS

In this section, we motivate using a multi-query formulation as opposed to the usual multi-head formulation used in self-attention. Figure 7 presents the implementation of a multi-head lambda layer. Table 6 compares complexities for multi-head and multi-query lambda layers. Using a multi-query formulation reduces computations by a factor of |h| (the number of queries per lambda) compared to the multi-head formulation. We also found in early experimentation that multi-query lambdas yield a better speed-accuracy trade-off. Additionally, the multi-head lambda layer does not enjoy a simple local implementation analogous to the lambda convolution.

def masked_lambda_layer(queries, normalized_keys, embeddings, values, mask):
  """Masked multi-query lambda layer.

  Args:
    queries: a tensor with shape [b, h, n, k].
    normalized_keys: a tensor with shape [b, m, k].
    embeddings: a tensor with shape [k, n, m].
    values: a tensor with shape [b, m, v].
    mask: a tensor of 0s and 1s with shape [n, m].
  """
  # We show the general case, but a cumulative sum may be faster for masking the future.
  # Note that each query now also has its own content lambda since every query
  # interacts with a different context.
  # Keys should be normalized by only considering the elements in their contexts.
  content_mu = einsum(normalized_keys, values, 'bmk,bmv->bmkv')
  content_lambdas = einsum(content_mu, mask, 'bmkv,nm->bnkv')
  embeddings = einsum(embeddings, mask, 'knm,nm->knm')  # apply the mask to the embeddings
  position_lambdas = einsum(embeddings, values, 'knm,bmv->bnkv')
  content_output = einsum(queries, content_lambdas, 'bhnk,bnkv->bnhv')
  position_output = einsum(queries, position_lambdas, 'bhnk,bnkv->bnhv')
  output = reshape(content_output + position_output, [b, n, d])
  return output

Figure 6: Pseudo-code for the masked multi-query lambda layer.

def multihead_lambda_layer(queries, keys, embeddings, values, impl='einsum'):
  """Multi-head lambda layer."""
  content_lambda = einsum(softmax(keys), values, 'bhmk,bhmv->bhkv')
  position_lambdas = einsum(embeddings, values, 'hnmk,bhmv->bnhkv')
  content_output = einsum(queries, content_lambda, 'bhnk,bhkv->bnhv')
  position_output = einsum(queries, position_lambdas, 'bhnk,bnhkv->bnhv')
  output = reshape(content_output + position_output, [b, n, d])
  return output

Figure 7: Pseudo-code for the multi-head lambda layer. This is only shown as an example, as we recommend the multi-query lambda layer instead.

Operation                  Time complexity   Space complexity
Multi-head lambda layer    Θ(bnmkd)          Θ(hknm + bnkd)
Multi-query lambda layer   Θ(bnmkd/h)        Θ(knm + bnkd/h)

Table 6: Complexity comparison between a multi-head and a multi-query lambda layer. Using a multi-query formulation reduces complexity by a factor |h| (the number of queries per lambda) compared to the standard multi-head formulation.

C.4 ADDING EXPRESSIVITY WITH AN EXTRA DIMENSION

We briefly experiment with a variant that enables increasing the cost of computing the lambdas while keeping the cost of applying them constant. This is achieved by introducing an additional dimension, termed the intra-depth, with corresponding hyperparameter |u|, in the keys, position embeddings and values. Each key (or positional embedding) is now a |k| x |u| matrix instead of a |k|-dimensional vector. Similarly, each value is now a |v| x |u| matrix instead of a |v|-dimensional vector. The lambdas are obtained by summing over context positions and the intra-depth dimension |u| and have shape |k| x |v| as in the default case. See Figure 8 for an implementation and Table 7 for the complexities. Experiments (see Appendix E.1) demonstrate that this variant results in accuracy improvements, but we find that using |u|=1 (i.e. the default case) is optimal when controlling for speed on modern machine learning accelerators.
def compute_position_lambdas(embeddings, values, impl='einsum'):
  """Computes position lambdas with intra-depth u."""
  if impl == 'conv':
    # values: [b, n, v, u] shape
    # embeddings: [r, 1, u, k] shape
    position_lambdas = conv2d(values, embeddings)
    # Transpose from shape [b, n, v, k] to shape [b, n, k, v].
    position_lambdas = transpose(position_lambdas, [0, 1, 3, 2])
  elif impl == 'einsum':
    # embeddings: [k, n, m, u] shape
    position_lambdas = einsum(embeddings, values, 'knmu,bmvu->bnkv')
  return position_lambdas

def lambda_layer(queries, keys, embeddings, values, impl='einsum'):
  """Multi-query lambda layer with intra-depth u."""
  content_lambda = einsum(softmax(keys), values, 'bmku,bmvu->bkv')
  position_lambdas = compute_position_lambdas(embeddings, values, impl=impl)
  content_output = einsum(queries, content_lambda, 'bhnk,bkv->bnhv')
  position_output = einsum(queries, position_lambdas, 'bhnk,bnkv->bnhv')
  output = reshape(content_output + position_output, [b, n, d])
  return output

Figure 8: Pseudo-code for the multi-query lambda layer with intra-depth |u|. Lambdas are obtained by reducing over the context positions and the intra-depth dimension. This variant allocates more computation for generating the lambdas while keeping the cost of applying them constant. The equivalent n-d lambda convolution can be implemented with a regular (n+1)-d convolution.

Operation                Time complexity   Space complexity
Lambda layer (|u| > 1)   Θ(bnmkud/h)       Θ(knmu + bnkv)

Table 7: Complexity for a multi-query lambda layer with intra-depth |u|.

D ADDITIONAL RELATED WORK

In this section, we review the attention operation and related works on improving its scalability. We discuss connections between lambda layers and channel, spatial or linear attention mechanisms and show how they can be cast as less flexible specific instances of lambda layers. We conclude with a brief review of self-attention in the visual domain and discuss connections with expert models.

D.1 SOFTMAX ATTENTION

Softmax attention. Softmax attention produces a distribution over the context for each query q_n as a_n = softmax(K q_n) ∈ R^{|m|}, where the keys K are obtained from the context C. The attention distribution a_n is then used to form a linear combination of the values V obtained from the context as y_n = V^T a_n = Σ_m a_nm v_m ∈ R^{|v|}. (The attention operation is sometimes instead used to point to specific context elements (Vinyals et al., 2015; Bello et al., 2016), which is not supported by lambda layers.) As we take a weighted sum of the values, we transform the query q_n into the output y_n and discard its attention distribution a_n. This operation captures content-based interactions, but not position-based interactions.

Relative attention. In order to model position-based interactions, relative attention (Shaw et al., 2018) introduces a learned matrix of |m| positional embeddings E_n ∈ R^{|m| x |k|} and computes the attention distribution as a_n = softmax((K + E_n) q_n) ∈ R^{|m|}. The attention distribution now also depends on the query position n relative to the positions of the context elements m. Relative attention therefore captures both content-based and position-based interactions.
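The two formulations above can be written out for a single query in NumPy (a sketch with arbitrary dimensions):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

m, k, v = 49, 16, 64
q_n = np.random.randn(k)                  # query at position n
K = np.random.randn(m, k)                 # keys from the context
V = np.random.randn(m, v)                 # values from the context
E_n = np.random.randn(m, k)               # positional embeddings for query position n

a_n = softmax(K @ q_n)                    # content-only attention distribution, shape [m]
y_n = V.T @ a_n                           # softmax attention output, shape [v]

a_n_rel = softmax((K + E_n) @ q_n)        # relative attention distribution
y_n_rel = V.T @ a_n_rel                   # captures content- and position-based interactions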
D.2 SPARSE ATTENTION

A significant challenge in applying (relative) attention to large inputs comes from the quadratic Θ(bnm) memory footprint required to store attention maps. Many recent works therefore propose to impose specific patterns on the attention maps as a means to reduce the context size |m| and consequently the memory footprint of the attention operation. These approaches include:
- local attention patterns (Dai et al., 2019; Parmar et al., 2018; Ramachandran et al., 2019)
- axial attention patterns (Ho et al., 2019; Wang et al., 2020a; Shen et al., 2020)
- static sparse attention patterns (Child et al.; Beltagy et al., 2020)
- dynamic sparse attention patterns (Kitaev et al., 2020)
See Tay et al. (2020) for a review. Their implementations can be rather complex, sometimes require low-level kernel implementations to get computational benefits, or may rely on specific assumptions about the shape of the inputs (e.g., axial attention). In contrast, lambda layers are simple to implement for both global and local contexts using simple einsum and convolution primitives and capture dense content- and position-based interactions with no assumptions on the input shape.

D.3 LINEAR ATTENTION: CONNECTIONS AND DIFFERENCES

Another approach to reduce the computational requirements of attention mechanisms consists in approximating the attention operation in linear space and time complexity, which is referred to as linear (or efficient) attention. Linear attention mechanisms date back to de Brébisson & Vincent (2016); Britz et al. (2017) and were later introduced in the visual domain by Chen et al. (2018); Shen et al. (2018). They are recently enjoying a resurgence of popularity, with many works modifying the popular Transformer architecture for sequential processing applications (Katharopoulos et al., 2020; Wang et al., 2020b; Choromanski et al., 2020; Xiong et al., 2021).

Linear attention via kernel factorization. Linear attention is typically obtained by reinterpreting attention as a similarity kernel and leveraging a low-rank kernel factorization as

Attention(Q, K, V) = softmax(Q K^T) V ≈ φ(Q) (φ(K)^T V)    (3)

for some feature function φ. Computing φ(K)^T V ∈ R^{|k| x |v|} first bypasses the need to materialize the attention maps φ(Q) φ(K)^T, and the operation therefore has linear complexity with respect to the input length |n|. Multiple choices for the feature function φ have been proposed. For example, Katharopoulos et al. (2020) use φ(x) = elu(x) + 1, while Choromanski et al. (2020) use positive orthogonal random features to approximate the original softmax attention kernel. In the visual domain, both Chen et al. (2018) and Shen et al. (2018) use φ(x) = softmax(x). This choice is made to guarantee that the rows of the (non-materialized) attention maps φ(Q) φ(K)^T sum to 1, as is the case in the regular attention operation.

We discuss the main differences between lambda layers and linear attention mechanisms.

1) Lambda layers extend linear attention to also consider position-based interactions. The kernel approximation from Equation 3 can be rewritten for a single query q_n as

y_n = (φ(K)^T V)^T φ(q_n)    (4)

which resembles the output of the content lambda y_n^c = (λ^c)^T q_n = (K̄^T V)^T q_n from Equation 1. Lambda layers extend linear attention mechanisms to also consider position-based interactions as

y_n = λ_n^T q_n = (λ^c + λ_n^p)^T q_n = ((K̄ + E_n)^T V)^T q_n    (5)

In the above equation, computing the position (or content) lambda has Θ(bmkv) time complexity. As the position lambdas are not shared across query positions n, this cost is repeated for all |n| queries, leading to a total time complexity of Θ(bnmkv). Unlike linear attention mechanisms, lambda layers therefore have quadratic time complexity with respect to the input length (in the global context case) because they consider position-based interactions.
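Equations 3-5 can be compared directly for a single query in NumPy; the sketch below takes φ to be a softmax over the context, one of the choices mentioned above:

import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

m, k, v = 49, 16, 64
q_n = np.random.randn(k)
K = np.random.randn(m, k)
V = np.random.randn(m, v)
E_n = np.random.randn(m, k)                 # position embeddings for query position n

# Equation 4 (linear attention, content only): y_n = (phi(K)^T V)^T phi(q_n)
y_linear = (softmax(K, axis=0).T @ V).T @ softmax(q_n, axis=0)

# Equation 5 (lambda layer): y_n = ((K_bar + E_n)^T V)^T q_n, with K_bar the softmax-normalized keys.
K_bar = softmax(K, axis=0)
lambda_n = (K_bar + E_n).T @ V              # per-query lambda, shape [k, v]
y_lambda = lambda_n.T @ q_n                 # captures both content- and position-based interactions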
2) Lambda layers do not necessarily attempt to approximate an attention kernel. While approximations of the attention kernel are theoretically motivated, we argue that they may be unnecessarily restrictive. For example, the kernel approximation in Equation 3 requires the same feature function φ on both Q and K and precludes the use of more flexible non-linearities and normalization schemes. In contrast, lambda layers do not attempt to approximate an attention kernel. This simplifies their design and allows for more flexible non-linearity and normalization schemes, which we find useful in our ablations (see Table 11 in Appendix E.1). Considering the position embeddings independently of the keys notably enables a simple and efficient local implementation with the lambda convolution. Approximating the relative attention kernel would require normalizing the position embeddings with the keys (i.e., φ(K + E_n) instead of φ(K) + E_n), which cannot be implemented in the local context case with a convolution.

3) The lambda abstraction reveals the computational benefits of the multi-query formulation. Finally, this work proposes to abstract the K̄^T V and E_n^T V matrices as linear functions (the content and position lambdas) that are directly applied to the queries. The lambda abstraction reveals the benefits of the multi-query formulation (as opposed to the traditional multi-head attention formulation) as a means to reduce computational costs.

D.4 CASTING CHANNEL AND SPATIAL ATTENTION AS LAMBDA LAYERS

We show that the lambda abstraction generalizes channel and spatial attention mechanisms, both of which can be viewed as specific instances of lambda layers. This observation is consistent with our experiments, which demonstrate that lambda layers outperform both channel and spatial attention while being more computationally efficient.

Channel attention. Channel attention mechanisms, such as Squeeze-and-Excitation (SE) (Hu et al., 2018c;b) and FiLM layers (Perez et al., 2017), recalibrate features via cross-channel interactions by aggregating signals from the entire feature map. In particular, the SE operation can be written as y_nk = w_k q_nk, where w_k is the excitation weight for channel k in the query q_n. This can be viewed as using a diagonal lambda which is shared across query positions, λ_n = diag(w_1, ..., w_|k|). Channel attention mechanisms have proven useful to complement convolutions but cannot be used as a stand-alone layer as they discard spatial information.

Spatial attention. Conversely, spatial attention mechanisms reweight each position based on signals aggregated from all channels (Xu et al., 2015; Park et al., 2018; Woo et al., 2018). These mechanisms can be written as y_nk = w_n q_nk, where w_n is the attention weight for position n in the input query Q. This can be viewed as using (position-dependent) scalar lambdas λ_n = w_n I, where I is the identity matrix. Spatial attention has also proven helpful to complement convolutions but cannot be used as a stand-alone layer as it discards channel information.
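The two special cases above can be written as explicit lambdas in a short NumPy sketch (the weights are random placeholders):

import numpy as np

n, k = 49, 16
Q = np.random.randn(n, k)                     # queries, one per position

# Channel attention (SE-style): a diagonal lambda shared across all positions.
w_channel = np.random.rand(k)                 # excitation weights, one per channel
lambda_channel = np.diag(w_channel)           # [k, k]
y_channel = Q @ lambda_channel                # y_nk = w_k * q_nk

# Spatial attention: a (position-dependent) scalar lambda, lambda_n = w_n * I.
w_spatial = np.random.rand(n)                 # one attention weight per position
y_spatial = w_spatial[:, None] * Q            # y_nk = w_n * q_nk

assert np.allclose(y_channel, Q * w_channel[None, :])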
D.5 SELF-ATTENTION IN THE VISUAL DOMAIN

Self-attention has been used in a myriad of tasks in the visual domain. These include image classification (Bello et al., 2019; Ramachandran et al., 2019; Cordonnier et al., 2019; Zhao et al., 2020; Wu et al., 2020; Dosovitskiy et al., 2020); object detection and object-centric tasks (Wang et al., 2018; Hu et al., 2018a; Carion et al., 2020; Locatello et al., 2020); video tasks (Sun et al., 2019; Liao et al., 2019); autoregressive/adversarial generative modeling (Parmar et al., 2018; Zhang et al., 2019; Brock et al., 2019; Chen et al., 2020a); and multi-modal text-vision tasks (Chen et al., 2020b; Lu et al., 2019; Li et al., 2019; Radford et al., 2021).

The first use of self-attention in vision dates back to the non-local block (Wang et al., 2018), which added a single-head global self-attention residual in the low-resolution stages of a ConvNet for long-range dependency modeling. The non-local block has proven useful to complement convolutions but cannot be used as a stand-alone layer as it does not model position-based interactions.

Global relative attention replaces convolutions at low resolution. Bello et al. (2019) introduced a 2d relative attention mechanism that proved competitive as a replacement to convolutions but gives even stronger results when used to concatenate convolutional features with self-attention features. The spatial convolutions in the bottleneck block of the ResNet architecture were replaced with a global multi-head self-attention mechanism with 2d relative position embeddings. Due to the large memory constraints of global attention, this operation was restricted to low-resolution feature maps and the proposed architecture was a conv-transformer hybrid. A similar hybrid design has recently been revisited by Srinivas et al. (2021) using modern training and scaling techniques. Rather than concatenating convolutional feature maps, Srinivas et al. (2021) propose to use a stride of 1 in the last stage of the ResNet architecture for improved performance.

Local/axial relative attention replaces convolutions at high resolution. The large memory footprint of global attention was quickly addressed by multiple works which proposed to limit the size of the attention contexts, such as local attention (Ramachandran et al., 2019; Hu et al., 2019) and axial attention (Ho et al., 2019; Wang et al., 2020a; Shen et al., 2020) (see Section D.2). Such approaches enable using attention at higher resolution and facilitate fully-attentional models, but can be slow due to the use of specialized attention patterns.

Scaling trumps inductive bias. Concurrently to this work, ViT (Dosovitskiy et al., 2020) proposes to simply apply attention on pixel patches (as opposed to individual pixels) as a remedy to the large memory requirements. While patch-based attention does not maintain accurate positional information or translation equivariance, the loss of inductive bias is recovered by pre-training on large-scale datasets (e.g. 300M images). Most remarkably, ViT achieves close to state-of-the-art accuracy when fine-tuned on the ImageNet dataset, while requiring less training compute than convolutional alternatives (Kolesnikov et al., 2020; Xie et al., 2020). This result has reinvigorated interest in using self-attention in the visual domain, with multiple follow-up works already building upon this approach (Touvron et al., 2021), though most such works advertise improvements over ViT on smaller datasets, which is not the intended purpose of ViT.
In spite of the impressive image classification results, concerns remain as to whether the patch-based approach can scale to larger images and transfer to tasks that require precise localization such as detection. We stress that reducing memory by working with pixel patches is orthogonal to the specific operation used and that lambda layers (or linear attention) can also be used on pixel patches.

D.6 HYPERNETWORKS, EXPERT MODELS AND CONTEXT-DEPENDENT WEIGHTS

LambdaNetworks generate their own computations, i.e. lambdas such that y_n = λ_n^T q_n. As such, they can alternatively be viewed as an extension of HyperNetworks (Ha et al., 2016) that dynamically generate their computations based on structured contextual information. The concept of generating context-dependent weights is also related to fast weights (Ba et al., 2016). Lastly, LambdaNetworks share some connections with sparsely-activated expert models (Shazeer et al., 2017; Fedus et al., 2021). Whereas sparsely-activated expert models select their computation (i.e. the lambda) from a bank of weights based on the input query, LambdaNetworks generate their computations based on contextual information.

E ADDITIONAL EXPERIMENTS

E.1 ABLATION STUDY

We perform several ablations and validate the importance of positional interactions, long-range interactions and flexible normalization schemes. Unless specified otherwise, all experimental results in this section report ImageNet accuracies obtained by training a LambdaNetwork architecture that replaces the spatial convolutions in the ResNet-50 with lambda layers.

Varying query depth, number of heads and intra-depth. Table 8 presents the impact of the query depth |k|, number of heads |h| and intra-depth |u| on performance (see Appendix C.4 for a presentation of the intra-depth |u|). Our experiments indicate that the lambda layer outperforms the convolutional and attentional baselines for a wide range of hyperparameters, demonstrating the robustness of the method.

|k|   |h|   |u|   Params (M)   top-1
ResNet baseline     25.6         76.9
8     2     1       14.8         77.2
8     16    1       15.6         77.9
2     4     1       14.7         77.4
4     4     1       14.7         77.6
8     4     1       14.8         77.9
16    4     1       15.0         78.4
32    4     1       15.4         78.4
2     8     1       14.7         77.8
4     8     1       14.7         77.7
8     8     1       14.7         77.9
16    8     1       15.1         78.1
32    8     1       15.7         78.5
8     8     4       15.3         78.4
8     8     8       16.0         78.6
16    4     4       16.0         78.9

Table 8: Ablations on the ImageNet classification task when using the lambda layer in a ResNet-50 architecture. All configurations outperform the convolutional baseline at a lower parameter cost. As expected, we get additional improvements by increasing the query depth |k| or intra-depth |u|. The number of heads is best set to intermediate values such as |h|=4. A large number of heads |h| excessively decreases the value depth |v| = d/|h|, while a small number of heads translates to too few queries, both of which hurt performance.

Content vs position interactions. Table 9 presents the relative importance of content-based and position-based interactions on the ImageNet classification task. We find that position-based interactions are crucial to reach high accuracies, while content-based interactions only bring marginal improvements over position-based interactions. (This observation is challenged by concurrent work (Dosovitskiy et al., 2020), which demonstrates that content-based interactions can be sufficient for image classification when pre-training on large-scale datasets, e.g. 300M images.)

Content   Position   Params (M)   FLOPS (B)   top-1
✓         –          14.9         5.0         68.8
–         ✓          14.9         11.9        78.1
✓         ✓          14.9         12.0        78.4

Table 9: Contributions of content and positional interactions. As expected, positional interactions are crucial to perform well on the image classification task.
Importance of scope size. The small memory footprint of LambdaNetworks enables considering global contexts, even at relatively high resolution. Table 10 presents FLOPS counts and top-1 ImageNet accuracies when varying scope sizes in a LambdaNetwork architecture. We find benefits from using larger scopes, with a plateau around |m|=15x15, which validates the importance of longer-range interactions compared to the usual 3x3 spatial convolutions used in the ResNet architecture. In our main experiments, we choose |m|=23x23 as the default to account for experiments that use larger image sizes.

Scope size |m|    3x3    7x7    15x15   23x23   31x31   global
FLOPS (B)         5.7    6.1    7.8     10.0    12.4    19.4
Top-1 accuracy    77.6   78.2   78.5    78.3    78.5    78.4

Table 10: Impact of varying the scope size for positional lambdas on the ImageNet classification task. We replace the 3x3 spatial convolutions in the last 2 stages of a ResNet-50 with lambda layers (input image size is 224x224). FLOPS increase significantly with the scope size; however, we stress that larger scopes do not translate to slower latencies when using the einsum implementation (see Figure 3).

Normalization. Table 11 ablates normalization operations in the design of the lambda layer. We find that normalizing the keys is crucial for performance and that other normalization functions besides the softmax can be considered. Applying batch normalization to the queries and values is also helpful.

Normalization                                    top-1
Softmax on keys (default)                        78.4
Softmax on keys & softmax on queries             78.1
L2 normalization on keys                         78.0
No normalization on keys                         70.0
No batch normalization on queries and values     76.2

Table 11: Impact of normalization schemes in the lambda layer. Keys are normalized along the context spatial dimension m; queries are normalized along the query depth k.

E.2 HYBRID MODELS STUDY

In this section, we study hybrid designs that use standard convolutions to capture local contexts and lambda layers to capture global contexts. (We could alternatively use the lambda convolution to capture local contexts.)

Where are lambda layers most useful? Table 12 presents the throughputs and accuracies of hybrid LambdaNetwork architectures as a function of the placement of convolutions and lambda layers in a ResNet-50 architecture. We observe that lambda layers are most helpful in the last two stages (commonly referred to as c4 and c5) when considering their speed-accuracy tradeoff. We refer to architectures that replace the 3x3 convolutions in the last 2 stages of the ResNet with lambda layers as LambdaResNet-C4.

Further pushing the speed-accuracy Pareto frontier. In Table 13, we further study how throughput and accuracy are impacted by the number of lambda layers in the c4 stage. Our results reveal that most benefits from lambda layers can be obtained by (a) replacing a few 3x3 convolutions with lambda layers in the c4 stage and (b) replacing all 3x3 convolutions in c5. The resulting hybrid LambdaResNets architectures have increased representational power at a virtually negligible decrease in throughput compared to their vanilla ResNet counterparts. Table 19 presents the detailed block configurations and placement of lambda layers for our family of LambdaResNets.
Architecture   Params (M)   Throughput   top-1
C C C C        25.6         7240 ex/s    76.9
L C C C        25.5         1880 ex/s    77.3
L L C C        25.0         1280 ex/s    77.2
L L L C        21.7         1160 ex/s    77.8
L L L L        15.0         1160 ex/s    78.4
C L L L        15.1         2200 ex/s    78.3
C C L L        15.4         4980 ex/s    78.3
C C C L        18.8         7160 ex/s    77.3

Table 12: Hybrid models achieve a better speed-accuracy trade-off. Inference throughput and top-1 accuracy as a function of lambda (L) vs convolution (C) layer placement in a ResNet-50 architecture on 224x224 inputs. Lambda layers in the c5 stage incur almost no speed decrease compared to standard 3x3 convolutions. Lambda layers in the c4 stage are relatively slower than standard 3x3 convolutions but yield significant accuracy gains.

Config                  Image size   Params (M)   Throughput   top-1
ResNet-101 wo/ SE       224          44.6         4600 ex/s    81.3
ResNet-101 w/ SE        224          63.6         4000 ex/s    81.8
LambdaResNet-101        224          36.9         4040 ex/s    82.3
LambdaResNet-101-C4     224          26.0         2560 ex/s    82.6
ResNet-152 wo/ SE       256          60.2         2780 ex/s    82.5
ResNet-152 w/ SE        256          86.6         2400 ex/s    83.0
LambdaResNet-152        256          51.4         2400 ex/s    83.4
LambdaResNet-152-C4     256          35.1         1480 ex/s    83.4

Table 13: Impact of the number of lambda layers in the c4 stage of LambdaResNets. Most benefits from lambda layers can be obtained by having a few lambda layers in the c4 stage. Such hybrid designs maximize the speed-accuracy tradeoff. LambdaResNet-C4 architectures exclusively employ lambda layers in c4 and c5. LambdaResNet block configurations can be found in Table 19. Models are trained for 350 epochs on the ImageNet classification task.

Comparing hybrid lambda vs attention models. The memory savings of lambda layers compared to attention are less significant in the aforementioned hybrid design, since the operations occur at lower resolution. Therefore, it is natural to ask whether lambda layers still have benefits over self-attention when considering hybrid designs. We consider our largest hybrid as an example (see Table 19). LambdaResNet-420 is trained on 320x320 inputs, employs 8 lambda layers in c4 and can fit 32 examples per TPU-v3 core. This adds up to a cost of 38.4MB for the lambda layers (4.8MB if sharing positional embeddings), whereas using attention layers instead would incur 0.625GB. The increase might not be significant in practice and it will be interesting to carefully benchmark the hybrid attention variants (we will benchmark such architectures in a future version of this draft). We point out that the experiments from Table 4 suggest that the benefits of lambda layers go beyond improved scalability, and we stress that the memory savings are more pronounced for tasks that require larger inputs, such as object detection.

E.3 OBJECT DETECTION AND INSTANCE SEGMENTATION RESULTS

In Table 14, we evaluate LambdaResNets as backbones in Mask R-CNN (He et al., 2017) on the COCO object detection and instance segmentation tasks. Using lambda layers yields consistent gains across all object sizes, especially the small objects which are the hardest to locate. This indicates that lambda layers are also competitive for more complex visual tasks that require localization information.
Backbone           AP_bb COCO    AP_bb s/m/l                   AP_mask COCO   AP_mask s/m/l
ResNet-101         48.2          29.9 / 50.9 / 64.9            42.6           24.2 / 45.6 / 60.0
ResNet-101 + SE    48.5 (+0.3)   29.9 (+0.0) / 51.5 / 65.3     42.8 (+0.2)    24.0 (-0.2) / 46.0 / 60.2
LambdaResNet-101   49.4 (+1.2)   31.7 (+1.8) / 52.2 / 65.6     43.5 (+0.9)    25.9 (+1.7) / 46.5 / 60.8
ResNet-152         48.9          29.9 / 51.8 / 66.0            43.2           24.2 / 46.1 / 61.2
ResNet-152 + SE    49.4 (+0.5)   30.0 (+0.1) / 52.3 / 66.7     43.5 (+0.3)    24.6 (+0.4) / 46.8 / 61.8
LambdaResNet-152   50.0 (+1.1)   31.8 (+1.9) / 53.4 / 67.0     43.9 (+0.7)    25.5 (+1.3) / 47.3 / 62.0

Table 14: COCO object detection and instance segmentation with the Mask R-CNN architecture on 1024x1024 inputs. We compare LambdaResNets against ResNets with or without squeeze-and-excitation (SE) and report Mean Average Precision (AP) for small, medium and large objects (AP s/m/l). Using lambda layers yields consistent gains across all object sizes, especially small objects.

E.4 PARAMETER AND FLOPS EFFICIENCY RESULTS

E.4.1 COMPUTATIONAL EFFICIENCY COMPARISONS TO LARGE EFFICIENTNETS

In Table 15 and Table 16, we showcase the parameter and FLOPS efficiency of LambdaNetworks. We find that LambdaResNet-C4, which replaces the 3x3 convolutions in the last 2 stages of the ResNet architecture, where they incur the highest parameter costs, improves upon the parameter and FLOPS efficiency of large EfficientNets. These results are significant because EfficientNets were specifically designed by neural architecture search (Zoph & Le, 2017) to minimize computational costs using highly computationally efficient depthwise convolutions (Tan & Le, 2019).

Architecture            Image size   Params (M)   top-1
EfficientNet-B6         528x528      43           84.0
LambdaResNet-152-C4     320x320      35           84.0
LambdaResNet-200-C4     320x320      42           84.3

Table 15: Parameter-efficiency comparison between LambdaResNet-C4 and EfficientNet-B6. LambdaResNet-C4 is more parameter-efficient in spite of using a smaller image size. Increasing the image size would likely result in improved accuracy while keeping the number of parameters fixed. Models are trained for 350 epochs.

Architecture                      Image size   FLOPS (G)   top-1
EfficientNet-B6                   528x528      38          84.0
LambdaResNet-270-C4 (|m|=7x7)     256x256      34          84.0

Table 16: FLOPS-efficiency comparison between LambdaResNet-C4 and EfficientNet-B6. We use smaller local scopes (|m|=7x7) to reduce FLOPS in the lambda layers. Models are trained for 350 epochs.

E.4.2 LAMBDA LAYERS IN A RESOURCE-CONSTRAINED SCENARIO

Lastly, we briefly study lambda layers in a resource-constrained scenario using the MobileNetV2 architecture (Sandler et al., 2018). MobileNets (Howard et al., 2017; Sandler et al., 2018; Howard et al., 2019) employ lightweight inverted bottleneck blocks which consist of the following sequence: 1) a pointwise convolution for expanding the number of channels, 2) a depthwise convolution for spatial mixing and 3) a final pointwise convolution for channel mixing. The use of a depthwise convolution (as opposed to a regular convolution) reduces parameters and FLOPS, making inverted bottlenecks particularly well-suited for embedded applications.

Lightweight lambda block. We construct a lightweight lambda block as follows. We replace the depthwise convolution in the inverted bottleneck with a lambda convolution with a small scope size |m|=5x5, query depth |k|=32 and number of heads |h|=4.
We also change the first pointwise convolution to output the same number of channels (instead of increasing the number of channels) to further reduce computations.

Adding lambda layers in MobileNetV2. We wish to assess whether lambda layers can improve the FLOPS-accuracy (or parameter-accuracy) tradeoff of MobileNet architectures. We experiment with a simple strategy of replacing a few inverted bottlenecks with our proposed lightweight lambda block, so that the resulting architectures have similar computational demands as their baselines. A simple procedure of replacing the 10-th and 16-th inverted bottleneck blocks with lightweight lambda blocks in the MobileNetV2 architecture reduces parameters and FLOPS by 10% while improving ImageNet accuracy by 0.6%. This suggests that lambda layers may be well suited for use in resource-constrained scenarios such as embedded vision applications (Howard et al., 2017; Sandler et al., 2018; Howard et al., 2019).

Architecture                                     Params (M)   FLOPS (M)   top-1
MobileNetV2                                      3.50         603         72.7
MobileNetV2 with 2 lightweight lambda blocks     3.21         563         73.3

Table 17: Lambda layers improve ImageNet accuracy in a resource-constrained scenario. Replacing the 10-th and 16-th inverted bottleneck blocks with lightweight lambda blocks in the MobileNetV2 architecture reduces parameters and FLOPS by 10% while improving ImageNet accuracy by 0.6%.

F EXPERIMENTAL DETAILS

F.1 DETAILED LAMBDARESNETS RESULTS

Depth   Image size   Latency (s)   Supervised top-1   Pseudo-labels top-1
50      128          0.058         77.4               82.1
50      160          0.089         79.2               83.4
101     160          0.14          80.8               84.7
101     192          0.20          81.9               85.4
152     192          0.28          82.5               86.1
152     224          0.38          83.2               86.5
152     256          0.49          83.8               -
152     288          0.63          -                  86.7
270     256          0.91          84.2               -
350     256          1.16          84.4               -
350     288          1.48          84.5               -
350     320          1.91          84.7               -
420     320          2.25          84.9               -

Table 18: Detailed LambdaResNets results. Latency refers to the time per training step for a batch size of 1024 on 8 TPU-v3 cores using bfloat16 activations.

F.2 ARCHITECTURAL DETAILS

Lambda layer implementation details. Unless specified otherwise, all lambda layers use query depth |k|=16, |h|=4 heads and intra-depth |u|=1. The position lambdas are generated with local contexts of size |m|=23x23 and the content lambdas with the global context, using the einsum implementation as described in Figure 3. Local positional lambdas can be implemented interchangeably with the lambda convolution or by using the global einsum implementation and masking the position embeddings outside of the local contexts (Figure 5). The latter can be faster but has a higher FLOPS and memory footprint due to the Θ(knm) term (see Table 2). In our experiments, we use the convolution implementation only for input length |n| > 852 or intra-depth |u| > 1. When the intra-depth is increased to |u| > 1, we switch to the convolution implementation and reduce the scope size to |m|=7x7 to reduce FLOPS. Positional embeddings are initialized at random using the unit normal distribution N(0, 1). We use fan-in initialization for the linear projections in the lambda layer. The projections to compute K and V are initialized at random from the N(0, |d|^{-1/2}) distribution. The projection to compute Q is initialized at random from the N(0, |kd|^{-1/2}) distribution (this is similar to the scaled dot-product attention mechanism, except that the scaling is absorbed in the projection). We apply batch normalization on Q and V, and the keys K are normalized via a softmax operation.
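A NumPy sketch of the initializations just described (dense projection weights are assumed, and the second argument of N(·, ·) is read as the standard deviation; shapes are illustrative):

import numpy as np

d, k, v, n, m = 256, 16, 64, 49, 49                        # illustrative dimensions
rng = np.random.default_rng(0)

E = rng.normal(0.0, 1.0, size=(k, n, m))                   # position embeddings ~ N(0, 1)
W_k = rng.normal(0.0, d ** -0.5, size=(d, k))              # keys projection, fan-in init
W_v = rng.normal(0.0, d ** -0.5, size=(d, v))              # values projection, fan-in init
W_q = rng.normal(0.0, (k * d) ** -0.5, size=(d, k))        # queries projection absorbs the 1/sqrt(k) scaling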
ResNets. We use the ResNet-v1 implementation and initialize the γ parameter in the last batch normalization (Ioffe & Szegedy, 2015) layer of the bottleneck blocks to 0. Squeeze-and-Excitation layers employ a squeeze ratio of 4. Similarly to ResNet-RS (Bello et al., 2021), we use the ResNet-D modification (He et al., 2018) and additionally replace the max pooling layer in the stem by a strided 3x3 convolution. Our block allocation and scaling strategy (i.e. selected resolution as a function of model depth) also closely follow the scaling recommendations from ResNet-RS (Bello et al., 2021).

LambdaResNets. We construct our LambdaResNets by replacing the spatial 3x3 convolutions in the bottleneck blocks of the ResNet-RS architectures with our proposed lambda layer, with the exception of the stem, which is left unchanged. We apply 3x3 average-pooling with stride 2 after the lambda layers to downsample in place of the strided convolution. Lambda layers are uniformly spaced in the c4 stage and all bottlenecks in c5 use lambda layers. Table 19 presents the exact block configuration and the location of the lambda layers for our hybrid LambdaResNets. We do not use squeeze-and-excitation in the bottleneck blocks that employ a lambda layer instead of the standard 3x3 convolution.

Model               Block configuration   Lambda layers in c4
LambdaResNet-50     [3-4-6-3]             3
LambdaResNet-101    [3-4-23-3]            6, 12, 18
LambdaResNet-152    [3-8-36-3]            5, 10, 15, 20, 25, 30
LambdaResNet-200    [3-24-36-3]           5, 10, 15, 20, 25, 30
LambdaResNet-270    [4-29-53-4]           8, 16, 24, 32, 40, 48
LambdaResNet-350    [4-36-72-4]           10, 20, 30, 40, 50, 60
LambdaResNet-420    [4-44-87-4]           10, 20, 30, 40, 50, 60, 70, 80

Table 19: Block configurations and lambda layer placement of the LambdaResNets in the Pareto curves. LambdaResNets use the block allocations from He et al. (2016); Bello et al. (2021).

F.3 TRAINING DETAILS

ImageNet training setups. We consider two training setups for the ImageNet classification task. The 90 epochs training setup trains models for 90 epochs using standard preprocessing and allows for fair comparisons with classic works. The 350 epochs training setup trains models for 350 epochs using improved data augmentation and regularization and is closer to the training methodologies used in modern works with state-of-the-art accuracies.

Supervised ImageNet 90 epochs training setup with vanilla ResNet. In the 90 epoch setup, we use the vanilla ResNet for fair comparison with prior works. We used the default hyperparameters as found in official implementations without doing additional tuning. All networks are trained end-to-end for 90 epochs via backpropagation using SGD with momentum 0.9. The batch size B is 4096, distributed across 32 TPUv3 cores (Jouppi et al., 2017), and the weight decay is set to 1e-4. The learning rate is scaled linearly from 0 to 0.1B/256 for 5 epochs and then decayed using the cosine schedule (Loshchilov & Hutter, 2017). We use batch normalization with decay 0.9999 and exponential moving average with weight 0.9999 over the trainable parameters, and a label smoothing of 0.1. The input image size is set to 224x224. We use standard training data augmentation (random crops and horizontal flip with 50% probability). Most works compared against in Table 3 use a similar training setup and also replace the 3x3 spatial convolutions in the ResNet architecture by their proposed methods.
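A small sketch of the 90-epoch schedule described above (linear warmup to 0.1·B/256 over 5 epochs, then cosine decay; the assumption that the cosine decays to zero is ours):

import math

def learning_rate(epoch, total_epochs=90, batch_size=4096, warmup_epochs=5):
    peak = 0.1 * batch_size / 256
    if epoch < warmup_epochs:
        return peak * epoch / warmup_epochs                        # linear warmup from 0
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))       # cosine decay

print([round(learning_rate(e), 3) for e in (0, 5, 45, 89)])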
We note that Ramachandran et al. (2019) train for longer (130 epochs instead of 90) but do not use label smoothing, which could confound our comparisons.

Supervised ImageNet 350 epochs training setup. Higher accuracies on ImageNet are commonly obtained by training longer with increased augmentation and regularization (Lee et al., 2020; Tan & Le, 2019). Similarly to Bello et al. (2021), the weight decay is reduced to 4e-5 and we employ RandAugment (Cubuk et al., 2019) with 2 layers, dropout (Srivastava et al., 2014) and stochastic depth (Huang et al., 2016). See Table 20 for the exact hyperparameters. All architectures are trained for 350 epochs with a batch size B of 4096 or 2048 distributed across 32 or 64 TPUv3 cores, depending on memory constraints. We tuned our models using a held-out validation set comprising 2% of the ImageNet training set (20 shards out of 1024). We perform early stopping on the held-out validation set for the largest models, starting with LambdaResNet-350 at resolution 288x288, and simply report the final accuracies for the smaller models.

Depth   Image size   RandAugment magnitude   Dropout   Stochastic depth rate
50      128          10                      0.2       0
50      160          10                      0.2       0
101     160          10                      0.3       0
101     192          15                      0.2       0
152     192          15                      0.3       0
152     224          15                      0.3       0.1
152     256          15                      0.3       0.1
152     288          15                      0.3       0.1
270     256          15                      0.3       0.1
350     256          15                      0.3       0.2
350     288          15                      0.3       0.2
350     320          15                      0.3       0.2
420     320          15                      0.3       0.2

Table 20: Hyperparameters used to train LambdaResNets. We train for 350 epochs with RandAugment, dropout and stochastic depth.

Semi-supervised learning with pseudo-labels. Our training setup closely follows the experimental setup from Xie et al. (2020). We use the same dataset of 130M filtered and balanced JFT images, with pseudo-labels generated by an EfficientNet-L2 model with 88.4% ImageNet accuracy. Hyperparameters are the same as for the supervised ImageNet 350 epochs experiments.

Latency measurements. Figure 4 reports training latencies (i.e. time per training step) to process a batch of 1024 images on 8 TPUv3 cores using mixed-precision training (i.e. bfloat16 activations). Training latency is originally measured on 8 TPUv3 cores, starting with a total batch size of 1024 (i.e. 128 per core) and dividing the batch size by 2 until it fits in memory. We then report the normalized latencies in Figure 4. For example, if latency was measured with a batch size of 512 (instead of 1024), we normalize the reported latency by multiplying the measured latency by 2. Table 4, Table 12 and Table 13 report inference throughput on 8 TPUv3 cores using full precision (i.e. float32 activations). Latency for ViT (Dosovitskiy et al., 2020) was privately communicated by the authors.

FLOPS count. We do not count zeroed-out FLOPS when computing positional lambdas with the einsum implementation from Figure 3. The FLOPS count is highly dependent on the scope size, which is rather large by default (|m|=23x23). In Table 10, we show that it is possible to significantly reduce the scope size, and therefore FLOPS, at a minimal degradation in performance.

COCO object detection. We employ the architecture from the improved ImageNet training setup as the backbone in the Mask R-CNN architecture. All models are trained on 1024x1024 images from scratch for 130k steps with a batch size of 256 distributed across 128 TPUv3 cores with synchronized batch normalization. We apply multi-scale jitter of [0.1, 2.0] during training. The learning rate is warmed up for 1000 steps from 0 to 0.32 and divided by 10 at 90%, 95% and 97.5% of training. The weight decay is set to 4e-5.
MobileNet training setup. All MobileNet architectures are trained for 350 epochs on ImageNet with standard preprocessing at 224x224 resolution. We use the same hyperparameters as Howard et al. (2019). More specifically, we use RMSProp with 0.9 momentum and a batch size of 4096 split across 32 TPUv3 cores. The learning rate is warmed up linearly to 0.1 and then multiplied by 0.99 every 3 epochs. We use a weight decay of 1e-5 and dropout with a drop probability of 0.2.