SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention

Róbert Csordás¹, Piotr Piękos², Kazuki Irie³, Jürgen Schmidhuber²,⁴
¹Stanford University, Stanford, CA, USA
²AI Initiative, KAUST, Thuwal, Saudi Arabia
³Center for Brain Science, Harvard University, Cambridge, MA, USA
⁴The Swiss AI Lab IDSIA, USI & SUPSI, Lugano, Switzerland
rcsordas@stanford.edu, piotr.piekos@kaust.edu.sa, kirie@fas.harvard.edu, juergen@idsia.ch

Work done at IDSIA.
38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Abstract

Despite many recent works on Mixture-of-Experts (MoE) approaches for resource-efficient Transformer language models, existing methods mostly focus on MoEs for feedforward layers. Previous attempts at extending MoE to the self-attention layer fail to match the performance of the parameter-matched baseline. Our novel SwitchHead is an effective MoE method for the attention layer that successfully reduces both the compute and memory requirements, achieving wall-clock speedup, while matching the language modeling performance of the baseline Transformer. Our novel MoE mechanism allows SwitchHead to compute up to 8 times fewer attention matrices than the standard Transformer. SwitchHead can also be combined with MoE feedforward layers, resulting in fully-MoE "SwitchAll" Transformers. For our 262M-parameter model trained on C4, SwitchHead matches the perplexity of standard models with only 44% of the compute and 27% of the memory usage. Zero-shot experiments on downstream tasks confirm the performance of SwitchHead, e.g., achieving more than 3.5% absolute improvement on BLiMP compared to the baseline with an equal compute budget.¹

¹Our code is public: https://github.com/robertcsordas/switchhead

1 Introduction

Figure 1: A schematic representation of SwitchHead. It consists of a few independent heads, each with multiple experts for value and output projections. Each head has a single attention matrix.

Large language models (LLMs) have shown remarkable capabilities [1, 2, 3, 4] and great versatility [5]. However, training large Transformers [6, 7] requires a considerable amount of computing power and memory, which is not accessible to most researchers, academic institutions, and even companies. Even running them in inference mode, which is typically much less resource-intensive, requires significant engineering effort [8]. Accelerating Transformers thus remains an important research question.

In this context, Mixture of Experts (MoE) layers [9, 10, 11] have become popular for efficiently scaling up Transformers to a large number of parameters [12, 13, 14, 15, 16, 17]. However, most of these works apply MoE only to the 2-layer feedforward blocks [6], i.e., the multi-layer perceptron (MLP) components of the Transformer, while keeping the self-attention layers unchanged. Given that attention also accounts for a considerable amount of compute and memory usage in Transformers (especially for long context sizes), using MoE for attention has the potential to further improve resource efficiency. While MoE-based attention remains underexplored in general, there are existing works on MoE approaches for attention [18, 19]. However, in practice, previously proposed methods typically require many engineering tricks for successful training and, most importantly, only achieve a modest reduction in compute and memory requirements in the end (as we also confirm in our experiments).
Here, we present a novel MoE-based attention method, SwitchHead, whose mechanism allows us to reduce the number of attention matrices that need to be computed and stored. Following σ-MoE [17], our method uses a non-competitive selection activation function (sigmoid) and does not require regularization or extra tricks for stable training. Importantly, we show that it is possible to compute the MoE projections outside of the attention core, which enables a significant reduction in the number of computed attention maps and thus significant resource savings. Our thorough investigation shows that it is enough to choose the value and output projections from a pool of experts and to share keys and queries between them. We evaluate our method on C4 [20], Enwik8 [21], peS2o [22], and WikiText-103 [23], with two model sizes (47M and 262M). Additionally, we measure the zero-shot performance of our main models on the Lambada [24], BLiMP [25], and Children's Book Test [26] datasets. Our experiments demonstrate that SwitchHead can achieve performance comparable to parameter-matched baselines with just a fraction of the compute and memory budget. In addition, we introduce SwitchAll, a fully MoE-based Transformer model that combines a σ-MoE-based MLP layer with our SwitchHead attention, often outperforming dense baselines with the same parameter budget. Finally, we analyze the attention maps of our SwitchHead. We find that the attention maps taken over all heads are qualitatively similar to those of the dense baselines, indicating a significant reduction in redundancy without a loss of expressivity. In addition, expert selections are often interpretable.

2.1 Background

The standard multi-head self-attention (MHA) layer [6] consists of four major steps: (1) compute the key, query, and value projections, (2) compute the attention matrix, (3) use the attention matrix to project the values, and (4) map the projected values to the output. Let $h, T, n_{\text{heads}}, d_{\text{model}}, d_{\text{head}}$ denote positive integers. Let $x \in \mathbb{R}^{T \times d_{\text{model}}}$ denote the input to the MHA layer with $n_{\text{heads}}$ heads, where $T$ is the sequence length and $d_{\text{model}}$ is the size of the hidden representations of the model. $W^h_{\{K,V,Q\}} \in \mathbb{R}^{d_{\text{model}} \times d_{\text{head}}}$ are the projection matrices for head $h \in \{1, ..., n_{\text{heads}}\}$. Then $K^h = x W^h_K$, $Q^h = x W^h_Q$, and $V^h = x W^h_V$ (thus $K^h, Q^h, V^h \in \mathbb{R}^{T \times d_{\text{head}}}$) are the keys, queries, and values, respectively. The attention matrix of head $h$, $A^h \in \mathbb{R}^{T \times T}$, and the output $y \in \mathbb{R}^{T \times d_{\text{model}}}$ are calculated as follows:

$A^h = \mathrm{softmax}\left(\tfrac{1}{\sqrt{d_{\text{head}}}} Q^h K^{h\top}\right)$   (1)

$y = \left(A^1 V^1 \,|\, A^2 V^2 \,|\, ... \,|\, A^{n_{\text{heads}}} V^{n_{\text{heads}}}\right) W_O$   (2)

where $|$ denotes concatenation in the last dimension, the softmax(·) is also taken over the last dimension, and $W_O \in \mathbb{R}^{n_{\text{heads}} d_{\text{head}} \times d_{\text{model}}}$. However, an alternative formulation reflects the role of $W_O$ better. Let us divide $W_O$ along the first dimension into submatrices for each head, $W^h_O \in \mathbb{R}^{d_{\text{head}} \times d_{\text{model}}}$, such that $W_O = W^1_O | W^2_O | ... | W^{n_{\text{heads}}}_O$. In this case, the output (Eq. 2) can be equivalently written as:

$y = \sum_h A^h V^h W^h_O$   (3)

From this, it can be seen that all computations are local to each head. Computing the attention matrix $A^h$ and the readout $A^h V^h$ requires compute on the order of $O(n_{\text{heads}} d_{\text{head}} T^2)$ MACs (multiply-accumulate operations²). During training, it requires storing $O(n_{\text{heads}} T^2)$ values for the attention matrices and $O(n_{\text{heads}} T d_{\text{head}})$ values for the sub-results of the projections. Given a sufficiently long sequence, computing the attention matrix and projecting the values dominate the compute requirements due to the quadratic dependence on the sequence length $T$.
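As a reference for later sections, the following is a minimal PyTorch-style sketch of the per-head formulation in Eqs. 1-3. All names and sizes here are ours and purely illustrative; positional encodings, masking, and batching are omitted, so this is not the implementation used in the experiments.

```python
import torch
import torch.nn.functional as F

def multi_head_attention(x, W_K, W_Q, W_V, W_O):
    """Per-head MHA following Eqs. 1-3 (no masking or positional encoding).

    x:             (T, d_model) input sequence.
    W_K, W_Q, W_V: lists of (d_model, d_head) projection matrices, one per head.
    W_O:           list of (d_head, d_model) matrices, i.e. W_O split per head.
    """
    T, d_model = x.shape
    d_head = W_K[0].shape[1]
    y = torch.zeros(T, d_model)
    for W_k, W_q, W_v, W_o in zip(W_K, W_Q, W_V, W_O):
        K, Q, V = x @ W_k, x @ W_q, x @ W_v             # (T, d_head) each
        A = F.softmax(Q @ K.T / d_head ** 0.5, dim=-1)  # (T, T) attention matrix, Eq. 1
        y = y + A @ V @ W_o                             # per-head contribution, Eq. 3
    return y

# Example with random weights (hypothetical sizes).
T, d_model, n_heads, d_head = 8, 64, 4, 16
x = torch.randn(T, d_model)
W_K = [torch.randn(d_model, d_head) for _ in range(n_heads)]
W_Q = [torch.randn(d_model, d_head) for _ in range(n_heads)]
W_V = [torch.randn(d_model, d_head) for _ in range(n_heads)]
W_O = [torch.randn(d_head, d_model) for _ in range(n_heads)]
print(multi_head_attention(x, W_K, W_Q, W_V, W_O).shape)  # torch.Size([8, 64])
```

The explicit loop makes the per-head locality of Eq. 3 visible, which is the property SwitchHead exploits in the next section.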
2.2 From Dense to SwitchHead Attention Layer

Our goal is to obtain resource reductions while maintaining the fundamental properties of attention and retaining a fully expressive attention matrix. For that, we start from the following observation: modern LLMs use tens of heads [2, 27]. Are so many heads really necessary? As we show later in Sec. 3, naively reducing the number of heads (while keeping the same number of parameters by increasing the head dimension) indeed results in a performance loss. Explaining why so many heads are needed is beyond the scope of this paper. Nevertheless, here are some hypotheses: (1) they provide multiple inputs for the operations that the network performs in each step, (2) they are specialized and provide inputs only for specific operations (in this case, each operation would use a different subset of heads), (3) they may provide diverse outputs due to different initializations, some being more successful than others, thus enabling better learning. Among these, (2) and (3) may offer an opportunity for resource savings: if not all heads are needed at the same time, it might be possible to switch among them depending on the context.

One naive way to achieve this is to compute a gating signal with a linear projection $W_S \in \mathbb{R}^{d_{\text{model}} \times n_{\text{heads}}}$ and use the heads with the highest scores, replacing Eq. 3 with Eq. 6:

$s = \sigma(x W_S)$   (4)

$\mathcal{E} = \arg\mathrm{topk}(s, k), \quad \mathcal{E} \subset \{1, ..., n_{\text{heads}}\}$   (5)

$y[t, c] = \sum_{h \in \mathcal{E}} s[t, h] \left(A^h V^h W^h_O\right)[t, c]$   (6)

where $y[t, c] \in \mathbb{R}$ denotes the element of the output matrix $y \in \mathbb{R}^{T \times d_{\text{model}}}$ at timestep $t$ and channel $c$, and $k$ is the number of active experts. Following the σ-MoE method [17], we use a non-competitive selection function (the sigmoid σ in Eq. 4).

Now, let us define the source side of attention as the keys and values, and the destination side as the queries and output. Intuitively, the above method corresponds to choosing a subset of attention heads based on the destination side alone³. Our preliminary experiments confirmed that this method is indeed feasible for language modeling on WikiText-103. However, it is difficult to achieve acceleration and memory savings with it. To see why, notice that the entries of the attention matrix $A^h$ depend on pairs of tokens, one on the source and one on the destination side, but the choice is made based on the destination side only. Thus, in the worst case, a different source might be chosen for each destination, in which case all possible source projections have to be computed for the keys and values, which is exactly what we would like to avoid.

Instead, we propose to improve the method above by introducing conditional computation for the source and destination projections independently of each other. That is, we parameterize each of the key, query, value, and output projections by an independent MoE. This avoids conditional computation that involves the attention matrix itself. Our solution implements this using Mixtures of Experts (MoEs). The concept of a "head" is no longer well defined in the conventional sense: we redefine a head as an instance of a computed attention matrix and denote the total number of heads by $n_{\text{heads}}$. For each head $h$, we define a separate list of $E$ experts; the total number of experts is then $n_{\text{heads}} E$. The projection matrices become $W^{h,e}_K, W^{h,e}_Q, W^{h,e}_V \in \mathbb{R}^{d_{\text{model}} \times d_{\text{head}}}$ and $W^{h,e}_O \in \mathbb{R}^{d_{\text{head}} \times d_{\text{model}}}$, where $h$ denotes the head index and $e$ the specific expert.
Then, we compute the source-side expert selection as follows:

$s^h_S = \sigma(x W^h_S)$   (7)

$\mathcal{E}^h_S = \arg\mathrm{topk}(s^h_S, k), \quad \mathcal{E}^h_S \subset \{1, ..., E\}$   (8)

where $W^h_S \in \mathbb{R}^{d_{\text{model}} \times E}$. We compute the destination-side expert selection similarly: $s^h_D = \sigma(x W^h_D)$, $\mathcal{E}^h_D = \arg\mathrm{topk}(s^h_D, k)$, $\mathcal{E}^h_D \subset \{1, ..., E\}$, with $W^h_D \in \mathbb{R}^{d_{\text{model}} \times E}$. Then, the value projection $V^h$ is computed as a weighted sum of the selected experts:

$V^h = \sum_{e \in \mathcal{E}^h_S} s^h_S[e]\, x W^{h,e}_V$   (9)

The key and query projections are computed similarly: $K^h = \sum_{e \in \mathcal{E}^h_S} s^h_S[e]\, x W^{h,e}_K$ and $Q^h = \sum_{e \in \mathcal{E}^h_D} s^h_D[e]\, x W^{h,e}_Q$. The output projection also becomes an MoE:

$y = \sum_h \sum_{e \in \mathcal{E}^h_D} s^h_D[e]\, A^h V^h W^{h,e}_O$   (10)

As we will show, it is not necessary to make all projections MoEs. In Sec. 3.1, we show that keeping a single, head-specific copy of the query and key projections and reusing it for all experts is beneficial. We call this method SwitchHead. Essentially, SwitchHead significantly reduces the number of attention matrices that have to be computed ($n_{\text{heads}}$) by using multiple experts per head. Note that our method does not depend on the specific implementation of the attention core, allowing for easy experimentation and research. A schematic representation is shown in Figure 1.

Footnote 2: The number of MACs is a metric used in prior work [18]; unlike wall-clock time, it is independent of both the specific hardware and the implementation. For wall-clock-time measurements, see Sec. 3.7.

Footnote 3: To clarify, we allocate a routing function for each of the key/value/query projections; these routing functions belong to the source or destination side accordingly. Comparing Eq. 10 and Eq. 6, one can notice that the routing function in Eq. 6 effectively corresponds to what we define as destination-side routing in Eq. 10.
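To make the mechanism above concrete, here is a minimal, single-sequence PyTorch-style sketch of one SwitchHead layer in the configuration selected later in Sec. 3.1: MoE value and output projections (Eqs. 7-10), a single shared key and query projection per head, and sigmoid top-k routing. All class names, variable names, and initialization choices are ours; masking, positional encodings, batching, and the fused Triton kernel are omitted, and all experts are evaluated densely for readability. This is an illustrative sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchHeadAttention(nn.Module):
    """Sketch: n_heads attention matrices, each with E value/output experts, k of them active."""
    def __init__(self, d_model, n_heads, d_head, n_experts, k):
        super().__init__()
        self.n_heads, self.d_head, self.k = n_heads, d_head, k
        # Shared (non-MoE) key/query projections, one per head (Sec. 3.1).
        self.W_q = nn.Parameter(torch.randn(n_heads, d_model, d_head) * d_model ** -0.5)
        self.W_k = nn.Parameter(torch.randn(n_heads, d_model, d_head) * d_model ** -0.5)
        # Per-head pools of E experts for the value and output projections.
        self.W_v = nn.Parameter(torch.randn(n_heads, n_experts, d_model, d_head) * d_model ** -0.5)
        self.W_o = nn.Parameter(torch.randn(n_heads, n_experts, d_head, d_model) * d_head ** -0.5)
        # Non-competitive (sigmoid) routing: source side (values), destination side (outputs).
        self.sel_src = nn.Parameter(torch.randn(n_heads, d_model, n_experts) * d_model ** -0.5)
        self.sel_dst = nn.Parameter(torch.randn(n_heads, d_model, n_experts) * d_model ** -0.5)

    def _moe_proj(self, sel_in, proj_in, sel_weight, experts):
        # sel_in picks the experts (Eqs. 7-8); proj_in is what gets projected (Eqs. 9-10).
        s = torch.sigmoid(sel_in @ sel_weight)                 # (T, E) expert scores
        score, idx = s.topk(self.k, dim=-1)                    # (T, k) selected experts
        # Dense evaluation of every expert, for clarity only; a real kernel computes
        # just the k selected experts, which is where the compute savings come from.
        proj = torch.einsum('td,edo->teo', proj_in, experts)   # (T, E, d_out)
        picked = proj.gather(1, idx.unsqueeze(-1).expand(-1, -1, proj.shape[-1]))
        return (score.unsqueeze(-1) * picked).sum(dim=1)       # weighted sum, (T, d_out)

    def forward(self, x):
        # x: (T, d_model); no masking or positional encoding in this sketch.
        y = torch.zeros_like(x)
        for h in range(self.n_heads):
            Q = x @ self.W_q[h]                                             # (T, d_head)
            K = x @ self.W_k[h]
            V = self._moe_proj(x, x, self.sel_src[h], self.W_v[h])          # Eq. 9
            A = F.softmax(Q @ K.T / self.d_head ** 0.5, dim=-1)             # one matrix per head
            y = y + self._moe_proj(x, A @ V, self.sel_dst[h], self.W_o[h])  # Eq. 10
        return y

attn = SwitchHeadAttention(d_model=64, n_heads=2, d_head=32, n_experts=5, k=2)
print(attn(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```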
Table 1: Performance of SwitchHead compared to different MoA variants. MoA can outperform the baseline, but only at the price of significantly more compute and memory. SwitchHead also outperforms the baseline dense Transformer. These results are on WikiText-103; the table is sorted by model perplexity.

#total params | Model | n_heads | Perplexity | MACs | Mem (floats)
47M | SwitchHead | 2 | 12.27 | 170.4M | 0.8M
47M | Transformer | 10 | 12.31 | 453.4M | 3.5M
47M | MoA | 4 | 12.60 | 223.5M | 1.3M
47M | MoA | 6 | 12.64 | 306.8M | 1.9M
47M | MoA | 8 | 12.77 | 390.2M | 2.6M
47M | MoA | 2 | 12.84 | 140.1M | 0.7M
262M | MoA | 8 | 9.50 | 2.9G | 9.9M
262M | SwitchHead | 2 | 9.55 | 2.0G | 2.9M
262M | Transformer | 16 | 9.66 | 5.4G | 21.0M
262M | MoA | 12 | 9.68 | 4.1G | 14.7M
262M | MoA | 4 | 9.69 | 1.7G | 5.1M
262M | MoA | 2 | 9.87 | 1.1G | 2.7M

3 Experiments

We conduct our experiments in a parameter-matched setting [17], which better reflects the task of language modeling than the FLOPS-matched setting often used to evaluate MoEs. Our main experiments use Transformer XL, because we found it to consistently and significantly outperform RoPE-based baselines [28] for a fixed amount of compute. We provide the details of this analysis in Appendix A.4. The conclusions on the effectiveness of SwitchHead are consistent in both cases.

As an important specification, under this parameter-matched setting, we always configure SwitchHead such that it matches the perplexity of the baseline dense Transformer, and we maximize its resource reductions. For this, we follow a systematic procedure. First, we set $n_{\text{heads}} E$ to be the same as $n_{\text{heads}}$ of the dense baseline. We start with $n_{\text{heads}} = 2$ and $k = 2$, which provide the largest resource reductions. If the resulting model underperforms, we increase $k$. If $k = 4$ underperforms as well, we set $n_{\text{heads}} = 4$ and $k = 2$. We always set $d_{\text{head}}$ so that the total number of parameters of the resulting model matches that of the baseline.

This reasonably simple procedure ensures a good amount of resource savings while avoiding an expensive hyperparameter search. Note that all the perplexity gains seen in the main result tables are a byproduct of imperfect matching; our goal is to achieve reductions in resource requirements, unless noted otherwise (see Sec. 3.5). Detailed hyperparameters of all our models can be found in Sec. A.5 in the Appendix. We use and adapt the Triton kernel of σ-MoE [17] for our purposes.

For all datasets except the character-level Enwik8 [21], we use subword units [29, 30] obtained with a SentencePiece tokenizer [31] with a vocabulary size of 8k tokens. For most of our experiments, we use Transformer XL [32] with the context size being twice the size of the active/current chunk, because we found it to be significantly more resource-efficient than the standard setup. However, in order to show that our method is also competitive in the standard Transformer with RoPE positional encodings, we also demonstrate our main findings in this setup (Appendix A.4). All models are trained for 100k batches. Some of the datasets we consider (C4 [20] and peS2o [22]) are much larger; in this case, we train on the first $10^5 \cdot T \cdot N_{\text{batch}}$ tokens of the dataset.

3.1 Which Projections Require an MoE?

As discussed in Sec. 2.2, each linear projection (keys, values, queries, and output) can potentially be replaced independently by an MoE. Here, we first check which projections benefit from such a replacement. As we target the parameter-matched setting, using an MoE where it is not necessary can have a negative effect: since experts use a significant part of the parameter budget, they reduce the number of parameters available for the more useful parts of the model. Thus, we performed a search over all possible combinations of MoE versus fixed projections with two active heads and compared them to the parameter-matched baseline. We find that an MoE output projection is necessary to match the performance of the baseline (for detailed results, refer to Tab. 6 in the appendix). Having an MoE in the key and query projections turns out to be unnecessary. Models without the output and value MoE underperform the dense baseline with $n_{\text{heads}} = 2$ heads. In sum, the best-performing model is the one using an MoE for the value and output projections. We use this model variant in the rest of the experiments in this paper.

3.2 Comparison with MoA

The method most related to ours is the so-called Mixture of Attention Heads, or MoA [18]. Unlike SwitchHead, MoA uses a single key and value projection and chooses $n_{\text{heads}}$ active query and output projections from a pool of E experts. MoA computes the attention map for each selected expert and computes their weighted average after the attention computation takes place. In contrast, SwitchHead calculates the weighted average of the k selected experts before and after the attention computation. Because of this, in practice, the same perplexity is achieved with a much smaller number of computed attention matrices ($n_{\text{heads}}$) for SwitchHead than for MoA, allowing significant resource savings. Also, unlike MoA, SwitchHead uses a non-competitive activation function (sigmoid) [17]. We confirm that with this, our method performs well without any regularization, while MoA requires three different regularizers. We compare our method with MoA in Table 1.
While MoA can slightly outperform our method in terms of perplexity, it can only do so at the price of significantly more resource usage. Given a similar compute and memory budget, our method consistently outperforms MoA.

3.3 Performance on Different Datasets

We test our methods on a diverse set of language modeling datasets, including C4 [20], Enwik8 [21], and peS2o [22], at two different scales: 47M and 262M parameters. We chose this experimental setting taking into account our compute budget; our results are consistent across the various configurations. The results are shown in Table 2. We compare our models to two baselines: one with the same number of heads as the total number of experts ($n_{\text{heads}} E$) of the SwitchHead models, and one with the same number of heads as the number of active attention matrices ($n_{\text{heads}}$) of our models. Our models closely match the performance of the full, many-head baseline with a fraction of the memory and compute requirements (see Sec. 3.7 for more details).

Table 2: Performance of SwitchHead compared to baselines on different datasets and model sizes. The predictive performance of our SwitchHead model is comparable to the baselines and is always better than the baseline with an equal number of heads. Perplexity is shown for the WikiText-103, C4, and peS2o datasets, and bits/character (bpc) for Enwik8. Models are sorted by perplexity.

Dataset | #total params | Model | n_heads | ppl/bpc | MACs | Mem (floats)
C4 | 47M | SwitchHead | 2 | 22.53 | 203M | 0.8M
C4 | 47M | Transformer | 10 | 22.71 | 453M | 3.5M
C4 | 47M | Transformer | 2 | 23.71 | 453M | 1.4M
C4 | 262M | SwitchHead | 4 | 16.23 | 2.4G | 5.6M
C4 | 262M | Transformer | 16 | 16.28 | 5.4G | 21M
C4 | 262M | Transformer | 4 | 17.09 | 5.4G | 8.4M
WikiText-103 | 47M | SwitchHead | 2 | 12.31 | 170M | 0.8M
WikiText-103 | 47M | Transformer | 10 | 12.32 | 453M | 3.5M
WikiText-103 | 47M | Transformer | 2 | 12.73 | 453M | 1.4M
WikiText-103 | 262M | SwitchHead | 2 | 9.77 | 2.0G | 2.9M
WikiText-103 | 262M | Transformer | 16 | 9.80 | 5.4G | 21M
WikiText-103 | 262M | Transformer | 2 | 10.09 | 5.4G | 6.3M
peS2o | 47M | Transformer | 10 | 12.83 | 453M | 3.5M
peS2o | 47M | SwitchHead | 2 | 12.84 | 203M | 0.8M
peS2o | 47M | Transformer | 2 | 13.37 | 453M | 1.4M
peS2o | 262M | Transformer | 16 | 9.78 | 5.4G | 21M
peS2o | 262M | SwitchHead | 4 | 9.86 | 2.4G | 5.6M
peS2o | 262M | Transformer | 4 | 10.11 | 5.4G | 8.4M
Enwik8 | 41M | Transformer | 8 | 1.10 | 1.6G | 10M
Enwik8 | 41M | SwitchHead | 2 | 1.10 | 709M | 2.8M
Enwik8 | 41M | Transformer | 2 | 1.13 | 1.6G | 4.2M

In addition, we verify the performance of our models trained on C4 on downstream tasks in a zero-shot manner. We consider Lambada [24], BLiMP [25], and the Children's Book Test (CBT) [26]. The results are shown in Table 4: our SwitchHead models consistently outperform or match the performance of the baseline dense Transformer models.

3.4 SwitchAll

The goal of achieving more resource-efficient Transformers includes reducing the resource requirements of both the MLP and the attention layers. σ-MoE [17] was recently proposed as a parameter-efficient MoE method for accelerating the MLP layers. However, it remains unclear whether it can be efficiently combined with our SwitchHead, or whether there is some negative interaction effect when combining them in a "SwitchAll", where every layer is MoE-based. To verify this, we take the baseline architecture of Csordás et al. [17] without any hyperparameter change and replace the attention layer with SwitchHead. The hyperparameters for the attention are directly taken from the experiments shown in Tab. 2. The results are shown in Tab. 3. The combined, fully-MoE model often outperforms the dense baselines for each dataset and model size considered, except in the case of the 262M-parameter model on the C4 dataset.
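To illustrate how the two MoE components fit together in a SwitchAll block, the sketch below combines the SwitchHeadAttention class from the earlier sketch with a simplified σ-MoE-style feedforward layer (sigmoid routing, top-k experts). The pre-norm layout, initialization, shared k, and all names are our assumptions for illustration, not the exact SwitchAll configuration used in the experiments.

```python
import torch
import torch.nn as nn

class SigmaMoEMLP(nn.Module):
    """Simplified sigma-MoE-style feedforward layer: sigmoid routing, k active experts."""
    def __init__(self, d_model, d_expert, n_experts, k):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.w1 = nn.Parameter(torch.randn(n_experts, d_model, d_expert) * d_model ** -0.5)
        self.w2 = nn.Parameter(torch.randn(n_experts, d_expert, d_model) * d_expert ** -0.5)

    def forward(self, x):                                                 # x: (T, d_model)
        score, idx = torch.sigmoid(self.router(x)).topk(self.k, dim=-1)   # (T, k)
        out = torch.zeros_like(x)
        for j in range(self.k):                          # one selected expert slot at a time
            e = idx[:, j]                                # (T,) expert index per token
            h = torch.relu(torch.einsum('td,tde->te', x, self.w1[e]))
            out = out + score[:, j:j + 1] * torch.einsum('te,ted->td', h, self.w2[e])
        return out

class SwitchAllBlock(nn.Module):
    """One pre-norm Transformer block in which both attention and MLP are MoE-based.
    Assumes the SwitchHeadAttention sketch from Sec. 2.2 is in scope."""
    def __init__(self, d_model, n_heads, d_head, att_experts, mlp_experts, d_expert, k):
        super().__init__()
        self.attn = SwitchHeadAttention(d_model, n_heads, d_head, att_experts, k)
        self.mlp = SigmaMoEMLP(d_model, d_expert, mlp_experts, k)
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        return x + self.mlp(self.ln2(x))
```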
3.5 MAC-Matched Setup

All our experiments so far were calibrated so that the predictive performance (perplexity) matches that of the baseline Transformer, while aiming for maximum resource savings. However, it is also a valid question to ask how SwitchHead performs in a MAC-matched setup, where the compute requirements of our model are matched to those of the baseline. We achieve this by increasing $d_{\text{head}}$ and $n_{\text{heads}}$ until we reach the same MAC requirements as the baseline, which results in a model with more parameters. For the small Transformer XL, we increase $d_{\text{head}}$ from 76 to 112 and $n_{\text{heads}}$ from 2 to 3. For the large XL model, we increase $n_{\text{heads}}$ from 4 to 6 and $d_{\text{head}}$ from 112 to 168. For the small RoPE model, we change $n_{\text{heads}}$ from 2 to 3 and $d_{\text{head}}$ from 64 to 84, and for the big one $n_{\text{heads}}$ from 4 to 6 and $d_{\text{head}}$ from 112 to 168. We show the results in Tab. 4: the MAC-matched models outperform the others by a large margin, both in perplexity and in zero-shot task performance.

3.6 Shared Selection

For further time savings, we can share the expert selection between the source and destination sides. Acceleration is achieved by reducing the number of sorting and top-k steps compared to the full SwitchHead. However, this results in a minor performance loss, which might be tolerable in cases where the acceleration is more important. See Tab. 4 for more details.

3.7 Wall-Clock Time and Memory Usage Estimation

In all of our tables, we report the number of multiply-accumulate (MAC) operations, following Zhang et al. [18]. The reason for this is that the actual wall-clock time is highly implementation- and hardware-dependent. Nevertheless, we measured the runtime and total memory usage of our entire training pipeline (including the feedforward layers) to demonstrate that our current (suboptimal) implementation is already capable of providing wall-clock acceleration. We show the results in Tab. 5. The measurements are taken on identical hardware with the same implementation (including the attention core), the only difference being the MoE-based projections in the attention. At both scales, SwitchHead trains around 1.5 times faster while using 61-67% as much memory as the baseline.

We also report the performance of MoA for reference in Table 5. For measuring the resource usage of MoA, we chose the fastest MoA model that can match the performance of the dense baseline, or simply the best MoA model when no MoA model can match the baseline. This resulted in choosing MoA with H = 4 for the 47M model and MoA with H = 8 for the 262M-parameter model. SwitchHead outperforms MoA at both scales, in both wall-clock time and memory requirements. Note that these measurements also include the MLP layers, the optimizer, and the gradient synchronization in the case of multi-GPU training.

Table 3: Performance of SwitchAll (SwitchHead + σ-MoE [17]) on different datasets and model sizes. Our SwitchAll model is close to or better than the baselines. Models are sorted by perplexity. Note: we show the parameter count of the dense model; the parameter count of the big SwitchAll model is 259M because of the imperfect parameter matching.
Dataset | #total params | Model | n_heads | ppl | MACs | Mem (floats)
WikiText-103 | 47M | SwitchAll | 2 | 12.17 | 170M | 0.8M
WikiText-103 | 47M | Transformer | 10 | 12.32 | 453M | 3.5M
WikiText-103 | 262M | Transformer | 16 | 9.80 | 5.4G | 21M
WikiText-103 | 262M | SwitchAll | 4 | 9.81 | 2.4G | 5.6M
C4 | 47M | SwitchAll | 2 | 22.09 | 202M | 0.8M
C4 | 47M | Transformer | 10 | 22.63 | 453M | 3.5M
C4 | 262M | SwitchAll | 4 | 16.45 | 2.4G | 5.6M
C4 | 262M | Transformer | 16 | 16.58 | 5.4G | 21M
peS2o | 47M | SwitchAll | 2 | 12.56 | 202M | 0.8M
peS2o | 47M | Transformer | 10 | 12.83 | 453M | 3.5M
peS2o | 262M | Transformer | 16 | 9.78 | 5.4G | 21M
peS2o | 262M | SwitchAll | 4 | 9.86 | 2.4G | 5.6M

Table 4: Performance of SwitchHead trained on the C4 dataset, compared to a dense Transformer baseline with a matched number of parameters.

Model | #total params | ppl | Lambada | BLiMP | CBT
SwitchHead | 47M | 22.53 | 20.4% | 75.7% | -
Transformer | 47M | 22.71 | 20.4% | 73.6% | -
SwitchHead MAC-matched | 63M | 21.18 | 23.5% | 77.1% | -
SwitchHead Shared selection | 47M | 22.81 | 20.0% | 74.6% | -
SwitchHead | 262M | 16.23 | 29.4% | 79.6% | 83.3%
Transformer | 262M | 16.28 | 28.2% | 76.1% | 83.6%
SwitchHead MAC-matched | 376M | 15.43 | 30.2% | 79.4% | 84.2%
SwitchHead Shared selection | 262M | 16.49 | 28.6% | 79.4% | 82.7%

Table 5: Real-world resource usage of our method. The numbers shown are training times for the whole pipeline, including the feedforward layers. SwitchHead in the current implementation reduces both the runtime and the memory usage by a factor of 1.4-1.5.

Size | Model | ms/iteration | Rel. iter. time | RAM/GPU | Rel. Mem. | #GPUs | GPU type
47M | Transformer | 473 | 1.0 | 20.5G | 1.0 | 1 | RTX 3090
47M | SwitchHead | 342 | 0.72 | 13.5G | 0.65 | 1 | RTX 3090
47M | MoA | 412 | 0.87 | 15.3G | 0.75 | 1 | RTX 3090
262M | Transformer | 670 | 1.0 | 20.5G | 1.0 | 8 | V100
262M | SwitchHead | 442 | 0.65 | 12.5G | 0.61 | 8 | V100
262M | MoA | 851 | 1.27 | 16.4G | 0.80 | 8 | V100

4 Analysis

In order to see how the network uses the attention heads, we trained a small, 6-layer, 8-head Transformer on ListOps [33, 34]. The reason for this choice is that small, algorithmic tasks tend to be more interpretable than language modeling tasks. We also train a parameter-matched, 2-head SwitchHead model. Both models achieve around 95% accuracy on a held-out IID validation set, in contrast to the dense 2-head model, which saturates around 80%. Note that ListOps is a classification task and does not use autoregressive masking. We visualize the maximum over the attention heads of each layer, both for the standard Transformer (Fig. 2a) and SwitchHead (Fig. 2b). The attention maps are qualitatively similar. Due to different initializations and learning dynamics, the overlap between the two models is not perfect. Complete attention map visualizations can be found in Figs. 3 and 4 in the appendix.

In addition, we analyze individual attention heads of SwitchHead. We find that it is often possible to interpret the selection weights: on synthetic tasks, the output experts specialize according to different operations, while the input ones distinguish numbers and closing parentheses. The attention map itself appears to distribute information about contiguous chunks of numbers (see Fig. 5 in the appendix). Attention maps of the language models are more difficult to interpret. However, we visualize the attention maps of the 47M-parameter Transformer XL and SwitchHead models from Tab. 2 and find them to be qualitatively similar. We also identified induction heads [35] in both models; some examples are shown for SwitchHead in Fig. 6a and for the Transformer in Fig. 6b in the appendix. Other typical vertical-line attention patterns are shown in Fig. 6c and 6d.

5 Related Work

The method most closely related to ours is MoA [18], which introduces an MoE-style attention.
It defines each attention head as an expert but shares the key and value projections between them. Unlike in our SwitchHead, each of the selected experts requires a separate attention matrix, which significantly increases memory usage. Due to the use of a competitive, softmax-based activation function in the selection network, it requires complex regularization to prevent expert collapse [17]. In the original formulation, the number of active heads is high. Our experiments also confirm that MoA needs many attention heads to match the performance of the dense baseline (see Sec. 3.2), and it can only do so with a significantly higher resource budget than our method.

[Figure 2: Attention maps of (a) the standard Transformer (Layer 3) and (b) SwitchHead (Layer 3) on ListOps. The maximum over all heads in the given layer is shown.]

Nguyen et al. [36] analyze the attention matrices and conclude that they are usually low-rank. Motivated by this, the authors construct a few (e.g., 2) "global attention matrices" and compute each head-specific local matrix as a weighted average of those. However, they average the logits, not the final matrix, so each individual head-specific matrix still has to be computed. This means that, in the best case, they can only save half of the computation associated with the attention matrix, because the readout (Eq. 3) is still needed. For the same reason, the memory savings are also low. Peng et al. [19] propose to reweight the contribution of each head by a gating function. However, they only reduce the total number of attention heads by one, presumably to compensate for the parameters used by the selection logic. Their goal was not to reduce resource usage but to achieve better predictive performance, which they do. They use a softmax-based competitive selection mechanism; to avoid collapse, the gating function is trained only in some steps.

More broadly, there have been several works on MoEs for accelerating language models. Shazeer et al. [11] introduce the sparsely-gated mixture of experts. Fedus et al. [37] introduce Mixture of Experts in Transformers. Lepikhin et al. [13] train an MoE-based LLM, and Clark et al. [15] analyze the scaling laws of MoE models. Lewis et al. [12] introduce an alternative method for preventing collapse. However, none of these methods focus on the important parameter-matched setting. Csordás et al. [17] introduce the non-competitive-activation-based MoE method σ-MoE, which was shown to be successful in such a setting, but the authors only focus on accelerating the MLPs and not the attention.

Multi-query attention [38] uses a single key and value projection that is shared between the heads while using multiple queries. Our findings show that such a configuration is suboptimal: using multiple output and value projections is the most important choice in our model design. Dao et al. [39] provide a hardware-aware CUDA implementation of the entire attention layer that avoids storing the attention matrix. By saving memory bandwidth in this way, they achieve a significant wall-clock speedup, even though the attention matrix has to be recomputed in the backward pass. This is orthogonal to our method, and the two can be combined for further acceleration.
6 Limitations

Our models are modest in size compared to current state-of-the-art LLMs. However, training such models is estimated to cost millions of dollars, which we cannot afford. Instead, we aim to show the versatility of our model by choosing a diverse set of datasets, including Enwik8, WikiText-103, C4, and peS2o, and different positional encodings, such as the Transformer-XL-style relative positional encoding and RoPE. We also demonstrate the competitiveness of our models on zero-shot downstream tasks. We believe that the evidence we provide is enough for a research group with more resources at their disposal to verify our findings in a state-of-the-art model.

The Triton kernel that we use currently reaches around 60% of the speed of a single cuBLAS dense matrix multiplication of the size of a single expert. Even so, we showed wall-clock speedup. We estimate that 80-90% should be achievable with a more optimized kernel. Model-parallel training requires the implementation of a load-balancing system that can dynamically move experts between GPUs.

7 Conclusion

On a wide range of language modeling datasets and model sizes, our novel Mixture-of-Experts (MoE) based attention method, SwitchHead, achieves the performance of parameter-matched dense counterparts with only a fraction of the computational cost and memory usage. SwitchHead drastically reduces the number of attention matrices that have to be computed by using an MoE for the value and output projections. Our method is stable and does not need additional regularization to prevent degenerate solutions (a well-known practical issue in many existing MoE models). Our method can also be successfully combined with MoE MLP layers to obtain a "SwitchAll", where every layer of the Transformer is MoE-based, achieving a large reduction in resource requirements.

Acknowledgements

This research was partially funded by ERC Advanced grant no: 742870, project AlgoRNN, and by Swiss National Science Foundation grant no: 200021_192356, project NEUSYM. We are thankful for hardware donations from NVIDIA and IBM. The resources used for this work were partially provided by Swiss National Supercomputing Centre (CSCS) projects d123 and s1205.

References

[1] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
[2] Tom B. Brown et al. Language models are few-shot learners. In Proc. Advances in Neural Information Processing Systems (NeurIPS), Virtual only, December 2020.
[3] OpenAI. ChatGPT. https://openai.com/blog/chatgpt, 2022.
[4] OpenAI. GPT-4 technical report. Preprint arXiv:2303.08774, 2023.
[5] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott M. Lundberg, Harsha Nori, Hamid Palangi, Marco Túlio Ribeiro, and Yi Zhang. Sparks of artificial general intelligence: Early experiments with GPT-4. Preprint arXiv:2303.12712, 2023.
[6] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proc. Advances in Neural Information Processing Systems (NIPS), pages 5998-6008, Long Beach, CA, USA, December 2017.
[7] Jürgen Schmidhuber. Learning to control fast-weight memories: An alternative to recurrent nets. Neural Computation, 4(1):131-139, 1992.
[8] Georgi Gerganov. llama.cpp. https://github.com/ggerganov/llama.cpp, 2023.
[9] John B. Hampshire II and Alexander H.
Waibel. The meta-pi network: connectionist rapid adaptation for high-performance multi-speaker phoneme recognition. In Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pages 165 168, Albuquerque, New Mexico, USA, April 1990. [10] Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. Adaptive mixtures of local experts. Neural Compututaion, 3(1):79 87, 1991. [11] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In Int. Conf. on Learning Representations (ICLR), Toulon, France, April 2017. [12] Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, and Luke Zettlemoyer. BASE layers: Simplifying training of large, sparse models. In Proc. Int. Conf. on Machine Learning (ICML), volume 139, pages 6265 6274, Virtual only, July 2021. [13] Dmitry Lepikhin, Hyouk Joong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. GShard: Scaling giant models with conditional computation and automatic sharding. In Int. Conf. on Learning Representations (ICLR), Virtual only, May 2021. [14] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research (JMLR), 23(1):5232 5270, 2022. [15] Aidan Clark, Diego de Las Casas, Aurelia Guy, Arthur Mensch, Michela Paganini, Jordan Hoffmann, Bogdan Damoc, Blake A. Hechtman, Trevor Cai, Sebastian Borgeaud, George van den Driessche, Eliza Rutherford, Tom Hennigan, Matthew Johnson, Katie Millican, Albin Cassirer, Chris Jones, Elena Buchatskaya, David Budden, Laurent Sifre, Simon Osindero, Oriol Vinyals, Jack W. Rae, Erich Elsen, Koray Kavukcuoglu, and Karen Simonyan. Unified scaling laws for routed language models. Preprint ar Xiv:2202.01169, 2022. [16] Zewen Chi, Li Dong, Shaohan Huang, Damai Dai, Shuming Ma, Barun Patra, Saksham Singhal, Payal Bajaj, Xia Song, Xian-Ling Mao, Heyan Huang, and Furu Wei. On the representation collapse of sparse mixture of experts. In Proc. Advances in Neural Information Processing Systems (Neur IPS), New Orleans, Louisiana, USA, December 2022. [17] Róbert Csordás, Kazuki Irie, and Jürgen Schmidhuber. Approximating two-layer feedforward networks for efficient transformers. In Findings of the Association for Computational Linguistics: EMNLP 2023, November 2023. [18] Xiaofeng Zhang, Yikang Shen, Zeyu Huang, Jie Zhou, Wenge Rong, and Zhang Xiong. Mixture of attention heads: Selecting attention heads per token. In Proc. Conf. on Empirical Methods in Natural Language Processing (EMNLP), pages 4150 4162, Abu Dhabi, United Arab Emirates, December 2022. [19] Hao Peng, Roy Schwartz, Dianqi Li, and Noah A. Smith. A mixture of h - 1 heads is better than h heads. In Proc. Association for Computational Linguistics (ACL), pages 6566 6577, Virtual only, July 2020. [20] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research (JMLR), 21:140:1 140:67, 2020. [21] Marcus Hutter. The human knowledge compression prize. http://prize.hutter1.net, 2006. [22] Luca Soldaini and Kyle Lo. pe S2o (Pretraining Efficiently on S2ORC) Dataset. Technical report, Allen Institute for AI, 2023. https://github.com/allenai/pes2o. 
[23] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In Int. Conf. on Learning Representations (ICLR), Toulon, France, April 2017. [24] Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proc. Association for Computational Linguistics (ACL), Berlin, Germany, August 2016. [25] Alex Warstadt, Alicia Parrish, Haokun Liu, Anhad Mohananey, Wei Peng, Sheng-Fu Wang, and Samuel R. Bowman. Blimp: The benchmark of linguistic minimal pairs for english. Transactions of the Association for Computational Linguistics (TACL), 8:377 392, 2020. [26] Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. The goldilocks principle: Reading children s books with explicit memory representations. In Int. Conf. on Learning Representations (ICLR), San Juan, Puerto Rico, May 2016. [27] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLa MA: Open and efficient foundation language models. Preprint ar Xiv:2302.13971, 2023. [28] Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. Ro Former: Enhanced transformer with rotary position embedding. Preprint ar Xiv:2104.09864, 2021. [29] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Proc. Association for Computational Linguistics (ACL), pages 1715 1725, Berlin, Germany, August 2016. [30] Mike Schuster and Kaisuke Nakajima. Japanese and korean voice search. In Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pages 5149 5152, Kyoto, Japan, March 2012. [31] Taku Kudo and John Richardson. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proc. Conf. on Empirical Methods in Natural Language Processing (EMNLP), pages 66 71, Brussels, Belgium, October 2018. [32] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In Proc. Association for Computational Linguistics (ACL), pages 2978 2988, Florence, Italy, 2019. [33] Nikita Nangia and Samuel R. Bowman. List Ops: A diagnostic dataset for latent tree learning. In Proc. North American Chapter of the Association for Computational Linguistics on Human Language Technologies (NAACL-HLT), pages 92 99, New Orleans, USA, June 2018. [34] Róbert Csordás, Kazuki Irie, and Jürgen Schmidhuber. The neural data router: Adaptive control flow in transformers improves systematic generalization. In Int. Conf. on Learning Representations (ICLR), Virtual only, April 2022. [35] Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova Das Sarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam Mc Candlish, and Chris Olah. In-context learning and induction heads. Transformer Circuits Thread, 2022. https://transformer-circuits.pub/2022/in-context-learning-and-inductionheads/index.html. 
[36] Tan Nguyen, Tam Nguyen, Hai Do, Khai Nguyen, Vishwanath Saragadam, Minh Pham, Duy Khuong Nguyen, Nhat Ho, and Stanley J. Osher. Improving transformer with an admixture of attention heads. In Proc. Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, November 2022.
[37] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Preprint arXiv:2101.03961, 2021.
[38] Noam Shazeer. Fast transformer decoding: One write-head is all you need. Preprint arXiv:1911.02150, 2019.
[39] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Proc. Advances in Neural Information Processing Systems (NeurIPS), New Orleans, Louisiana, USA, December 2022.
[40] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Int. Conf. on Learning Representations (ICLR), San Diego, CA, USA, May 2015.

A.1 A Comment on FlashAttention

The resource reductions from FlashAttention might, in many cases, be larger than those from our method alone. However, FlashAttention depends on GPU-specific memory bandwidth/compute trade-offs, which might not be available on all hardware, especially on edge devices. SwitchHead and FlashAttention can also be combined for further speedups; we demonstrated the viability of this setup in our RoPE experiments. Additionally, certain architectures, such as shared-layer transformers, might require a drastic increase in the number of heads, which FlashAttention alone cannot provide.

A.2 Resource Usage of Different Methods

In this section, we discuss the compute and memory usage of different attention variants. We define compute in terms of the number of multiply-accumulate operations (MACs, also used by Zhang et al. [18]), which is arguably better defined than FLOPs (e.g., does one step of a matrix multiplication count as 1 FLOP or 2? Do we include the softmax?). All calculations are presented for a single attention layer and a single sequence, and they are presented this way in all our tables. Both the memory and compute requirements scale linearly with the batch size and the number of layers.

Consider a sequence of inputs of length $T$ with representation size $d_{\text{model}}$. Let $d_{\text{head}}$ be the width of the key, query, and value projections used for the attention layer. For Transformer XL-style attention, let the size of the context be $CT$, where $C - 1$ is the number of past chunks included in the context of the current attention step. We can divide the computation into two major parts: calculating the projections, which do not involve the attention map, and calculating the attention map and projecting the sequence of values using it.

First, consider the case of the standard Transformer XL [32]. Here, from the input $x \in \mathbb{R}^{T \times d_{\text{model}}}$, we calculate $K^h, Q^h, V^h \in \mathbb{R}^{T \times d_{\text{head}}}$ using projection matrices of shape $d_{\text{model}} \times d_{\text{head}}$. The output of the attention is projected in a similar manner (Eq. 3). Thus, the projections take a total of $4 T d_{\text{model}} d_{\text{head}}$ MACs per head. For backpropagation, we have to store all the intermediate results. This takes $T d_{\text{head}}$ numbers for each of $K^h$, $Q^h$, and $V^h$. The projected values must also be stored; they have an identical shape, so the total memory used by the projections is $4 T d_{\text{head}}$ numbers per head.

Now consider the resource usage related to the attention matrix.
It involves calculating the product $Q^h K^{h\top}$, which takes $d_{\text{head}} C T^2$ MACs (the factor $C$ appears because the shape of $K^h$ and $V^h$ for Transformer XL is $CT \times d_{\text{head}}$). Projecting the values with the attention matrix, $A^h V^h$, is similar. Regarding memory, the attention matrix requires $CT^2$ numbers, and it has to be stored both before and after the activation function. In addition, the projection of the position encodings has to be calculated. This depends on the implementation, but in our case it involves a matrix multiplication, with a total cost of $2 d_{\text{head}} d_{\text{model}} T C$ MACs and $2 d_{\text{head}} T C$ numbers of storage. Thus, the resource requirements are:

$N^{\text{XL}}_{\text{MAC}} = n_{\text{heads}} \left( 4 T d_{\text{head}} d_{\text{model}} + 2 C T^2 d_{\text{head}} + 2 C T d_{\text{head}} d_{\text{model}} \right)$   (11)

$N^{\text{XL}}_{\text{mem}} = n_{\text{heads}} \left( 4 T d_{\text{head}} + 2 C T^2 + 2 C T d_{\text{head}} \right)$   (12)

The resource usage of SwitchHead is different. First, the number of heads $n_{\text{heads}}$ is significantly reduced, but $d_{\text{head}}$ is typically larger. Additionally, there are $k$ experts active at the same time. Here, we only consider the case where the value and output projections are experts, but $Q^h$ and $K^h$ are not (this version performs best; see Sec. 3.1). Then, we have two projections that are identical to those of Transformer XL, and two MoE-based projections. These use $T k d_{\text{model}} d_{\text{head}}$ MACs to calculate the projections and another $T k d_{\text{head}}$ to calculate their weighted average. With a smart kernel implementation, memory usage is not affected by $k$; thus, the formula remains the same as Eq. 12 (note, however, that $n_{\text{heads}}$ and $d_{\text{head}}$ are very different in practice). The compute requirement can be calculated as:

$N^{\text{SwitchHead}}_{\text{MAC}} = n_{\text{heads}} \left( 2 T d_{\text{head}} d_{\text{model}} + 2 T k d_{\text{head}} (d_{\text{model}} + 1) + 2 C T^2 d_{\text{head}} + 2 C T d_{\text{head}} d_{\text{model}} \right)$   (13)

Additionally, the expert selection logic needs minimal resources, which can be ignored. Note that the comparison between the MACs of the standard attention (Eq. 11) and SwitchHead (Eq. 13) depends on the exact values of the hyperparameters. However, as shown in Sec. 3, in our typical configurations SwitchHead provides good predictive performance with a significantly lower $n_{\text{heads}}$ than the standard Transformer, resulting in reduced resource usage in the end.

The resource requirements of MoA [18] are very similar to those of Transformer XL, except that it uses a single key and value projection shared between the heads:

$N^{\text{MoA}}_{\text{MAC}} = (2 n_{\text{heads}} + 2) T d_{\text{head}} d_{\text{model}} + 2 n_{\text{heads}} C T^2 d_{\text{head}} + 2 C T d_{\text{head}} d_{\text{model}}$   (14)

$N^{\text{MoA}}_{\text{mem}} = (2 n_{\text{heads}} + 2) T d_{\text{head}} + 2 n_{\text{heads}} C T^2 + 2 C T d_{\text{head}}$   (15)
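For convenience, the helper below evaluates Eqs. 11, 13, and 14 exactly as written above, per attention layer and per sequence. The example numbers at the end are hypothetical and are not intended to reproduce the MAC values in the paper's tables, which depend on the exact model configurations.

```python
def transformer_xl_macs(T, d_model, d_head, n_heads, C=2):
    """Eq. 11: MACs of one Transformer XL attention layer for one sequence."""
    return n_heads * (4 * T * d_head * d_model
                      + 2 * C * T * T * d_head
                      + 2 * C * T * d_head * d_model)

def switchhead_macs(T, d_model, d_head, n_heads, k, C=2):
    """Eq. 13: MACs of one SwitchHead attention layer (MoE value/output, k active experts)."""
    return n_heads * (2 * T * d_head * d_model
                      + 2 * T * k * d_head * (d_model + 1)
                      + 2 * C * T * T * d_head
                      + 2 * C * T * d_head * d_model)

def moa_macs(T, d_model, d_head, n_heads, C=2):
    """Eq. 14: MACs of one MoA attention layer (shared key/value projection)."""
    return ((2 * n_heads + 2) * T * d_head * d_model
            + 2 * n_heads * C * T * T * d_head
            + 2 * C * T * d_head * d_model)

# Hypothetical configuration for illustration only (d_model is not taken from the paper).
T, d_model = 256, 512
print(transformer_xl_macs(T, d_model, d_head=41, n_heads=10))
print(switchhead_macs(T, d_model, d_head=76, n_heads=2, k=2))
print(moa_macs(T, d_model, d_head=41, n_heads=4))
```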
A.3 The Importance of Different Projections

In order to analyze which projections are the most important to be mixture-of-experts, we exhaustively tried all combinations. We analyze our 47M-parameter models on the WikiText-103 dataset and show the results in Tab. 6. We also include a parameter-matched baseline with two heads, which serves as a lower bound for the performance. We found that the value and output projections are the most important, and that making the key and query projections experts hurts performance. This is possible because we perform all our experiments in a parameter-matched setting: allocating parameters to these projections uses budget that could otherwise be spent on other parts of the network. In our preliminary experiments, we found that, if the parameter budget is allowed to increase, more experts always help.

Table 6: Performance of SwitchHead with E = 5 experts and n_heads = 2 heads. Different projections are either experts or fixed for the given head. Columns V, K, Q, and O show whether the given projection is an expert. Parameter-matched baselines with n_heads = 10 and n_heads = 2 are shown. Models are sorted by perplexity. 47M-parameter models on WikiText-103.

Model | n_heads | V | K | Q | O | Perplexity
SwitchHead | 2 | Y | N | N | Y | 12.27
SwitchHead | 2 | N | N | N | Y | 12.30
Transformer | 10 | - | - | - | - | 12.31
SwitchHead | 2 | N | Y | N | Y | 12.36
SwitchHead | 2 | Y | Y | N | Y | 12.37
SwitchHead | 2 | Y | N | Y | Y | 12.42
SwitchHead | 2 | Y | N | N | N | 12.45
SwitchHead | 2 | N | N | Y | Y | 12.45
SwitchHead | 2 | Y | N | Y | N | 12.51
SwitchHead | 2 | Y | Y | Y | Y | 12.57
SwitchHead | 2 | N | Y | Y | Y | 12.59
SwitchHead | 2 | Y | Y | Y | N | 12.61
SwitchHead | 2 | Y | Y | N | N | 12.69
Transformer | 2 | - | - | - | - | 12.74
SwitchHead | 2 | N | N | Y | N | 12.75
SwitchHead | 2 | N | Y | N | N | 12.79
SwitchHead | 2 | N | Y | Y | N | 12.90

A.4 RoPE Positional Encodings

All of our experiments in the main paper use a Transformer XL model. Thus, it remains unclear whether SwitchHead is specific to this model or can also be used with other attention methods. As an alternative, we consider RoPE positional encodings [28] without the XL cache (thus, the attention matrices are square). This is the standard setup used by modern language models, such as all versions of Llama [27]. We test these models on WikiText-103 and C4. The results are shown in Tab. 7, and the zero-shot performance on downstream tasks in Tab. 8. This shows that SwitchHead also performs well in the standard setup and is not tied to Transformer XL.

Table 7: Perplexity of SwitchHead compared to the dense baseline, using RoPE positional encoding and no XL cache. Memory usage is specified in number of floats. Models are sorted by perplexity.

Dataset | #total params | Model | n_heads | ppl | MACs | Memory
WikiText-103 | 45M | SwitchHead | 2 | 12.75 | 285.6M | 1.3M
WikiText-103 | 45M | Transformer | 10 | 12.78 | 560.9M | 6.1M
WikiText-103 | 45M | Transformer | 2 | 12.96 | 560.9M | 1.9M
WikiText-103 | 244M | SwitchHead | 4 | 10.00 | 4.2G | 18.4M
WikiText-103 | 244M | Transformer | 16 | 10.17 | 6.4G | 37.7M
WikiText-103 | 244M | Transformer | 2 | 10.26 | 6.4G | 8.4M
C4 | 45M | SwitchHead | 2 | 23.69 | 285.6M | 1.3M
C4 | 45M | Transformer | 10 | 23.79 | 560.9M | 6.1M
C4 | 244M | SwitchHead | 4 | 16.41 | 4.2G | 18.4M
C4 | 244M | Transformer | 16 | 16.35 | 6.4G | 37.7M

Table 8: Zero-shot task performance of SwitchHead using RoPE positional encodings and no XL cache, trained on the C4 dataset, compared to a dense Transformer baseline with a matched number of parameters.

Model | #total params | ppl | Lambada | BLiMP | CBT
SwitchHead | 45M | 23.69 | 20.9% | 77.3% | -
Transformer | 45M | 23.76 | 20.3% | 73.8% | -
SwitchHead MAC-matched | 54M | 22.18 | 22.6% | 77.4% | -
SwitchHead Shared selection | 45M | 23.63 | 20.3% | 76.0% | -
SwitchHead | 243M | 16.41 | 30.5% | 79.9% | 83.8%
Transformer | 243M | 16.35 | 29.8% | 76.1% | 83.9%
SwitchHead MAC-matched | 314M | 15.63 | 30.5% | 80.5% | 84.6%
SwitchHead Shared selection | 243M | 16.59 | 28.1% | 79.1% | 83.7%

A.5 Hyperparameters

We train all our models with the Adam optimizer [40], with a batch size of 64, a learning rate of 0.00025, and gradient clipping with a maximum norm of κ. Large models (>200M parameters) use a learning rate warm-up of 4k steps. All models, except the SwitchAll models, use dropout on the MLP layers: 0.1 for the small models and 0.2 for the large ones. Detailed hyperparameters are shown in Tab. 9. σ-MoE-related hyperparameters for the SwitchAll models are identical to those of Csordás et al. [17]. For Transformer XL models, we always use a single additional chunk of context, both at training and validation time. d_head and d_ff are derived in a systematic way; see Sec. 3 for more details.

A.6 A Note on the Parameter Count of the SwitchAll

It can be seen in Tab. 3 that the parameter count of the SwitchAll models is often less than that of their dense counterparts.
The reason is that we normally compensate for the final difference in the number of parameters by increasing d_ff (see Sec. 3 for details of the parameter matching). However, this can only be done in a very coarse-grained way with σ-MoE: the size of all experts must be increased at once, and the CUDA kernel supports only sizes that are multiples of 4. Therefore, increasing the size of the experts would add too many parameters and the model would outgrow the baseline. For this reason, we simply keep the hyperparameters of Csordás et al. [17] and combine them with our SwitchHead configuration from Tab. 2.

A.7 Visualizing All Attention Heads

As discussed in Sec. 4, we analyze the attention maps of SwitchHead and compare them with those of the dense models. We show all the attention maps of the models trained on ListOps in Fig. 3 and Fig. 4. We show individual heads of SwitchHead, including the expert selection scores, in Fig. 5. Some selected attention maps of our 47M-parameter models on WikiText-103 are shown in Fig. 6.

A.8 Compute Requirements

We report the compute used for our experiments, including the GPU type, the GPU count (the number of GPUs used per experiment, not the total in the machine), and the runtime in hh:mm format, in Tab. 10. We report the total number of CPUs (N_CPU) and RAM because they are shared between concurrent runs. Note that most of the experiments were done prior to the much faster, Triton-based kernel implementation; because of this, the runtimes appear longer for SwitchHead compared to the baseline. For timing benchmarks with our new kernel, see Tab. 5. Note that we only report the resources used for the paper here; we estimate that the total cost of the failed experiments and preliminary runs is around 10 times higher than this.

Table 9: Hyperparameters used for our models.

Model | Dataset | n_heads | #params | d_head | d_ff | E | k | T | n_layers | κ
SwitchHead | C4 | 2 | 47M | 76 | 2080 | 5 | 3 | 256 | 16 | 0.1
Transformer | C4 | 10 | 47M | 41 | 2053 | - | - | 256 | 16 | 0.1
Transformer | C4 | 2 | 47M | 205 | 2053 | - | - | 256 | 16 | 0.1
SwitchHead | C4 | 4 | 262M | 112 | 4188 | 4 | 2 | 512 | 18 | 0.25
Transformer | C4 | 16 | 262M | 64 | 4110 | - | - | 512 | 18 | 0.25
Transformer | C4 | 4 | 262M | 256 | 4110 | - | - | 512 | 18 | 0.25
SwitchHead | WikiText-103 | 2 | 47M | 76 | 2080 | 5 | 2 | 256 | 16 | 0.1
Transformer | WikiText-103 | 10 | 47M | 41 | 2053 | - | - | 256 | 16 | 0.1
Transformer | WikiText-103 | 2 | 47M | 205 | 2053 | - | - | 256 | 16 | 0.1
SwitchHead | WikiText-103 | 2 | 262M | 132 | 4147 | 8 | 4 | 512 | 18 | 0.25
Transformer | WikiText-103 | 16 | 262M | 64 | 4110 | - | - | 512 | 18 | 0.25
Transformer | WikiText-103 | 2 | 262M | 512 | 4110 | - | - | 512 | 18 | 0.25
SwitchHead | peS2o | 2 | 47M | 76 | 2080 | 5 | 3 | 256 | 16 | 0.1
Transformer | peS2o | 10 | 47M | 41 | 2053 | - | - | 256 | 16 | 0.1
Transformer | peS2o | 2 | 47M | 205 | 2053 | - | - | 256 | 16 | 0.1
SwitchHead | peS2o | 4 | 262M | 112 | 4188 | 4 | 2 | 512 | 18 | 0.25
Transformer | peS2o | 16 | 262M | 64 | 4110 | - | - | 512 | 18 | 0.25
Transformer | peS2o | 4 | 262M | 256 | 4110 | - | - | 512 | 18 | 0.25
SwitchHead | Enwik8 | 2 | 41M | 112 | 2088 | 4 | 2 | 512 | 12 | 0.25
Transformer | Enwik8 | 8 | 41M | 64 | 2053 | - | - | 512 | 12 | 0.25
Transformer | Enwik8 | 2 | 41M | 256 | 2053 | - | - | 512 | 12 | 0.25
SwitchHead (RoPE) | WikiText-103 | 2 | 45M | 64 | 2092 | 5 | 3 | 512 | 16 | 0.1
Transformer (RoPE) | WikiText-103 | 10 | 45M | 41 | 2053 | - | - | 512 | 16 | 0.1
SwitchHead (RoPE) | WikiText-103 | 4 | 243M | 100 | 4136 | 4 | 2 | 1024 | 18 | 0.25
Transformer (RoPE) | WikiText-103 | 16 | 244M | 64 | 4110 | - | - | 1024 | 18 | 0.25
SwitchAll | WikiText-103 | 2 | 47M | 76 | 1648 | 5 | 2 | 256 | 16 | 0.25
SwitchAll | WikiText-103 | 4 | 259M | 112 | 4096 | 4 | 2 | 512 | 18 | 0.25
SwitchAll | C4 | 2 | 47M | 76 | 1648 | 5 | 3 | 256 | 16 | 0.25
SwitchAll | C4 | 4 | 259M | 112 | 4096 | 4 | 2 | 512 | 18 | 0.25
SwitchAll | peS2o | 2 | 47M | 76 | 1648 | 5 | 3 | 256 | 16 | 0.25
SwitchAll | peS2o | 4 | 259M | 112 | 4096 | 4 | 2 | 512 | 18 | 0.25
Figure 3 (panels a-f: Layers 1-6): The maximum of all attention maps for a Switch Head model on List Ops.

Figure 4 (panels a-f: Layers 1-6): The maximum of all attention maps for a standard Transformer model on List Ops.

Figure 5 (panels a-l: Layers 1-6, heads 1-2): Details for individual heads of the Switch Head model on List Ops. On the left side of each attention plot, the selection of the output projection expert is shown. Similarly, at the bottom, the selection of the value projection expert is visible. In the selection maps, dark blue always corresponds to 1, while white is 0. The adaptive scale shown to the right of the attention map is for the map only.
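As a rough illustration (not the authors' released plotting code), the merged views in Figures 3 and 4 can be obtained by taking the elementwise maximum over heads of each layer's attention matrix. The sketch below assumes the per-layer attention probabilities have already been extracted from the model; the names attn_per_layer, tokens, and plot_max_attention are hypothetical and used only for this example.

```python
# Minimal, illustrative sketch of a "maximum of all attention maps" view.
# Assumes attention probabilities were already hooked out of the model as
# one (n_heads, T, T) array per layer; names here are placeholders.
import numpy as np
import matplotlib.pyplot as plt


def plot_max_attention(attn_per_layer, tokens):
    """attn_per_layer: list of (n_heads, T, T) arrays, one per layer."""
    n_layers = len(attn_per_layer)
    fig, axes = plt.subplots(1, n_layers, figsize=(3 * n_layers, 3))
    for i, (ax, attn) in enumerate(zip(np.atleast_1d(axes), attn_per_layer)):
        # Elementwise maximum over heads: a (query, key) cell is dark if
        # *any* head attends strongly between those two positions.
        merged = attn.max(axis=0)  # (T, T)
        ax.imshow(merged, cmap="Blues", vmin=0.0)
        ax.set_title(f"Layer {i + 1}")
        ax.set_xticks(range(len(tokens)))
        ax.set_xticklabels(tokens, rotation=90, fontsize=4)
        ax.set_yticks(range(len(tokens)))
        ax.set_yticklabels(tokens, fontsize=4)
    fig.tight_layout()
    return fig


# Toy usage with random attention for a 2-head, 6-layer model on 16 tokens:
T = 16
dummy = [np.random.dirichlet(np.ones(T), size=(2, T)) for _ in range(6)]
plot_max_attention(dummy, [str(t) for t in range(T)]).savefig("max_attn.png")
```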
Figure 6: Induction head copying the rare name "Homarus" in (a) Switch Head (Layer 12) and (b) the Transformer XL baseline (Layer 10). The attention matrix is square because it is the first chunk of the sequence, without any extra context. Typical vertical line ("stripe") pattern in (c) Switch Head (Layer 9) and (d) the Transformer XL baseline (Layer 8).

Table 10: Training hardware information for the experiments reported in the paper.

Model                #params  Dataset       G   GPU Type      NGPU  NCPU  RAM   Duration
Switch All           259M     C4            4   V100-32GB-LS  8     40    503G  24:06
Switch All           259M     pe S2o        4   V100-32GB-LS  8     40    503G  30:00
Switch All           259M     Wikitext 103  4   RTX 4090      4     24    251G  22:58
Switch All           47M      C4            2   RTX 3090      1     24    220G  22:14
Switch All           47M      pe S2o        2   RTX 3090      1     24    220G  22:49
Switch All           47M      Wikitext 103  2   RTX 3090      1     24    251G  6:03
Switch Head          243M     Wikitext 103  4   V100-32GB     4     40    503G  147:09
Switch Head          262M     C4            4   V100-32GB-LS  8     40    503G  26:38
Switch Head          262M     pe S2o        4   V100-32GB-LS  8     40    503G  27:43
Switch Head          262M     Wikitext 103  2   V100-32GB     4     40    503G  31:42
Switch Head          41M      Enwik8        2   V100-32GB     1     40    503G  13:45
Switch Head          45M      Wikitext 103  2   RTX 3090      1     24    251G  17:28
Switch Head          47M      C4            2   V100-32GB     1     40    503G  15:36
Switch Head          47M      pe S2o        2   V100-32GB     1     40    503G  16:17
Switch Head          47M      Wikitext 103  2   RTX 3090      1     24    251G  13:09
Transformer          262M     C4            4   V100-32GB     8     40    503G  11:55
Transformer          262M     C4            16  V100-32GB-LS  8     40    503G  20:21
Transformer          262M     pe S2o        4   V100-32GB     8     40    503G  17:08
Transformer          262M     pe S2o        16  V100-32GB-LS  8     40    503G  25:56
Transformer          262M     Wikitext 103  2   P100-16GB     8     12    62G   0:00
Transformer          262M     Wikitext 103  16  A100-80GB     2     64    503G  31:51
Transformer          41M      Enwik8        2   RTX 3090      1     24    220G  15:38
Transformer          41M      Enwik8        8   V100-32GB-LS  2     40    503G  16:04
Transformer          47M      C4            2   V100-32GB     1     40    503G  10:29
Transformer          47M      C4            10  V100-32GB     1     40    503G  16:57
Transformer          47M      pe S2o        2   V100-32GB     1     40    503G  11:07
Transformer          47M      pe S2o        10  V100-32GB     1     40    503G  17:55
Transformer          47M      Wikitext 103  2   V100-32GB     1     40    503G  10:06
Transformer          47M      Wikitext 103  10  V100-32GB     1     40    503G  18:51
Transformer (Ro PE)  244M     Wikitext 103  16  RTX 3090      4     24    251G  30:30
Transformer (Ro PE)  45M      Wikitext 103  10  V100-32GB     1     40    503G  15:30

Neur IPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: We summarized the motivation, method, and main findings in these sections.
Guidelines:
- The answer NA means that the abstract and introduction do not include the claims made in the paper.
- The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.
- The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.
- It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: We discuss the limitations in Sec. 6.
Guidelines:
- The answer NA means that the paper has no limitation, while the answer No means that the paper has limitations, but those are not discussed in the paper.
- The authors are encouraged to create a separate "Limitations" section in their paper.
- The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.
- The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.
- The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.
- The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.
- If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.
- While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [NA]
Justification: This is an empirical paper.
Guidelines:
- The answer NA means that the paper does not include theoretical results.
- All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.
- All assumptions should be clearly stated or referenced in the statement of any theorems.
- The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.
- Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.
- Theorems and Lemmas that the proof relies upon should be properly referenced.

4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: We show all the hyperparameter configurations in Appendix A.5, and we provide the code for our experiments.
Guidelines:
- The answer NA means that the paper does not include experiments.
- If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.
- If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.
- Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.
- While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example:
  (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.
  (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.
  (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).
  (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: We provide the code for our experiments. It automatically downloads all the data that it needs.
Guidelines:
- The answer NA means that the paper does not include experiments requiring code.
- Please see the Neur IPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).
- The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.
- The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.
- At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).
- Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: Our methodology is explained in Sec. 3, and the full table of hyperparameters is presented in Appendix A.5.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.
- The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [No]
Justification: Our experiments involve large models that are very expensive to train, and we do not have sufficient compute resources to run multiple seeds of them.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
- The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
- The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.).
- The assumptions made should be given (e.g., Normally distributed errors).
- It should be clear whether the error bar is the standard deviation or the standard error of the mean.
- It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
- For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g., negative error rates).
- If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: We report the type of hardware used for our main experiments in Appendix A.8.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The paper should indicate the type of compute workers (CPU or GPU, internal cluster, or cloud provider), including relevant memory and storage.
- The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
- The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).
9. Code of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
Answer: [Yes]
Justification: We have read the Ethics Guidelines and, to the best of our knowledge, we comply with them.
Guidelines:
- The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics.
- If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
- The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [Yes]
Justification: We consider our paper to be a foundational research paper without direct consequences.
Guidelines:
- The answer NA means that there is no societal impact of the work performed.
- If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
- Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: The models in this paper are small by modern standards and we do not release pre-trained weights.
Guidelines:
- The answer NA means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: Our code is under the MIT license and the paper is CC-BY 4.0. To the best of our knowledge, we always credit any code that we reuse.
Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [Yes]
Justification: We provide the source code and instructions on how to run it.
Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: We do not work with human subjects.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: We do not work with human subjects.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.