Published as a conference paper at ICLR 2025

LAYERWISE RECURRENT ROUTER FOR MIXTURE-OF-EXPERTS

Zihan Qiu (1), Zeyu Huang (2), Shuang Cheng (3), Yizhi Zhou (4), Zili Wang (5), Ivan Titov (2,6), Jie Fu (7)
(1) Alibaba Group, (2) University of Edinburgh, (3) ICT, Chinese Academy of Sciences, (4) Nanjing University, (5) INF Technology, (6) University of Amsterdam, (7) Shanghai AI Lab
qzh11628@gmail.com, zeyu.huang@ed.ac.uk, fujie@pjlab.org.cn

ABSTRACT

The scaling of large language models (LLMs) has revolutionized their capabilities in various tasks, yet this growth must be matched with efficient computational strategies. The Mixture-of-Experts (MoE) architecture stands out for its ability to scale model size without significantly increasing training costs. Despite their advantages, current MoE models often display parameter inefficiency. For instance, a pre-trained MoE-based LLM with 52 billion parameters might perform comparably to a standard model with 6.7 billion parameters (Rajbhandari et al., 2022). Being a crucial part of MoE, current routers in different layers independently assign tokens without leveraging historical routing information, potentially leading to suboptimal token-expert combinations and the parameter inefficiency problem. To alleviate this issue, we introduce the Layerwise Recurrent Router for Mixture-of-Experts (RMoE). RMoE leverages a Gated Recurrent Unit (GRU) to establish dependencies between routing decisions across consecutive layers. Such layerwise recurrence can be computed efficiently in parallel for input tokens and introduces negligible costs. Our extensive empirical evaluations demonstrate that RMoE-based language models consistently outperform a spectrum of baseline models. Furthermore, RMoE integrates a novel computation stage orthogonal to existing methods, allowing seamless compatibility with other MoE architectures.
Our analyses attribute RMoE's gains to its effective cross-layer information sharing, which also improves expert selection and diversity. Our code is at https://github.com/qiuzh20/RMoE.

1 INTRODUCTION

In the era of large language models (LLMs), scaling up model parameters and training data has unlocked remarkable model capabilities, such as in-context learning (Brown et al., 2020; Dong et al., 2022), nuanced conversations (Ouyang et al., 2022), and even complex code (Guo et al., 2024) and math (Imani et al., 2023) tasks. These advancements showcase the profound impact of increasing model size. The quest to enhance neural networks' capacity while ensuring training and inference efficiency has spurred the development of computation-efficient transformer architectures. The Mixture-of-Experts (MoE) framework is one such efficient architectural recipe (Shazeer et al., 2017; Lepikhin et al., 2021; Fedus et al., 2022; Zhang et al., 2022; Dai et al., 2024). Most MoE modules comprise one router and a group of expert networks. The router, usually parametrized as one linear layer, conditionally and sparsely assigns each input token to its corresponding experts, i.e., the Feed-Forward Networks (FFNs) in the transformer layer. Therefore, MoE can significantly scale the model size while keeping computational costs nearly unchanged (Smith et al., 2022). Despite efficiently increasing the model size, most current pre-trained MoE models are not on par with standard models of the same size, demonstrating their parameter inefficiency. For example, Rajbhandari et al. (2022) show that with the same training data, an MoE with 52B parameters and 1.3B activated ones for each token performs similarly to a 6.7B standard model.

Figure 1: Recurrent router for Mixture-of-Experts.
In the i-th layer, the hidden state x_i is (I) projected to x'_i with a lower hidden dimension (Eq. 4), (II) combined with the previous layer's GRU output h_{i-1} and processed through the cross-layer-shared GRU to produce the current layer's GRU output h_i (Eq. 5), and (III) fed to layer i's router, which uses this output to select experts and executes standard MoE computation (Eq. 6). Such an operation doesn't introduce sequence-level recurrence and can be implemented efficiently, as shown in Tab. 1 and Tab. 3.

Komatsuzaki et al. (2023) demonstrate that upcycling a standard T5-base (248M) into its MoE counterpart (2B) by copying the existing FFNs can bring some improvements, but it still lags behind the T5-large with 783M parameters. Similarly, Dai et al. (2024) use fine-grained and shared experts to improve effectiveness, but the 16B MoE performs only comparably with the 7B standard model (Bi et al., 2024). One potential bottleneck for current MoE could be the router. Typically, the router is parameterized as one lightweight linear layer, which may limit its capacity to explore the optimal token-expert combination. Previous works also reveal such limitations. For instance, Xue et al. (2024) find that the routing results converge to token-id-based routing very quickly during the early phase of pre-training, which means the token-expert combination is far from well-explored. Some works even show that hash functions (Roller et al., 2021), a stochastic routing policy (Zuo et al., 2021), and a fixed random router (Chen et al., 2023) achieve competitive performance with the learnable router, illustrating that the learnable router component in MoE needs further enhancement. Despite some enhancements to routers (Chi et al., 2022; Shen et al., 2023; Do et al., 2023; Chen et al., 2023), current routers in different MoE layers still operate independently, without comprehensive investigation into the decisions of other layers.
This isolation may lead to suboptimal expert utilization, as each layer manages its routing based solely on local information, potentially leading to parameter inefficiency. Though vanilla MoE models could technically share routing information via the hidden-state residual, this information may be overshadowed by the language modelling loss, requiring routing-relevant information to compete for its representation. To this end, we introduce a dedicated component to capture and pass routing information for each layer. The proposed architecture, the Recurrent Router for Mixture-of-Experts (RMoE), is shown in Fig. 1. Concretely, we regard routing decisions in consecutive layers as a sequence in which the routing results of the i-th layer should be conditioned on previous layers' decisions. We thus introduce a lightweight Gated Recurrent Unit (GRU) (Dey & Salem, 2017) to capture this dependence and simulate the information flow between routers across layers. Intuitively, a GRU has a reset and an update gate to control the information flow across time steps. Hence, such layerwise recurrence informs the router to which experts the current token was assigned in previous layers, potentially supporting cross-layer collaboration. Furthermore, the introduced GRU is dedicated to routing; it thus helps disentangle the states relevant to model prediction from those relevant to routing decisions. We validate RMoE's performance with various model sizes, architectures, datasets, and training settings (pre-training and supervised fine-tuning), demonstrating that RMoE outperforms a range of baselines. Moreover, RMoE's introduction of a novel computation stage during routing makes it orthogonal to and compatible with most existing methods. We further analyze RMoE and elucidate the primary contributors to its improvement. Our findings indicate that while the GRU in RMoE shares essential cross-layer information, it also enables additional gradient propagation for the router.
Our analysis shows that layerwise recurrence provides cross-layer information, fostering router exploration and optimizing expert utilization. Consequently, the selected experts are leveraged more effectively, leading to increased expert diversity. We believe that our innovative router design and extensive analysis can offer insights into the development of future MoE models.

2 RELATED WORKS: VARIOUS ROUTERS FOR MOE

In this section, we review previous approaches to improving router design in SMoE. For example, XMoE (Chi et al., 2022) first projects hidden states into a lower-dimensional space and computes their cosine similarity to low-dimensional expert embeddings, which can prevent the hidden states from collapsing into a linear combination of expert embeddings. ModuleFormer (Shen et al., 2023) uses an MLP router with ReLU activation to increase router capacity. SMoE-Dropout (Chen et al., 2023) utilizes a fixed, randomly initialized linear router and gradually increases the Top-k during training. HyperMoE (Do et al., 2023) introduces a fixed, randomly initialized hypernetwork (Ha et al., 2016) at each layer to generate router weights conditioned on the input and one learnable router embedding. One concurrent work (Gong et al., 2024) also introduces a GRU in sequential routing stages. However, it does not view such a recurrent mechanism as a general method composable with the broader MoE field, nor does it provide related ablations or analysis. Further discussion of related work on improving MoE through routing and training strategies, and on recurrent controllers, can be found in App. A.1.

3 METHODOLOGY

3.1 PRELIMINARIES

Mixture-of-Experts MoEs are typically implemented by replacing transformer models' original feed-forward networks (FFNs) with a group of parallel FFNs and incorporating a router. Suppose there are N experts, denoted as E_n, n ∈ [1, N].
The router g(·; G, k), defined by its parameters G ∈ R^{h×N} and an integer k, maps the input x to a score distribution over the experts, g(x; G, k) ∈ R^N. Given x ∈ R^h, the output y ∈ R^h is the weighted sum of the outputs from all experts:

y = Σ_n g_n(x; G, k) E_n(x)    (1)

Typically, g is a simple linear layer followed by a softmax and a Top-k function. The n-th element of x^T G ∈ R^N represents the gating score of expert E_n, and the n-th column of G can be regarded as the expert embedding for expert E_n. When the k in Top-k is smaller than N, only a subset of experts is involved in the computation, which is known as Sparse Mixture-of-Experts (SMoE) (Shazeer et al., 2017; Fedus et al., 2022).

Recurrent Neural Networks RNNs (Medsker et al., 2001) are designed to handle sequential data by maintaining a hidden state h that holds the information from previous time steps. This hidden state is updated at each time step i based on the current input x'_i and the hidden state at the last time step h_{i-1}, formulated as h_i = f(h_{i-1}, x'_i). The Gated Recurrent Unit (GRU) (Dey & Salem, 2017) module is an advanced variant of RNNs that addresses traditional RNNs' limitations, such as difficulty capturing long-term dependencies and vanishing gradients. Given an input x'_i at time step i, the GRU first calculates the reset gate s_i and the update gate z_i to determine how much of the previous memory to keep and to forget:

s_i = σ(W_s x'_i + U_s h_{i-1}),    z_i = σ(W_z x'_i + U_z h_{i-1})    (2)

where σ represents the sigmoid activation function and all W and U are trainable parameters. The hidden state h_i is then updated by

h̃_i = tanh(W_h x'_i + s_i ⊙ (U_h h_{i-1})),    h_i = (1 − z_i) ⊙ h̃_i + z_i ⊙ h_{i-1}    (3)

3.2 LAYERWISE RECURRENT ROUTER

Existing routers work independently; this lack of global information may prevent routers from discovering more effective token-expert combinations. Therefore, we integrate a GRU into the routing process, explicitly incorporating historical routing information into the current expert selection for each token.
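As a concrete illustration, the GRU update in Eqs. 2–3 can be sketched in a few lines of NumPy. This is a minimal sketch with an assumed random initialization, not the paper's implementation:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class GRUCell:
    """Minimal GRU cell following Eqs. 2-3 (reset gate s, update gate z)."""
    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(hidden_dim)
        # W* project the input x'_i; U* project the previous state h_{i-1}
        self.Ws, self.Wz, self.Wh = (
            rng.uniform(-scale, scale, (hidden_dim, input_dim)) for _ in range(3))
        self.Us, self.Uz, self.Uh = (
            rng.uniform(-scale, scale, (hidden_dim, hidden_dim)) for _ in range(3))

    def step(self, x, h_prev):
        s = sigmoid(self.Ws @ x + self.Us @ h_prev)              # reset gate (Eq. 2)
        z = sigmoid(self.Wz @ x + self.Uz @ h_prev)              # update gate (Eq. 2)
        h_tilde = np.tanh(self.Wh @ x + s * (self.Uh @ h_prev))  # candidate state (Eq. 3)
        return (1.0 - z) * h_tilde + z * h_prev                  # interpolation (Eq. 3)
```

With z close to 1, the cell copies the previous state forward; with z close to 0, it is overwritten by the candidate state, which is how the gates trade off keeping and forgetting.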
Formally, at the i-th layer, we first use a linear layer to project the hidden state x_i to the dimension of the GRU state, x'_i ∈ R^p (usually smaller than the dimension h of x_i; we choose p = 128 for most settings and provide further analysis in Tab. 6 and Tab. 7):

x'_i = Proj_i(x_i)    (4)

Importantly, we use separate projectors for each layer, since the hidden states x of different layers vary greatly (more discussion in Sec. 5). This projection output x'_i, along with the GRU output from the previous layer, h_{i-1}, is then fed into a GRU unit to obtain the current GRU output h_i:

h_i = GRU(x'_i, h_{i-1})    (5)

Next, h_i is input to the router, and the expert outputs are aggregated based on the router output:

y_i = Σ_n g_n(h_i; G_i, k) E_n(x_i)    (6)

Here, y_i represents the output of the i-th layer, h_i is the GRU output, and g_n(h_i; G_i, k) is the router output computed with routing parameters G_i in layer i. Notice that, unlike traditional RNNs, which use a single shared projector for sequential inputs when the input dimension doesn't equal the RNN's hidden dimension, we use different projectors Proj_i in Eq. 4 for different layers, since hidden states and model weights in different layers usually vary a lot (Fig. 11 and Tab. 6). Besides capturing inter-layer dependencies between routers in different layers, RMoE potentially has other advantages: (1) Preventing representation collapse: Chi et al. (2022) identified that single-linear-layer routers encourage token embeddings to cluster around expert embeddings, implying a trend toward representation collapse, and they proposed XMoE, which first projects hidden states into a low dimension and then calculates the gating score. Similarly, the projector (Eq. 4) and GRU (Eq. 5) in RMoE also separate hidden states from expert embeddings and can reduce this issue. (2) Additional gradient flow: before the inclusion of the GRU, the router's gradients mainly derive from the expert weight score g_n in Eq. 1.
The introduction of the GRU not only provides enriched information about historical routing but also an extra path of gradient propagation through the GRU hidden states. We denote this extra gradient flow as the Recurrent Gradient, and we empirically demonstrate that this Recurrent Gradient is important to RMoE. (3) Applicability with other MoE designs: since the proposed method introduces an additional computation stage into SMoE, it is orthogonal to most existing attempts to improve MoE and is seamlessly compatible with them.

4 EXPERIMENTS

4.1 EXPERIMENTAL SETTINGS

Language Modeling Tasks and Metrics Following Pham et al. (2024), we first test on two common language modeling tasks: Enwiki8 (character-level language modeling, with Bits-Per-Character (BPC) as the evaluation metric) and WikiText-103 (word-level language modeling, with Perplexity (PPL) as the evaluation metric). We employ the default train-validation-test splits for each dataset. We report the test performance of the best validation checkpoints. More details can be found in App. A.2.

Configurations and Baselines We compare RMoE with other existing router designs. All methods are based on the decoder-only standard Switch-Transformer architecture with post-norm. Following Pham et al. (2024), all routers select the top-2 experts from 16 experts. Each task is trained on 2 NVIDIA A100 GPUs for about 20 hours. More training configurations can be found in App. A.2. Our baselines include (1) SMoE: standard Switch-Transformers with a standard linear router. (2) HyperMoE (Do et al., 2023): this method employs a fixed, randomly initialized hypernetwork (Ha et al., 2016) to produce the weights for the linear router, subsequently allowing the generated linear layer to perform the routing. (3) SMoE-MLP (Shen et al., 2023): it replaces the linear router with a two-layer MLP using the GELU activation function.
(4) Random MoE: inspired by SMoE-Dropout (Chen et al., 2023) and HyperMoE, we propose to compare with a fixed, randomly initialized linear router; this is a naive baseline for all learnable routers. (5) XMoE (Chi et al., 2022): it first down-projects the token embeddings to a lower dimension (default 16) and computes their cosine similarity with the low-dimensional expert embeddings. It also uses a learnable temperature in the softmax. (6) CosineMoE: similar to XMoE but without down-projection.

Table 1: Performance of RMoE and baselines on two language modeling tasks, Enwiki8 and WikiText-103. Params means the non-embedding model parameters and (router parameters). Notice we don't separate unlearnable parameters in HyperMoE and Random SMoE. Mem means the peak GPU memory usage with the same batch-size configurations. Speed is the average time for 1k training steps. Results demonstrate that RMoE outperforms baseline models and achieves memory usage and speed comparable to the standard SMoE.

Algorithm     Enwiki8 (BPC)     WikiText (PPL)     Params (M)    Mem (GB)  Speed (s/1k steps)
              val      test     val       test
SMoE          1.152    1.128    31.279    33.061   36.08 (0.04)  47.92     960.2
HyperMoE      1.162    1.139    31.918    33.374   48.41 (12.4)  49.69     962.0
SMoE-MLP      1.164    1.137    31.430    33.142   36.79 (0.75)  48.70     964.1
Random SMoE   1.163    1.135    31.938    33.410   36.08 (0.04)  47.72     961.4
CosineMoE     1.148    1.122    31.367    33.047   36.08 (0.04)  48.68     962.4
XMoE          1.150    1.125    31.265    32.926   36.13 (0.09)  48.70     967.5
RMoE          1.141    1.116    30.939    32.867   36.51 (0.47)  49.46     972.9

Pre-Training and SFT Paradigm As pre-training-then-supervised-fine-tuning has become the standard paradigm, we also evaluate RMoE in this setting. We conduct a preliminary scale-up experiment in a setting of training 0.91B models with 40B tokens. Our pre-training corpus is a multilingual data collection that spans common and specialized domains, including Wikipedia, finance,
Our model architecture is modified based on Llama family (Touvron et al., 2023). Specifically, we use a 24-layer model and top-4 gating from 16 experts per layer following (Dai et al., 2024). This yields a model with approximately 0.53B activated / 0.91B total parameters. All different routers use the same training configurations. To ensure expert load balance, we employ balance loss with weights 0.01 during training. These experiments are conducted using the Megablocks (Gale et al., 2023) on 8 NVIDIA A100 GPUs for about 5 days. More details can be found in App. A.2. After pertaining, we perform supervised fine-tuning (sft). All models are trained on Alpaca (Taori et al., 2023) with the same configuration. We use lm-evaluation-harness1 to evaluate the fine-tuned model. To simulate the real LLMs application scenario, we don t perform task-specific fine-tuning and evaluation. Since the models are largely under-trained, they give almost random-guessing results on challenging tasks like MMLU (Hendrycks et al., 2020). Therefore, we only test on tasks (ARC-easy, Hellaswag, PIQA, Sci Q, LAMBADA) in lm-evaluation-harness. More details about sft configurations and tasks can be found in App. A.2. We further justify the scalability of RMo E on the setting of training 15B activate 2.7B models with 120B / 400B tokens. Given our utilization of a high-quality pre-training corpus, pre-training on 400B tokens yields better results compared to experimental Mo E like Open Mo E (Xue et al., 2024). We find RMo E consistently provides over a one-point improvement in performance on benchmarks such as MMLU, GSM8K, and Human Eval. More details can be found in App. A.3. 4.2 MAIN RESULTS Table 2: Performance of combining layer-wise recurrent routing mechanism with XMo E. 
Algorithm       Enwiki8 (BPC)     WikiText (PPL)
                val      test     val     test
XMoE (8)        1.160    1.132    31.74   33.55
+ GRU router    1.150    1.124    31.34   32.99
XMoE (16)       1.150    1.125    31.27   32.93
+ GRU router    1.144    1.119    31.15   32.47
XMoE (32)       1.140    1.114    31.30   32.71
+ GRU router    1.136    1.112    31.25   32.55

Tab. 1 shows the performance of RMoE and selected baselines on two language modelling tasks. Our observations are as follows: (1) RMoE performs best on the validation and test sets of both tasks, and the recurrent routing mechanism and the introduction of the extra GRU block do not severely impact training speed or memory usage, making RMoE practical. (2) Comparing SMoE-MLP and SMoE, we find that replacing the original simple linear layer with a more capable MLP does not improve performance. It even underperforms fixed random routing (Random MoE) on Enwiki8, suggesting that naively increasing model capacity can't result in a more powerful router. Furthermore, since RMoE introduces a novel computation stage in routing and is orthogonal to most existing router designs, it can easily be combined with them. Tab. 2 showcases the performance of the original XMoE and XMoE with a GRU router for different XMoE lower dimensions (8, 16, and 32). We observe that the GRU router benefits all three configurations of XMoE. While previous work on improving routers has mostly not been evaluated on large-scale pre-training (Dai et al., 2022; Chi et al., 2022; Do et al., 2023), we scale up RMoE to billions of parameters and training tokens. We report SMoE's and RMoE's evaluation results (both directly evaluated and evaluated after supervised fine-tuning (sft)) in Tab. 3. Existing works suggest freezing the router during SMoE tuning (Zoph et al., 2022); we report SMoE's results under both freeze and unfreeze settings. Correspondingly, for RMoE, we freeze the GRU and the linear layer under the freeze setting. From Tab.
3, we can observe that (1) even in large-scale pre-training that requires more complex parallel training strategies, RMoE brings negligible wall-time and memory cost compared with vanilla SMoE. (2) In comparable settings (e.g., the same number of tokens and with/without sft), RMoE outperforms SMoE, and even the best results of SMoE are lower than those of RMoE.

¹ https://github.com/EleutherAI/lm-evaluation-harness

Table 3: SMoE's and RMoE's pre-training costs and evaluation results on selected informative lm-evaluation-harness tasks. sft means supervised fine-tuning on the Alpaca dataset. The task names and metrics for the short names in the table are: ARC-e for ARC-Easy, acc; Hella for Hellaswag, acc-norm; Piqa for PIQA, acc-norm; Lamb for LAMBADA, acc. Each model has approximately 0.53B activated parameters out of 0.91B parameters. RMoE introduces about 3.5M additional parameters relative to SMoE.

Algorithm / Training         ARC-e   Hella   Piqa    Sciq   Lamb    Avg
SMoE (Speed: 48.87 s/step, Mem: 48.00 GB)
  pre-train 20B tokens       47.14   35.51   64.69   76.2   14.61   47.63
  +sft                       50.93   35.82   65.61   74.7   17.81   48.97
  +sft (freeze router)       50.59   35.78   66.32   74.7   18.18   49.11
  pre-train 40B tokens       52.57   40.85   67.74   83.4   26.74   54.26
  +sft                       53.70   42.07   68.61   83.5   32.80   56.13
  +sft (freeze router)       53.45   41.94   68.88   83.1   32.06   55.88
RMoE (Speed: 49.07 s/step, Mem: 48.69 GB)
  pre-train 20B tokens       47.01   35.91   65.23   78.7   19.13   49.20
  +sft                       48.53   36.90   66.21   79.6   24.74   51.20
  +sft (freeze router)       49.24   36.79   66.16   79.7   24.32   51.24
  pre-train 40B tokens       51.18   41.38   67.79   83.6   32.58   55.31
  +sft                       53.20   43.05   68.55   83.8   37.16   57.15
  +sft (freeze router)       53.11   43.16   68.77   82.8   37.57   57.08

5 ABLATION STUDIES

Table 4: Enwiki8 validation and test BPC for different routing designs. NP stands for not passing recurrent states cross-layer. RMoE+NP has the same parameters and FLOPs as RMoE.
Algorithm     Small                      Medium
              Val     Test   Params (M)  Val     Test   Params (M)
SMoE          1.214   1.184  15.32       1.152   1.128  36.08
SMoE + MLP    1.214   1.183  15.73       1.164   1.137  36.79
RMoE + NP     1.227   1.196  15.61       1.150   1.123  36.51
RMoE          1.213   1.183  15.61       1.141   1.116  36.51

Which contributes more: more router parameters or layerwise recurrence? A straightforward explanation for RMoE's improvement could be that RMoE introduces additional computation and parameters. To disentangle the effect of introducing more router parameters from that of layerwise recurrence, we consider the following two extra settings: (1) SMoE+MLP: we naively increase the router parameters by replacing the original linear layer with a larger MLP; (2) RMoE + NP: we change Eq. 5 to GRU(x'_i, h_0) to cancel the layerwise recurrence of RMoE, rendering a stateless GRU. This setting has the same parameters and computation as RMoE. From Tab. 4, we can observe that (1) in our setting, introducing larger routers in SMoE doesn't bring improvement (SMoE vs. SMoE + MLP). (2) When the layerwise recurrence in RMoE is ablated, performance drops substantially, even below SMoE. Both results suggest that the layerwise recurrence is the main contributor.

Table 5: Enwiki8 validation and test BPC. "detach h_{i-1}" means detaching the recurrent hidden states before passing them to the next block. r-0.5/1.0 means passing the routing logits of the previous block to the current block. detach-r means detaching the gradient computation of the passed logits.

Algorithm                  Val     Test
SMoE                       1.152   1.128
RMoE                       1.141   1.116
+ NP                       1.150   1.123
+ detach h_{i-1}           1.159   1.133
+ NP + r-0.5               1.149   1.124
+ NP + r-1.0               1.150   1.124
+ NP + r-0.5 + detach-r    1.157   1.133
+ NP + r-1.0 + detach-r    1.152   1.126

Recurrent Gradient is important to RMoE Following the aforementioned analysis, we further disentangle the effect of the layerwise recurrence.
When removing the layerwise recurrence as in the RMoE + NP setting, we remove two information flows across layers: (1) the forward information about previous routers' decisions and (2) the backward gradient propagation through GRU hidden states in different layers. To compare the two information flows, we investigate the following settings: (1) RMoE + detach h_{i-1}: an intermediate stage between RMoE and RMoE-NP. By detaching h_{i-1} to stop its gradient computation in Eq. 5, each GRU cell can only use previous information during the forward pass. (2) RMoE + NP + r-α: inspired by Realformer (He et al., 2020), which introduces a residual attention score to facilitate attention gradient back-propagation, we investigate an intermediate stage between RMoE and RMoE-NP by adding a gating-logits residual to the RMoE + NP setting. Concretely, the gating score of the i-th layer for expert n is g_n(h_i; G_i, k) + α g_n(h_{i-1}; G_{i-1}, k). It is a straightforward way to supplement router information across layers based on the NP setting. In our experiments, we set α to 0.5 and 1.0. (3) Moreover, we also test detaching the gradient computation of the passed logits (h_{i-1}^T G_{i-1}), denoted as detach-r. From Tab. 5, RMoE + detach h_{i-1} performs even worse than RMoE-NP, showing that the Recurrent Gradient is important. Similarly, NP+r-0.5 and NP+r-1.0 are comparable with NP, showing that the naive gating-score residual can't provide effective cross-layer information. The performance of their detached versions drops substantially, demonstrating the importance of the extra gradient passing.

Figure 2: Test BPC on Enwiki8 with different model sizes (6, 12, 18, 24, 32 layers) for SMoE, RMoE, RMoE-NP, and RMoE-NP-r0.5. Similar validation results are in App. A.5, Fig. 14.

To further validate the gradient-passing hypothesis, we test NP and NP-r0.5/1.0 on deeper models. The results are summarized in Fig. 2.
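The NP + r-α gating-score residual described above can be sketched as follows. This is a simplified illustration: `residual_gate` is a hypothetical name, and the softmax-only gating (omitting the Top-k mask) is an assumption, not the paper's exact code:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def residual_gate(h_i, G_i, h_prev, G_prev, alpha):
    """NP + r-alpha sketch: the current layer's gating scores plus the previous
    layer's scores scaled by alpha (a Realformer-style logits residual)."""
    return softmax(h_i @ G_i) + alpha * softmax(h_prev @ G_prev)
```

Detaching the residual term (detach-r) would correspond to stopping gradients through the `softmax(h_prev @ G_prev)` factor in an autodiff framework.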
As the number of layers increases, we can observe that (1) RMoE consistently outperforms the other settings, and RMoE-NP even lags behind SMoE. A possible reason is that, without passing recurrent states, RMoE-NP is similar to SMoE-MLP, which simply increases router complexity but doesn't refine router training. (2) RMoE-NP-r0.5 surpasses RMoE-NP, further emphasizing that SMoE's optimization benefits from the added gradient flow for routers. The spirit echoes the principles behind residual networks, where residual connections create direct paths for gradient propagation, thereby mitigating gradient vanishing as layers deepen. Similarly, the GRU and the direct logits passing help the gradient flow of routers in deep layers. As shown in Fig. 2, as the number of layers increases, the performance gaps between them become more significant. (3) While providing additional gradients across layers, RMoE-NP-r0.5 underperforms RMoE. This may be because the indexes of experts in layer i are not aligned with those in other layers; directly adding logits can impose improper constraints and hurt model performance, further highlighting that RMoE adds flexible yet informative pathways to the SMoE framework.

Table 6: Ablation of the RMoE design. L-proj means the layerwise projector in Eq. 4; S-proj is the standard RNN projector. SMoE + L-proj + GRU router is our proposed RMoE method.

Algorithm                       Enwiki8 (BPC)     WikiText (PPL)    Params
                                Val      test     val      test     (M)
SMoE                            1.152    1.128    31.28    33.06    36.08
SMoE + L-proj + GRU router      1.141    1.116    30.93    32.86    36.50
SMoE + S-proj + GRU router      1.148    1.123    31.15    33.02    36.23
SMoE + L-proj + RNN router      1.145    1.119    31.18    32.72    36.44
SMoE + L-proj + LSTM router     1.148    1.122    31.19    33.04    36.54

Table 7: Ablation of the recurrent design in the large-scale pre-training setting. p is the dimension of the recurrent state x'_i in Eq. 4. We report averaged task results (the same tasks as Tab. 3) for pre-trained and sft models.
All models are trained with 20B tokens.

Algorithm              Pretrain   +sft
SMoE                   47.63      49.11
RMoE (GRU, p = 128)    49.20      51.32
RMoE (GRU, p = 256)    49.08      50.04
RMoE (GRU, p = 512)    49.19      50.02
RMoE (RNN, p = 256)    47.92      50.44

Layerwise projector and a suitable recurrent net bring the best results. This part tests the other components of RMoE, such as the recurrent hidden state dimension, the layerwise projector, and the GRU cell. As shown in Tab. 6: (1) All methods with recurrent routers outperform SMoE. (2) The layerwise projector in Eq. 4 performs better than standard RNNs using a single shared projector. One possible reason is that the weight and hidden-state norms in different layers vary greatly (as shown in App. A.4.5, Fig. 11), and it would be hard for a single shared projector to process them all. This approach aligns with the design principle of not sharing LayerNorm parameters when employing shared MoE transformer blocks, as discussed by Xue et al. (2022). (3) The GRU router performs best. Moreover, we further compare RMoE variants in the larger-scale settings. We compare pre-trained models with different structures and recurrent hidden dimensions in Tab. 7 (averaged results; full results in App. A.5, Tab. 12). We find similar results: (1) All RMoE variants outperform SMoE; (2) the simpler router (RNN) and more complex routers (GRU with p = 256, 512) perform worse. In short, a layerwise projector and a moderate recurrent cell (e.g., GRU with p = 128) effectively introduce layerwise recurrence.
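To make the layerwise design concrete, the routing pipeline of Eqs. 4–6 with per-layer projectors (L-proj) and a shared recurrent cell can be sketched as below. The function names and the toy `gru_step` argument are illustrative assumptions, not the released code:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def layerwise_recurrent_route(x_layers, projs, gru_step, G_layers, k, p):
    """Sketch of Eqs. 4-6: layer-specific projectors, one shared recurrent cell,
    and per-layer routers. Returns (Top-k expert indices, renormalized gating
    scores) for every layer."""
    h = np.zeros(p)                      # initial recurrent state h_0
    decisions = []
    for x, P, G in zip(x_layers, projs, G_layers):
        xp = P @ x                       # Eq. 4: x'_i = Proj_i(x_i)
        h = gru_step(xp, h)              # Eq. 5: h_i = GRU(x'_i, h_{i-1})
        scores = softmax(h @ G)          # router scores from the GRU output
        topk = np.argsort(scores)[-k:]   # Eq. 6: only Top-k experts compute
        decisions.append((topk, scores[topk] / scores[topk].sum()))
    return decisions
```

Here `gru_step` can be any recurrent cell with state dimension p; the ablations above suggest a GRU with p = 128 and layer-specific `projs` work best.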
Figure 3: Heat maps of cross-layer mutual information (MI) for different methods: (a) SMoE, (b) XMoE, (c) HyperMoE, (d) RMoE, (e) RMoE-NP, (f) RMoE-NP-r1.0. The (i-th row, j-th column) value represents the MI between layers i and j. First row ((a) SMoE, (b) XMoE, (c) HyperMoE): all three methods have low cross-layer MI. Second row ((d) RMoE, (e) RMoE-NP, (f) RMoE-NP-r1.0): RMoE has high cross-layer MI; when layerwise recurrent state passing is disabled, MI drops substantially.

6 OBSERVATIONS

Layerwise recurrence increases cross-layer mutual information. The intuition behind the proposed RMoE is that current routers in different layers are isolated, and the layerwise GRU is incorporated to provide routers with global information for coordination.
Therefore, we measure the Mutual Information (MI) between the routing distributions of different layers in Fig. 3 (the code can be found in App. A.4.2). We observe: (1) Apart from RMoE, all existing methods show low cross-layer MI, indicating that the routers of different layers work relatively independently. (2) RMoE shows higher MI than the three baselines (d vs. a, b, and c) and than RMoE-NP (d vs. e), showing that the recurrent router facilitates cross-layer information sharing. (3) While RMoE-NP's MI is much lower than RMoE's, it still surpasses the three baseline methods; the reason may be the shared GRU in Eq. 5. (4) Intuitively, passing routing logits can directly increase MI (f vs. e). However, directly passing logits cannot ensure long-range information sharing: the values in the right part of (f), which indicate the MI between non-neighboring layers, are smaller than those in (d).

RMoE enables moderately flat gating scores. The router's gating score is a noteworthy feature of MoE-based models: it reflects the model's training dynamics and how the model ultimately exploits its experts. Ideally, the training of an MoE model proceeds in two stages, exploration and then exploitation, i.e., the router should actively explore new expert combinations at the early stage of learning. If the gating score converges to a sharp distribution too early, the router will learn a very shallow routing mechanism and fail to find optimal routing decisions. We therefore record the gate entropy for each token, -∑_n g_n ln g_n (where g_n is the gating score for expert n), and plot the entropy distribution in Fig. 4 (left). Generally, the higher the entropy, the more evenly the router activates different experts rather than allowing one expert to dominate the layer; thus, a large density in the high-entropy region means many recorded tokens have flat gating score distributions. We observe that (1) Random MoE, with a fixed randomly initialized router, shows the largest gate entropy.
Moreover, most tokens have high entropy, as there is only a single peak, located at a large entropy value. This indicates that while Random MoE strongly encourages exploration, its router may be under-trained and lack exploitation. (2) SMoE and HyperMoE show low routing entropy, with many tokens having nearly zero entropy. Such low entropy means the softmax operation produces nearly one-hot outputs: the Top-k experts degrade to Top-1 and the router's gradients are very sparse. This can hurt the exploration of expert selection and lead to inefficient Top-k expert usage. (3) XMoE and CosineMoE, which use cosine similarity (normalizing the input and the weights G before computing logits), show relatively high entropy. They also perform better than SMoE in Tab. 1, indicating the benefits of suitable exploration. (4) RMoE, with its unique cross-layer information sharing, has high entropy for many tokens and low entropy for a few. These moderate gating scores can achieve a better balance between exploration and exploitation. One may argue that such high entropy comes from an under-trained recurrent router in RMoE rather than from capturing the dependency across layers, as the unlearnable Random MoE also gives high entropy. Therefore, we further visualize the scores of RMoE-NP and RMoE-NP-r0.5/1.0 in Fig. 4 (right). We observe: (1) RMoE-NP's entropy is slightly larger than SMoE's but much smaller than RMoE's, indicating that the larger entropy in RMoE comes not from under-training but from cross-layer information sharing. (2) RMoE-NP-r0.5's entropy lies between SMoE's and RMoE's, while RMoE-NP-r1.0's is the largest; from Tab. 5 and Fig. 2, both the smaller- and larger-entropy variants underperform RMoE. These results further demonstrate that the recurrent network achieves a moderately flat gating score distribution, leading to a better trade-off between exploration and exploitation.
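The per-token gate entropy statistic used above follows directly from the gating scores. A minimal sketch (variable names are ours, not the paper's code):

```python
import numpy as np

def gate_entropy(gating_scores, eps=1e-12):
    """Per-token entropy -sum_n g_n ln g_n of the softmax gating distribution.

    gating_scores: (n_tokens, n_experts), rows sum to 1.
    Near-zero entropy means a one-hot router (Top-k degrades to Top-1);
    entropy ln(n_experts) means a uniform, maximally flat router.
    """
    g = np.clip(gating_scores, eps, 1.0)
    return -(g * np.log(g)).sum(axis=-1)

sharp = np.array([[0.99, 0.005, 0.005]])  # near one-hot: entropy close to 0
flat = np.full((1, 3), 1.0 / 3.0)         # uniform: entropy = ln(3)
```

A histogram of `gate_entropy` over a test set gives exactly the kind of density plot shown in Fig. 4.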
Figure 4: Gate score entropy distributions over the Enwiki8 test set for different router configurations (left: SMoE, RMoE, HyperMoE, Random MoE, CosineMoE, XMoE; right: SMoE, RMoE, RMoE-NP, RMoE-NP-r0.5, RMoE-NP-r1.0). More similar results can be found in App. A.4.4 Fig. 8 and Fig. 9.

Table 8: Expert score balance on Enwiki8. Inner Balance (IB) is the (top-1 score / top-2 score) ratio, and Outer Balance (OB) is the sum of the selected gate scores.

Algorithm       IB      OB
SMoE            34.68   0.915
HyperMoE        34.60   0.920
CosineMoE       7.611   0.794
XMoE            19.93   0.861
Random MoE      2.000   0.414
RMoE            2.021   0.573
+ NP            16.58   0.842
+ NP + r-0.5    2.792   0.661
+ NP + r-1.0    1.147   0.212

We also examine the statistics of the selected experts' scores. We calculate (1) the Inner Balance (IB), defined as the ratio Top-1 score / Top-2 score; a large IB means the first expert dominates the selected experts; and (2) the Outer Balance (OB), defined as ∑_{k∈Top-k} g_k, i.e., the share of the gate score distribution held by the selected experts; a large OB means the selected experts' scores dominate the gate score distribution. Because such ratios can take extreme values, we report the median over all tokens in Tab. 8. We observe: (1) Random MoE, with a fixed router, shows the lowest IB and OB. (2) The low-entropy models of the previous section (Sec. 6) have high IB and OB. (3) RMoE gives moderate IB and OB: while simply using a complex router (RMoE-NP) already yields relatively low IB and OB, RMoE's are even lower. Moreover, passing logits reduces IB and OB further (RMoE+NP+r-0.5/1.0). All these experiments show that sharing cross-layer router information leads to more balanced routing decisions and thus facilitates expert usage.

Layerwise recurrence reduces the negative effect of the load-balance constraint.
To provide a more direct analysis of the router gradient, we investigate how the gradient norm of the router varies throughout training. When training an MoE model, the gradient of the router has two separate sources: (1) the language modeling (LM) loss, and (2) the load balancing (LB) loss that pushes the router to assign tokens to different experts in a balanced manner.

Figure 5: Expert similarity distribution across layers during large-scale pre-training. We plot box plots of expert similarity (maximum, minimum, first quartile, median, and mean) across the 24 layers of the model, from checkpoints taken every 1k training steps (approximately 4B tokens), for both SMoE and RMoE, together with the similarity at random initialization.

We empirically find that: (1) The LB loss dominates the training of the linear router at the early training stage. This could hurt the model's general performance: as Wang et al. (2024) find, a strong LB loss yields a balanced token distribution but reduces performance. (2) In contrast, the gradient of the RNN router from the LB loss stabilizes in the early stage, while the gradient from the LM loss keeps decreasing, suggesting that the RNN router is optimized more toward the LM loss. These observations suggest the recurrent router can effectively control the influence of the LB loss. More details can be found in App. A.4.1.

Layerwise recurrence encourages expert diversity. One intriguing feature of MoE is that experts can modularly specialize in different inputs. Therefore, following recent works that analyze FFNs (Geva et al., 2021; Qiu et al., 2024a;b) and expert weight similarity (Wu et al., 2022; Lo et al., 2024), we use the cosine similarity of experts' parameters to measure expert diversity.
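As a sketch of this metric (the paper's exact computation is in App. A.4.3; the function below is our hedged approximation with hypothetical names), expert similarity can be taken as the average pairwise cosine similarity of flattened expert weights:

```python
import numpy as np

def mean_expert_similarity(expert_weights):
    """Average pairwise cosine similarity between flattened expert weight arrays.

    expert_weights: list of arrays, one per expert (e.g. one layer's FFN weights).
    Lower values indicate more diverse (less redundant) experts.
    """
    flat = np.stack([w.ravel() for w in expert_weights]).astype(float)
    flat /= np.linalg.norm(flat, axis=1, keepdims=True)  # unit-normalize each expert
    sim = flat @ flat.T                                  # cosine similarity matrix
    n = len(expert_weights)
    off_diag = sim[~np.eye(n, dtype=bool)]               # drop self-similarity
    return off_diag.mean()

rng = np.random.default_rng(0)
experts = [rng.normal(size=(64, 16)) for _ in range(4)]
s_random = mean_expert_similarity(experts)            # random init: near zero
s_cloned = mean_expert_similarity([experts[0]] * 4)   # identical experts: 1.0
```

Randomly initialized high-dimensional experts have near-zero similarity, which is the dashed reference line in Fig. 5.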
We compute this for SMoE and RMoE in the large-scale pre-training setting; the results are shown in Fig. 5. To convey the scale of the similarity score, we also plot a dashed line showing the similarity of randomly initialized experts. More details about the similarity calculation and its interpretation can be found in App. A.4.3. We observe that: (1) At the beginning of training, the lowest expert similarities are close to those at random initialization. (2) The expert similarity increases in the early training stage and then decreases. This may be because the randomly initialized router essentially assigns tokens randomly to different experts early on, which increases expert similarity; as the router continues to learn, it gradually assigns specific tokens to the corresponding experts, so expert similarity decreases as training progresses. (3) Throughout training, the average similarity between experts in RMoE is lower than in SMoE, indicating that RMoE encourages more diverse experts. This expert diversity also corresponds well to the moderately flat gate scores in Sec. 6.

7 CONCLUSION

This work introduces a layerwise recurrent router for existing MoE-based language models. We validate the effectiveness of this layerwise recurrence across various settings, tasks, and model sizes. By adding a new yet efficient computation stage to the routing, RMoE is orthogonal to most existing methods and can be flexibly integrated with them. Ablation studies reveal that this recurrent mechanism offers additional recurrent gradients, aiding router optimization. Further analysis validates our intuition that the GRU facilitates inter-layer information sharing. We also systematically compare RMoE's model behavior with various baseline models, demonstrating that RMoE can enhance existing SMoE methods and providing insights for future research.
ACKNOWLEDGMENT

Jie Fu is supported by Shanghai Artificial Intelligence Laboratory.

REFERENCES

Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. DeepSeek LLM: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024.

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. PIQA: Reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020.

Tianlong Chen, Zhenyu Zhang, Ajay Jaiswal, Shiwei Liu, and Zhangyang Wang. Sparse MoE as the new dropout: Scaling dense and self-slimmable transformers. arXiv preprint arXiv:2303.01610, 2023.

Zewen Chi, Li Dong, Shaohan Huang, Damai Dai, Shuming Ma, Barun Patra, Saksham Singhal, Payal Bajaj, Xia Song, Xian-Ling Mao, Heyan Huang, and Furu Wei. On the representation collapse of sparse mixture of experts. In NeurIPS, 2022.

Peter Clark, Isaac Cowhey, Oren Etzioni, Daniel Khashabi, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.

Damai Dai, Li Dong, Shuming Ma, Bo Zheng, Zhifang Sui, Baobao Chang, and Furu Wei. StableMoE: Stable routing strategy for mixture of experts. arXiv preprint arXiv:2204.08396, 2022.
Damai Dai, Chengqi Deng, Chenggang Zhao, RX Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y Wu, et al. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066, 2024.

Rahul Dey and Fathi M Salem. Gate-variants of gated recurrent unit (GRU) neural networks. In 2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS), pp. 1597–1600. IEEE, 2017.

Yifeng Ding, Jiawei Liu, Yuxiang Wei, Terry Yue Zhuo, and Lingming Zhang. XFT: Unlocking the power of code instruction tuning by simply merging upcycled mixture-of-experts. arXiv preprint arXiv:2404.15247, 2024.

Giang Do, Khiem Le, Quang Pham, Trungtin Nguyen, Thanh-Nam Doan, Binh T. Nguyen, Chenghao Liu, Savitha Ramasamy, Xiaoli Li, and Steven Hoi. HyperRouter: Towards efficient training and inference of sparse mixture of experts. arXiv preprint arXiv:2312.07035, 2023.

Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. A survey on in-context learning. arXiv preprint arXiv:2301.00234, 2022.

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. J. Mach. Learn. Res., 23:120:1–120:39, 2022.

Trevor Gale, Deepak Narayanan, Cliff Young, and Matei Zaharia. MegaBlocks: Efficient sparse training with mixture-of-experts. Proceedings of Machine Learning and Systems, 5, 2023.

Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November 2021, pp. 5484–5495. Association for Computational Linguistics, 2021.
Zhuocheng Gong, Ang Lv, Jian Guan, Junxi Yan, Wei Wu, Huishuai Zhang, Minlie Huang, Dongyan Zhao, and Rui Yan. Mixture-of-modules: Reinventing transformers as dynamic assemblies of modules. arXiv preprint arXiv:2407.06677, 2024.

Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y Wu, YK Li, et al. DeepSeek-Coder: When the large language model meets programming - the rise of code intelligence. arXiv preprint arXiv:2401.14196, 2024.

Suchin Gururangan, Mike Lewis, Ari Holtzman, Noah A Smith, and Luke Zettlemoyer. DEMix layers: Disentangling domains for modular language modeling. arXiv preprint arXiv:2108.05036, 2021.

Suchin Gururangan, Margaret Li, Mike Lewis, Weijia Shi, Tim Althoff, Noah A Smith, and Luke Zettlemoyer. Scaling expert language models with unsupervised domain discovery. arXiv preprint arXiv:2303.14177, 2023.

David Ha, Andrew Dai, and Quoc V Le. HyperNetworks. arXiv preprint arXiv:1609.09106, 2016.

Ruining He, Anirudh Ravula, Bhargav Kanagal, and Joshua Ainslie. RealFormer: Transformer likes residual attention. arXiv preprint arXiv:2012.11747, 2020.

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.

Quzhe Huang, Zhenwei An, Nan Zhuang, Mingxu Tao, Chen Zhang, Yang Jin, Kun Xu, Liwei Chen, Songfang Huang, and Yansong Feng. Harder tasks need more experts: Dynamic routing in MoE models. arXiv preprint arXiv:2403.07652, 2024.

Shima Imani, Liang Du, and Harsh Shrivastava. MathPrompter: Mathematical reasoning using large language models. arXiv preprint arXiv:2303.05398, 2023.

Aran Komatsuzaki, Joan Puigcerver, James Lee-Thorp, Carlos Riquelme Ruiz, Basil Mustafa, Joshua Ainslie, Yi Tay, Mostafa Dehghani, and Neil Houlsby. Sparse upcycling: Training mixture-of-experts from dense checkpoints.
In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023.

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. GShard: Scaling giant models with conditional computation and automatic sharding. In Proceedings of the 2021 International Conference on Learning Representations (ICLR), 2021.

Bo Li, Yifei Shen, Jingkang Yang, Yezhen Wang, Jiawei Ren, Tong Che, Jun Zhang, and Ziwei Liu. Sparse mixture-of-experts are domain generalizable learners. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023.

Margaret Li, Suchin Gururangan, Tim Dettmers, Mike Lewis, Tim Althoff, Noah A Smith, and Luke Zettlemoyer. Branch-train-merge: Embarrassingly parallel training of expert language models. arXiv preprint arXiv:2208.03306, 2022.

Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Junwu Zhang, Munan Ning, and Li Yuan. MoE-LLaVA: Mixture of experts for large vision-language models. arXiv preprint arXiv:2401.15947, 2024.

Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.

Ka Man Lo, Zeyu Huang, Zihan Qiu, Zili Wang, and Jie Fu. A closer look into mixture-of-experts in large language models. arXiv preprint arXiv:2406.18219, 2024.

Larry R Medsker, Lakhmi Jain, et al. Recurrent neural networks. Design and Applications, 5(64-67):2, 2001.

Xiaonan Nie, Xupeng Miao, Shijie Cao, Lingxiao Ma, Qibin Liu, Jilong Xue, Youshan Miao, Yi Liu, Zhi Yang, and Bin Cui. EvoMoE: An evolutional mixture-of-experts training framework via dense-to-sparse gate. arXiv preprint arXiv:2112.14397, 2021.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.

Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016.

Hieu Pham, Melody Guan, Barret Zoph, Quoc Le, and Jeff Dean. Efficient neural architecture search via parameters sharing. In International Conference on Machine Learning, pp. 4095–4104. PMLR, 2018.

Quang Pham, Giang Do, Huy Nguyen, Trung Tin Nguyen, Chenghao Liu, Mina Sartipi, Binh T Nguyen, Savitha Ramasamy, Xiaoli Li, Steven Hoi, et al. CompeteSMoE: Effective training of sparse mixture of experts via competition. arXiv preprint arXiv:2402.02526, 2024.

Joan Puigcerver, Carlos Riquelme, Basil Mustafa, and Neil Houlsby. From sparse to soft mixtures of experts. arXiv preprint arXiv:2308.00951, 2023.

Zihan Qiu, Zeyu Huang, and Jie Fu. Unlocking emergent modularity in large language models. In Kevin Duh, Helena Gomez, and Steven Bethard (eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Mexico City, Mexico, June 2024a. Association for Computational Linguistics.

Zihan Qiu, Zeyu Huang, Youcheng Huang, and Jie Fu. Empirical study on updating key-value memories in transformer feed-forward layers. arXiv preprint arXiv:2402.12233, 2024b.

Samyam Rajbhandari, Conglong Li, Z. Yao, Minjia Zhang, Reza Yazdani Aminabadi, A. Awan, Jeff Rasley, and Yuxiong He.
DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale. arXiv, abs/2201.05596, 2022.

Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017.

Stephen Roller, Sainbayar Sukhbaatar, Jason Weston, et al. Hash layers for large sparse models. Advances in Neural Information Processing Systems, 34:17555–17566, 2021.

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.

Yikang Shen, Zheyu Zhang, Tianyou Cao, Shawn Tan, Zhenfang Chen, and Chuang Gan. ModuleFormer: Learning modular large language models from uncurated data. CoRR, abs/2306.04640, 2023.

Samuel L. Smith, Ananya Kumar Ram, James Bradbury, Sharan Narang, Jared Casper, Matthew Johnson, Anselm Levskaya, John Schulman, Jascha Sohl-Dickstein, and Barret Zoph. Using megablocks to scale language model training. In International Conference on Machine Learning, pp. 20275–20291. PMLR, 2022.

Sainbayar Sukhbaatar, Olga Golovneva, Vasu Sharma, Hu Xu, Xi Victoria Lin, Baptiste Rozière, Jacob Kahn, Daniel Li, Wen-tau Yih, Jason Weston, et al. Branch-train-mix: Mixing expert LLMs into a mixture-of-experts LLM. arXiv preprint arXiv:2403.07816, 2024.

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models, 2023. URL https://arxiv.org/abs/2302.13971.

Lean Wang, Huazuo Gao, Chenggang Zhao, Xu Sun, and Damai Dai. Auxiliary-loss-free load balancing strategy for mixture-of-experts. arXiv preprint arXiv:2408.15664, 2024.

Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. Constructing datasets for multi-hop reading comprehension across documents. arXiv preprint arXiv:1710.06481, 2017.

Haoze Wu, Zihan Qiu, Zili Wang, Hang Zhao, and Jie Fu. GW-MoE: Resolving uncertainty in MoE router with global workspace theory. arXiv preprint arXiv:2406.12375, 2024.

Lemeng Wu, Mengchen Liu, Yinpeng Chen, Dongdong Chen, Xiyang Dai, and Lu Yuan. Residual mixture of experts. arXiv preprint arXiv:2204.09636, 2022.

Fuzhao Xue, Ziji Shi, Futao Wei, Yuxuan Lou, Yong Liu, and Yang You. Go wider instead of deeper. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 8779–8787, 2022.

Fuzhao Xue, Zian Zheng, Yao Fu, Jinjie Ni, Zangwei Zheng, Wangchunshu Zhou, and Yang You. OpenMoE: An early effort on open mixture-of-experts language models. arXiv preprint arXiv:2402.01739, 2024.

Yuanhang Yang, Shiyi Qi, Wenchao Gu, Chaozheng Wang, Cuiyun Gao, and Zenglin Xu. Enhancing efficiency in sparse models with sparser selection. arXiv preprint arXiv:2403.18926, 2024.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.

Xiaofeng Zhang, Yikang Shen, Zeyu Huang, Jie Zhou, Wenge Rong, and Zhang Xiong. Mixture of attention heads: Selecting attention heads per token.
In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pp. 4150–4162. Association for Computational Linguistics, 2022.

Hao Zhao, Zihan Qiu, Huijia Wu, Zili Wang, Zhaofeng He, and Jie Fu. HyperMoE: Towards better mixture of experts via transferring among experts. arXiv preprint arXiv:2402.12656, 2024.

Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M Dai, Quoc V Le, James Laudon, et al. Mixture-of-experts with expert choice routing. Advances in Neural Information Processing Systems, 35:7103–7114, 2022.

Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.

Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. ST-MoE: Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906, 2022.

Simiao Zuo, Xiaodong Liu, Jian Jiao, Young Jin Kim, Hany Hassan, Ruofei Zhang, Tuo Zhao, and Jianfeng Gao. Taming sparsely activated transformer with stochastic experts. arXiv preprint arXiv:2110.04260, 2021.

A.1 MORE RELATED WORKS

Routing Strategies While most MoE works follow the original approach and use token-choice routing, some works explore different routing schemes. In Expert-Choice Routing (Zhou et al., 2022), each expert selects the tokens it processes across the whole input batch. This avoids expert imbalance issues and allows different tokens to be processed by a flexible number of experts. Soft Mixture-of-Experts (Puigcerver et al., 2023) further assigns weights to input tokens, weight-averages them, and passes the merged tokens to different experts, going one step beyond Expert-Choice Routing to allow more precise control.
However, their token-selecting operations are non-causal and thus cannot be directly used in decoder models. Recent works (Huang et al., 2024; Yang et al., 2024) introduce a dynamic top-k for each input token. While this reduces FLOPs, the dynamic assignment can hurt the parallel computation of experts, so additional system-level implementation work is required to achieve wall-time efficiency. Some works also analyze issues in the routing of standard MoE, such as uncertain tokens (Wu et al., 2024) and the lack of expert knowledge transfer (Zhao et al., 2024).

Training Strategies Due to the unstable nature of MoE (Zoph et al., 2022), some works investigate special training strategies for MoE. EvoMoE (Nie et al., 2021) uses a large top-k (even equal to the number of experts) at the beginning of training and gradually decreases k. StableMoE (Dai et al., 2022) proposes to freeze the router after training on some tokens to avoid token assignment conflicts. Residual Mixture of Experts (Wu et al., 2022) initializes MoE from dense training checkpoints and finds this an efficient way to train MoE models. Later, sparse upcycling (Komatsuzaki et al., 2023) trains large-scale language models from dense checkpoints, and many works follow this paradigm to efficiently exploit MoE in fine-tuning (Li et al., 2023), instruction tuning (Lin et al., 2024), and visual instruction tuning (Ding et al., 2024). Different from directly training MoE models, some works continue training the same pre-trained model on several different datasets to encourage specialization and then combine the resulting models, either merging them into an MoE-style model (Gururangan et al., 2021; Sukhbaatar et al., 2024) or keeping a group of models and introducing a model-level router (Li et al., 2022; Gururangan et al., 2023).
Recurrence Controller A series of works introduces recurrent networks for Neural Architecture Search (NAS) (Zoph & Le, 2016; Ramachandran et al., 2017; Pham et al., 2018; Liu et al., 2018). They use a recurrent controller network that predicts layer i's architecture (e.g., the number, size, and stride of CNN filters) based on layer i's input hidden states and the previous recurrent states (Zoph & Le, 2016). While these works use an RNN to predict per-layer architecture configurations shared by all inputs, RMoE uses an RNN to help the router select expert combinations for each token, which can be viewed as a dynamic version of NAS.

A.2 EXPERIMENT SETUP

Enwiki8 and WikiText-103 We follow the default configurations in CompeteSMoE (Pham et al., 2024). Each model is trained for 80,000 steps with the Adam optimizer. The learning rate is 0.0007 with 4,000 warmup steps, and the batch size is 48. The main model is a decoder-only transformer architecture with 8 layers and a hidden size of 352. It includes 16 experts, of which the top 2 are selected during computation, each with an expert size of 352. The model uses 8 attention heads and handles sequences up to 512 tokens in length, with an attention span of 2048 tokens. It uses a dropout rate of 0.1 and a load balancing factor of 0.01 to ensure an even distribution of expert utilization.

Computation Cost Each 8-layer model is trained on one NVIDIA A100 GPU for approximately 21 hours.

Large-Scale Pre-training For the model architecture, our 24-layer model employs Rotary Embedding for positional encoding, SwiGLU activation functions, and RMSNorm to enhance the model's efficiency and performance. Other model configurations include a hidden size of 1280, 20 attention heads, an initialization standard deviation of 0.02, a sequence length of 4096, and a maximum positional embedding length of 4096. All dropout rates are set to 0.
For the MoE part, we use 16 experts, each with a feedforward network hidden size of 448, following the fine-grained MoE setting, and each token activates 4 experts. We use a tokenizer with a vocabulary size of 96,512, which adds approximately 123M embedding parameters and 123M vocabulary projection head parameters. Under this configuration, each model has approximately 664M non-embedding parameters, and every token activates 334M non-embedding parameters. The total parameter count is around 910M. For the pre-training configuration, we use a global batch size of 1120, a warmup period of 2000 iterations, a learning rate of 4.2e-4, a minimum learning rate of 4.2e-5, cosine learning rate decay, the Adam optimizer with β1 = 0.9 and β2 = 0.95, a weight decay of 0.1, and gradient clipping at 1.0.

Computation Cost Each 24-layer model is trained on 8 NVIDIA A100 GPUs for approximately 5 days.

Instruction Tuning Data The Alpaca (Taori et al., 2023) dataset is an open-source instruction-following dataset created by Stanford researchers, inspired by OpenAI's ChatGPT. It consists of 52,000 instruction-response pairs generated with the text-davinci-003 model by providing diverse and comprehensive instructions and recording the corresponding responses. It is designed to facilitate the training and evaluation of models in understanding and generating human-like text responses to various instructions.

Instruction Tuning Setting We use the codebase2 and its default configurations. More concretely, we use bfloat16 (bf16) precision to accelerate training while maintaining numerical stability. The model is trained for 3 epochs using the AdamW optimizer with a global batch size of 128. We set the learning rate to 2e-5 and do not apply weight decay.
A warmup ratio of 0.03 gradually increases the learning rate at the beginning of training, and a cosine learning rate scheduler adjusts it throughout the training process, promoting smoother convergence.

Computation Cost Each model is trained on 8 NVIDIA A100 GPUs for approximately 2 hours.

Evaluation Tasks Here we briefly describe the evaluation datasets:

ARC-Easy is a subset of the AI2 Reasoning Challenge (ARC) dataset (Clark et al., 2018). It consists of multiple-choice questions from elementary and middle school science exams that are relatively easier than the ARC-Challenge set. These questions require basic reasoning and knowledge application.

HellaSwag (Zellers et al., 2019) is a dataset designed for commonsense reasoning and narrative prediction. It involves choosing the most plausible continuation of a given scenario from multiple options. The task is challenging because it requires understanding and applying common sense knowledge.

The PIQA (Bisk et al., 2020) dataset tests a model's ability to understand and reason about physical interactions and affordances. The task involves selecting the correct answer to questions about everyday physical activities.

SciQ (Welbl et al., 2017) is a dataset of science questions that includes multiple-choice and direct-answer formats. It aims to test a model's ability to understand and reason with scientific concepts typically taught at the school level.

LAMBADA (Paperno et al., 2016) is a dataset designed for language modeling and comprehension. The task involves predicting the last word of a given passage, which requires a deep understanding of the context provided by the preceding text.

A.3 FURTHER PRETRAINING VALIDATION

To further validate the scalability of RMoE, we conduct experiments with larger model sizes and an increased pre-training corpus.
Both MoE models follow the design principles of DeepSeekMoE (Dai et al., 2024), utilizing fine-grained experts and shared experts to maintain strong baselines. We evaluate the models on more challenging benchmarks, including Hellaswag, MMLU, GSM8K, and HumanEval, to assess their language capabilities, multi-domain knowledge, mathematical skills, and coding abilities. Additionally, we test the models' perplexity on multiple domain test datasets and report the average results.

2https://github.com/tatsu-lab/stanford_alpaca

Tab. 9 and Tab. 10 present the performance of a 15-billion-parameter model with 2.7 billion activated parameters, trained on 120 billion and 400 billion tokens, respectively. The results show that RMoE consistently delivers improvements even with increased data volumes. The findings indicate that RMoE enhances performance both in standard language modeling tasks, such as Hellaswag and PPL, and on more complex reasoning tasks.

Table 9: Performance comparison of SMoE, SMoE-MLP and RMoE at the model scale of 15B parameters (2.7B activated), training 120B tokens.

                        Hellaswag  MMLU   GSM8K  Avg PPL
Pretrain 80B Tokens
  SMoE                  67.69      46.24  24.18  7.406
  SMoE-MLP              67.98      46.47  23.58  7.437
  RMoE                  68.00      47.74  27.14  7.361
Pretrain 100B Tokens
  SMoE                  70.98      50.61  30.78  6.754
  SMoE-MLP              70.8       50.6   30.17  6.786
  RMoE                  71.02      51.74  32.98  6.732
Pretrain 120B Tokens
  SMoE                  72.03      52.79  34.8   6.447
  SMoE-MLP              72.19      52.81  34.57  6.479
  RMoE                  72.36      54.02  36.13  6.425

Table 10: Performance comparison of SMoE, SMoE-MLP and RMoE at the model scale of 15B parameters (2.7B activated), training 400B tokens.
                        Hellaswag  MMLU   GSM8K  Avg PPL
Pretrain 200B Tokens
  SMoE                  69.48      49.96  33.21  7.718
  SMoE-MLP              69.76      50.27  31.77  7.736
  RMoE                  70.00      52.21  32.98  7.608
Pretrain 280B Tokens
  SMoE                  72.40      54.66  42.61  6.477
  SMoE-MLP              72.62      55.33  38.51  6.502
  RMoE                  73.18      56.06  44.35  6.400
Pretrain 400B Tokens
  SMoE                  76.39      59.54  52.16  5.685
  SMoE-MLP              76.09      59.96  51.71  5.709
  RMoE                  76.72      60.60  52.99  5.620

A.4 ADDITIONAL OBSERVATIONS

A.4.1 ROUTER GRADIENT NORM AND DROP RATIO

Table 11: Comparison of linear and RNN routers in terms of gradients and drop ratios at various training steps. We record the router gradient every 10k training steps (20B tokens). We compute the gradient with the language modeling (LM) loss and the load balance (LB) loss. The drop ratio is the ratio of dropped tokens to all tokens, as we assign a capacity factor of 1.0 to each expert.

Training steps (k)            0.1    10     20     30     40     50     60
Linear router
  grad from the whole loss    1.058  0.194  0.1911 0.198  0.208  0.217  0.221
  grad from LM loss           0.625  0.183  0.184  0.192  0.204  0.215  0.220
  grad from LB loss           0.433  0.011  0.008  0.006  0.004  0.002  0.001
  drop ratio (%)              35.6   5.43   5.34   5.17   4.89   4.64   4.50
RNN router
  grad from the whole loss    0.972  0.160  0.153  0.153  0.155  0.155  0.154
  grad from LM loss           0.636  0.146  0.138  0.139  0.144  0.148  0.151
  grad from LB loss           0.337  0.014  0.015  0.014  0.011  0.007  0.003
  drop ratio (%)              38.7   6.35   6.30   5.94   5.32   4.54   4.09

Based on the setting of training 15B models for 120B tokens, we investigate how the gradient norm of the router varies throughout training. When training an MoE-based model, the gradient of the router has two separate sources: (1) the language modeling (LM) loss, and (2) the load balancing (LB) loss that forces the router to assign tokens to different experts in a balanced manner. Therefore, for each router, we compare the gradient from the LM loss only and from the whole training loss. We average over 100 training steps to estimate the gradient norm.
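For reference, the LB loss discussed above is typically the auxiliary load-balancing objective popularized by Switch Transformers; the paper does not restate its exact form here, so the following is a minimal illustrative sketch (our own function name, top-1 routing assumed for simplicity):

```python
import numpy as np

def load_balance_loss(router_probs, expert_indices, num_experts):
    """Auxiliary load-balancing loss in the Switch Transformers style:
    num_experts * sum_i f_i * P_i, where f_i is the fraction of tokens
    dispatched to expert i and P_i is the mean router probability for it.

    router_probs:   (num_tokens, num_experts) softmax outputs of the router
    expert_indices: (num_tokens,) top-1 expert chosen for each token
    """
    f = np.bincount(expert_indices, minlength=num_experts) / len(expert_indices)
    P = router_probs.mean(axis=0)
    return num_experts * float(np.sum(f * P))
```

Perfectly balanced dispatch with uniform gate probabilities attains the minimum value of 1.0; skewed gates and dispatch push the loss above it, which is the pressure this term exerts on the router.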
Furthermore, to better investigate the relation between router behavior and router gradient, we calculate the drop ratio for each router. During large-scale MoE pre-training, to ensure training efficiency, each expert's load is usually controlled by a hyper-parameter called the capacity factor, which determines the total number of tokens one expert can process. If the router assigns more tokens to an expert than its capacity allows, the expert drops the tokens with the lowest scores. We define the drop ratio as dropped tokens / total tokens. The LB loss mentioned before is critical to decreasing the drop ratio. According to Tab. 11, we have the following observations:

1. The gradient norm of the RNN router is generally smaller than that of the linear router, and for both routers, the drop ratio decreases over training.

2. The drop ratios reveal a significant behavioral difference between the two routers: during the early training phase (10k-30k steps), the drop ratio of the linear router is noticeably lower than that of the RNN router, but the RNN router reaches a lower value by the end of training.

3. The trend in the drop ratio is consistent with the gradient-norm results. The gradient norm from the LB loss is relatively higher for the RNN router until the final training stage (50k-60k steps), whereas the gradient from the LB loss for the linear router is high at the beginning and generally low during the later part of training (10k-60k steps). These phenomena indicate that the LB loss could dominate the training of the linear router: when the drop ratio is low and stays unchanged, the gradient from the LB loss is low because the router is already well-optimized for the LB loss. Such early convergence on the LB loss may lead to a suboptimal solution in the trade-off between optimizing load balance and language modeling.
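The capacity-factor mechanism described above can be sketched as follows (a minimal sketch with our own names; in a real router the dropped overflow is specifically the lowest-scoring assignments, but the drop *count* depends only on the expert load):

```python
import numpy as np

def compute_drop_ratio(expert_indices, num_experts, capacity_factor=1.0):
    """Fraction of token-to-expert assignments dropped when each expert
    keeps at most `capacity` assignments.

    expert_indices: (num_assignments,) chosen expert id for each assignment
    """
    num_assignments = len(expert_indices)
    # each expert can process capacity_factor * (num_assignments / num_experts)
    capacity = int(capacity_factor * num_assignments / num_experts)
    counts = np.bincount(expert_indices, minlength=num_experts)
    dropped = np.clip(counts - capacity, 0, None).sum()
    return dropped / num_assignments
```

With a capacity factor of 1.0, a perfectly balanced assignment drops nothing, while any overloaded expert sheds its overflow, which is exactly what the drop ratios in Tab. 11 measure.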
On the contrary, the gradient of the RNN router from the LB loss stabilizes in the early training steps (10k-30k), while the gradient from the LM loss keeps decreasing, suggesting that the RNN router is optimized more towards the LM loss.

A.4.2 MUTUAL INFORMATION

import numpy as np
from sklearn.metrics import mutual_info_score

def discretize_prob_dist(prob_dist, bins=100):
    """Discretize the probability distribution into discrete bins."""
    return np.digitize(prob_dist, bins=np.linspace(0, 1, bins))

def calculate_mutual_information(x1, x2, bins=100):
    """Calculate mutual information between each pair of distributions in x1 and x2.

    x1, x2: numpy arrays of shape (N, 16)
    bins: number of bins to use for discretization
    Returns a numpy array of mutual information values.
    """
    mi_values = []
    for i in range(x1.shape[0]):
        x1_discretized = discretize_prob_dist(x1[i], bins)
        x2_discretized = discretize_prob_dist(x2[i], bins)
        mi_values.append(mutual_info_score(x1_discretized, x2_discretized))
    return np.array(mi_values)

A.4.3 EXPERT SIMILARITIES

import torch
from torch import nn

def get_similarities(htoh4_0, htoh4_1, h4toh):
    # htoh4_0, htoh4_1: (num_experts, 4h, h); h4toh: (num_experts, h, 4h)
    avg_key_0 = htoh4_0.mean(dim=1)
    avg_key_1 = htoh4_1.mean(dim=1)
    avg_value = h4toh.mean(dim=2)
    normed_key_0 = nn.functional.normalize(avg_key_0, p=2, dim=1)
    normed_key_1 = nn.functional.normalize(avg_key_1, p=2, dim=1)
    normed_value = nn.functional.normalize(avg_value, p=2, dim=1)
    normed_avg_expert = torch.cat([normed_key_0, normed_key_1, normed_value], dim=1)
    # compute the average expert similarity
    similarity = torch.mm(normed_avg_expert, normed_avg_expert.t())
    avg_sim = similarity.mean().item()
    return avg_sim

A.4.4 MORE ROUTER ENTROPY DISTRIBUTIONS

Figure 6: Mutual information of RMoE-NP-r0.5 and Cosine MoE settings. (Heatmap values omitted; the panels show the mean MI of RMoE-NP-r0.5 and the mean MI of Cosine MoE across layers 0-7.)
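The MI values visualized in Figures 6 and 7 come from the discretize-then-score recipe in A.4.2. As a quick self-contained sanity check of that recipe (random gate distributions of our own, not the paper's routing data; sklearn assumed available as in the listing):

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# Two random 16-expert gate distributions, discretized into 100 bins as above.
rng = np.random.default_rng(0)
x = rng.dirichlet(np.ones(16), size=2)
bins = np.linspace(0, 1, 100)
d0, d1 = np.digitize(x[0], bins), np.digitize(x[1], bins)

# A distribution shares at least as much information with itself
# (MI(X, X) = H(X)) as with another distribution.
self_mi = mutual_info_score(d0, d0)
cross_mi = mutual_info_score(d0, d1)
assert self_mi >= cross_mi
```

This is the per-pair computation that `calculate_mutual_information` loops over for every token position.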
Figure 7: Mutual information of SMoE, RMoE, RMoE-NP, and RMoE-NP-r0.5 in 24-layer models. (Heatmap values omitted; the panels show the mean mutual information of the naive router, the recurrent router, the recurrent router w/o passing hiddens, and the recurrent router w/o passing hiddens with 0.5 shared logits.)
Figure 8: Gate score entropy distribution over the Enwiki test set for different routers in 8-layer models. (Density plots of gate entropy for selected layers 0-7; each panel compares SMoE, RMoE, Hyper MoE, Random MoE, Cosine MoE, and XMoE.)

A.4.5 ROUTER WEIGHTS INFORMATION

A.4.6 EXPERT SELECTION FREQUENCY

A.5 ADDITIONAL RESULTS

Table 12: More SMoE and RMoE variants: pre-training costs and evaluation results on selected informative lm-evaluation-harness tasks. "sft" means supervised fine-tuning on the Alpaca dataset. The task names and metrics for the short names in the table are: ARC-e for ARC-Easy (acc); Hella for Hellaswag (acc-norm); Piqa for PIQA (acc-norm); Lamb for LAMBADA (acc).
Algorithm / Training                ARC-e  Hella  Piqa   Sciq  Lamb   Avg
20B (5k steps)                      47.14  35.51  64.69  76.2  14.61  47.63
  +sft                              50.93  35.82  65.61  74.7  17.81  48.97
  +sft (freeze gate)                50.59  35.78  66.32  74.7  18.18  49.11
40B (10k steps)                     52.57  40.85  67.74  83.4  26.74  54.26
  +sft                              53.7   42.07  68.61  83.5  32.8   56.13
  +sft (freeze gate)                53.45  41.94  68.88  83.1  32.06  55.89
GRU p = 128, 20B                    47.01  35.91  65.23  78.7  19.13  49.20
  +sft                              48.53  36.9   66.21  79.6  24.74  51.20
  +sft (freeze router)              48.65  36.88  66.43  80.1  24.55  51.32
  +sft (freeze router and GRU)      49.24  36.79  66.16  79.7  24.32  51.24
GRU p = 128, 40B                    51.18  41.38  67.79  83.6  32.58  55.31
  +sft                              53.20  43.05  68.55  83.8  37.16  57.15
  +sft (freeze router)              53.03  42.96  68.34  83.6  36.68  56.92
  +sft (freeze router and GRU)      53.11  43.16  68.77  82.8  37.57  57.08
GRU p = 256, 20B                    47.47  35.91  65.78  76.2  20.03  49.08
  +sft                              48.36  36.49  65.07  77.4  22.86  50.04
  +sft (freeze router)              48.27  36.42  65.23  76.9  22.88  49.94
  +sft (freeze router and GRU)      48.23  36.46  64.94  77.3  22.61  49.91
GRU p = 256, 40B                    53.07  41.15  68.52  84.0  19.17  53.18
  +sft                              54.46  43.06  67.46  84.9  24.57  54.89
  +sft (freeze router)              54.45  43.10  67.19  84.1  23.93  54.55
  +sft (freeze router and GRU)      54.50  43.13  67.36  83.8  23.62  54.48
GRU p = 512, 20B                    47.77  35.39  64.80  79.5  25.00  50.49
  +sft                              48.27  36.47  65.51  76.6  22.18  49.81
  +sft (freeze router)              47.73  36.41  65.78  76.6  22.88  49.88
  +sft (freeze router and GRU)      48.19  36.22  65.29  76.8  23.5   50.00
GRU p = 512, 40B                    51.64  41.37  66.81  86.0  22.76  53.72
  +sft                              52.82  42.68  68.55  86.0  26.88  55.39
  +sft (freeze router)              52.48  42.61  68.44  86.0  27.23  55.35
  +sft (freeze router and GRU)      52.74  42.44  68.77  86.3  27.13  55.48
RNN p = 256, 20B                    46.63  35.7   64.91  76.1  16.24  47.92
  +sft                              48.40  36.45  65.51  77.3  22.65  50.06
  +sft (freeze router)              48.70  36.29  65.45  77.3  22.60  50.07
  +sft (freeze router and RNN)      49.24  36.48  65.56  77.7  23.20  50.44

Figure 9: Gate score entropy distribution over the Enwiki test set for different information-passing settings in 8-layer models. (Density plots of gate entropy for selected layers 0-7; each panel compares SMoE, RMoE, RMoE-np, RMoE-np-0.5, and RMoE-np-1.0.)

Figure 10: Gate score entropy distribution over the Enwiki test set for different routers. RMoE can be combined with XMoE to encourage the exploration of XMoE. (Density plots of gate entropy for layers 0-7; each panel compares SMoE, RMoE, XMoE, and XMoE-R.)
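The gate-score entropy plotted in these figures is the per-token Shannon entropy of the router's softmax distribution; a minimal sketch (our own function name) is:

```python
import numpy as np

def gate_entropy(router_logits):
    """Per-token Shannon entropy (in nats) of the router's softmax gate scores.

    router_logits: (num_tokens, num_experts)
    Returns: (num_tokens,) entropies; higher means a more uniform gate.
    """
    z = router_logits - router_logits.max(axis=-1, keepdims=True)  # stability
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return -(probs * np.log(probs + 1e-12)).sum(axis=-1)
```

A perfectly uniform gate over 16 experts gives the maximum entropy log 16 ≈ 2.77 nats, while a near-one-hot gate gives entropy near zero, so the density plots above read as "how decisively the router commits to experts" per layer.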
Figure 11: Router weight statistics per layer (left column: norm; right column: standard deviation) in the Enwiki8 setting, for 12-, 18-, 24-, and 32-layer models, comparing SMoE, RMoE, RMoE_NP, RMoE_NP_r05, and the initialization. (1) Different layers have different norms and standard deviations, which inspires us to introduce the layerwise projector in Equ. 4 and explains why using a shared projector can hurt RMoE's performance (Tab. 6). (2) While SMoE routers show larger weight norms than the RMoE settings, their standard deviations are not the highest. The large router norms can potentially explain the larger IB and OB in Tab. 8.
Figure 12: Expert selection frequency of different methods on medium-size models in Enwiki8 (SMoE, RMoE, Hyper MoE, Random MoE, XMoE, and Cosine MoE). (1) RMoE slightly increases expert imbalance compared to SMoE. (2) Methods using a frozen, randomly initialized router (Hyper MoE and Random MoE) show more severe imbalance.

Figure 13: Expert similarity in Enwiki8 training experiments (left: SMoE, RMoE, XMoE, Hyper MoE, Stable MoE, Random MoE; right: SMoE, RMoE, XMoE, XMoE_R). Random MoE shows the highest expert similarity. XMoE, which introduces down-projected cosine routing to resolve representation collapse in SMoE, shows the lowest expert similarity. While RMoE does not diversify experts as significantly as in the large-scale training settings (left), it can be further combined with XMoE, which largely increases expert diversity and brings improvement (right).

Figure 14: Validation BPC on Enwiki8 with different model sizes (6, 12, 18, 24, 32 layers), comparing SMoE, RMoE, RMoE-NP, and RMoE-NP-r0.5.