# MoEC: Mixture of Expert Clusters

Yuan Xie*, Shaohan Huang, Tianyu Chen, Furu Wei
Microsoft Research Asia, China
{v-yuanxie, shaohanh, v-tianyuchen, fuwei}@microsoft.com

*Work done during internships at Microsoft Research Asia. The source code of this paper can be obtained from https://github.com/xy980523/MoEc model.
The Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI-23). Copyright 2023, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

## Abstract

Sparse Mixture of Experts (MoE) has received great interest due to its promising scaling capability with affordable computational overhead. MoE models convert dense layers into sparse experts and utilize a gated routing network to make experts conditionally activated. However, as the number of experts grows, MoE with outrageous parameters suffers from overfitting and sparse data allocation. Such problems are especially severe on tasks with limited data, thus hindering the progress towards improving performance by scaling up. We verify that there exists a performance upper bound of scaling up sparse MoE. In this work, we propose Mixture of Expert Clusters (MoEC), a general approach to enable expert layers to learn more diverse and appropriate knowledge by imposing variance-based constraints on the routing stage. Given this, we further propose a cluster-level expert dropout strategy specifically designed for the expert cluster structure. Our experiments reveal that MoEC could improve performance on machine translation and natural language understanding tasks. MoEC plays a positive role in mitigating overfitting and sparse data allocation problems, thus fully releasing the potential of large-scale sparse models.

## Introduction

Scaling up model capacity has been shown to be promising for achieving better performance on a variety of tasks, including natural language understanding (Brown et al. 2020; Raffel et al. 2019) and visual representation learning (Dosovitskiy et al. 2020; Bao, Dong, and Wei 2021). The continued growth in model size and parameters brings higher computational cost, while large dense models have almost hit the boundary of hardware capacity. In pursuit of better computational efficiency, sparse Mixture-of-Experts (MoE) has been proposed as an efficient alternative to dense models (Lepikhin et al. 2020; Fedus, Zoph, and Shazeer 2021; Riquelme et al. 2021; Lewis et al. 2021). In sparsely-gated MoE transformers, the feed-forward network (FFN) sub-layer is replaced by a set of experts with independent parameters. The sparsity of MoE is brought by the experts and the gated routing network. The gated routing network calculates the routing score between input tokens and each expert and activates the experts with the top-k routing scores. Most experts are not activated, thus forming a sparse structure. Since the computation cost is only proportional to the activated top-k sub-network, sparsely activated MoE models can scale up model parameters without significantly increasing computational cost. With affordable computational overhead, MoE models achieve better performance than dense models on various tasks such as neural machine translation (Lewis et al. 2019; Conneau and Lample 2019; Lepikhin et al. 2020), image recognition (Riquelme et al. 2021) and speech recognition (Kumatani et al. 2021). Recent studies have reached a consensus that more experts mean more parameters and larger model capacity, which always bring improvements.
However, some studies show that more trainable parameters and sparse conditional computation may introduce overfitting (Xue et al. 2021; Lou et al. 2021; Xue et al. 2022), especially for downstream tasks with limited data. As depicted in Figure 1, as the number of experts grows, overfitting gradually becomes apparent on the machine translation task. Moreover, we find that enlarging the size of the MoE model will not always lead to improvement. There seems to exist a performance upper bound for scaling up experts with limited data. Besides, we find an unreasonable phenomenon in Figure 1: the 64-expert MoE, with more parameters and larger model capacity, has higher training loss than the 32-expert MoE. It implies that large-scale MoE not only suffers from overfitting, but also has other hidden problems that affect training.

Figure 1: A simple demonstration of loss curves of MoE models on the WMT-14 English-to-German translation task. We show the loss curves of MoE baseline models with different numbers of experts. The value in the box represents the minimum loss.

According to our analysis, the probability of each expert getting a token reduces proportionately as the number of experts grows. With the same data, each expert gets less diverse samples. Insufficient data not only affects the training of expert layers, but also aggravates overfitting. Therefore, we want to explore ways in which experts could get diverse samples and learn abundant knowledge, thereby alleviating overfitting and sparse data allocation.

In this work, we propose Mixture of Expert Clusters (MoEC), a general optimizing strategy for MoE models. All the experts in an MoE model are clustered to form several static expert clusters by minimizing the clustering loss associated with the routing probability among neighbor experts. The inductive bias expects that the similarity of intra-cluster experts is high while the similarity of inter-cluster experts is low. Experts within a cluster are prone to tokens with similar hidden states and could share similar tokens. Given the cluster structure, we further propose a cluster-level expert dropout strategy. Several experts in the cluster are randomly dropped, and the dropped experts do not participate in the routing stage. The activated experts are selected from the remaining experts in the cluster. Implementing dropout within clusters ensures that tokens are always dispatched to suitable experts, no matter how random the dropout is.

We evaluate our MoEC on machine translation and natural language understanding tasks. Experiment results show that MoEC outperforms dense models and baseline MoE models. It indicates that MoEC retains the advantages of the sparse structure of MoE, and alleviates the overfitting and sparse data allocation problems. Our contributions are summarized as follows:

- We point out the overfitting and sparse data allocation problems for large-scale MoE models, and argue that experts getting less diverse samples could be the common cause of both problems.
- We propose to build expert clusters by variance-based constraints, which allows experts to get a more diverse set of similar tokens. We also implement cluster-level expert dropout as a regularization method.
- We conduct experiments on machine translation and natural language understanding tasks. MoEC could improve performance and alleviate the problems caused by scaling up experts without changing the model structure and routing strategy.
- We find that there exists a performance upper bound for scaling up MoE models with limited data. MoEC could raise this performance upper bound, thus exploiting the potential of large-scale sparse models.

## Related Work

In the context of modern deep learning architectures, scaling up transformers using sparse Mixture of Experts (MoE) has proven to be effective for achieving state-of-the-art performance on various NLP and CV tasks (Shazeer et al. 2017; Lepikhin et al. 2020; Riquelme et al. 2021; Fedus, Zoph, and Shazeer 2021). Compared with dense transformers, an MoE model contains several experts (feed-forward networks) and a router that selects the top-k experts for input tokens. It increases the model capacity by such conditional computation while maintaining computational efficiency. To further explore the potential of MoE, some studies focus on the router assignment algorithm (Lewis et al. 2021; Roller et al. 2021; Dai et al. 2022). Besides, some works focus on optimizing training methods for MoE models. Dua et al. (2021) applied a temperature heating mechanism for sparse MoE models on the translation task. Chi et al. (2022) proposed a dimension reduction to estimate the routing scores between tokens and experts on a low-dimensional hyper-sphere. Our work is also proposed to optimize the MoE model. Instead of changing the model structure and routing strategy, MoEC establishes static expert clusters, which allows experts to be assigned similar and more diverse tokens.

Although MoE models have achieved promising results, they are known to have overfitting problems (Fedus, Zoph, and Shazeer 2021; Wu et al. 2022; Xue et al. 2022) on downstream tasks with limited data. To mitigate overfitting, some works use knowledge distillation to distill MoE models into small-sized MoE models or dense models (Xue et al. 2022; Dai et al. 2022). Another approach is to apply a dropout strategy during training. Fedus, Zoph, and Shazeer (2021) set a small dropout rate at non-expert layers and a larger dropout rate at expert layers. Liu et al. (2022) propose gating dropout, which allows some tokens to ignore the gated routing network and stay on their local machines to reduce cross-machine communication. In our work, we propose cluster-level expert dropout: randomly selected experts in the cluster are dropped so that they do not participate in the routing stage.

## Preliminary

To build MoE transformers, it is a common practice to replace feed-forward network (FFN) sub-layers with a set of experts. The experts share the same structure as the FFN layer in the dense transformer model. We denote the hidden representation of input token $x$ as $h$, and the embedding of the $i$-th expert as $e_i$. The router computes the routing score $s_i = h^\top e_i$ to compare the similarity between $h$ and the set of experts $E$. Then, the router utilizes a gating function $\alpha(\cdot)$ to compute the gated value of expert $i$:

$$
\alpha_i =
\begin{cases}
\dfrac{\exp(s_i)}{\sum_{j=1}^{E} \exp(s_j)}, & \text{softmax gating} \\[2ex]
\dfrac{1}{1+\exp(-s_i)}, & \text{sigmoid gating}
\end{cases}
\tag{1}
$$

The gating function $\alpha_i$ represents the probability of dispatching the input token to expert $i$. The top-k gated values are used for dispatching the token $x$ according to $\alpha_i$, and the corresponding $k$ expert networks are conditionally activated. We denote the set of selected top-k indices as $K$. The output of the MoE layer is

$$
\sum_{i \in K} \alpha_i \, E_i(x)
\tag{2}
$$

where $E_i(x)$ is the $i$-th expert network, which is a feed-forward network. The output of the gated routing network is the linearly weighted combination of each expert's computation on the token, weighted by the gated value.
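To make the routing computation above concrete, the following is a minimal PyTorch sketch of a sparsely-gated MoE layer with softmax gating (Equation 1) and a top-k weighted combination of expert outputs (Equation 2). It is an illustrative sketch under our own simplifying assumptions (no capacity limits, no balancing, a naive per-expert loop instead of batched dispatch); names such as `SparseMoELayer`, `d_model`, `d_ffn`, and `top_k` are ours, not from the paper or its released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Illustrative sparse MoE layer: routing scores s_i = h^T e_i,
    softmax gating (Eq. 1), and a top-k weighted combination (Eq. 2)."""

    def __init__(self, d_model: int, d_ffn: int, num_experts: int, top_k: int = 1):
        super().__init__()
        self.top_k = top_k
        # Expert embeddings e_i used by the router.
        self.expert_emb = nn.Parameter(torch.randn(num_experts, d_model) * 0.02)
        # Each expert is an independent FFN with the same shape as the dense FFN.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ffn), nn.ReLU(), nn.Linear(d_ffn, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (num_tokens, d_model)
        scores = h @ self.expert_emb.t()            # s_i = h^T e_i, shape (T, E)
        alpha = F.softmax(scores, dim=-1)           # softmax gating, Eq. 1
        top_alpha, top_idx = alpha.topk(self.top_k, dim=-1)

        out = torch.zeros_like(h)
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = top_idx[:, k] == e           # tokens routed to expert e at rank k
                if mask.any():
                    out[mask] += top_alpha[mask, k].unsqueeze(-1) * self.experts[e](h[mask])
        return out

# Usage: route 8 example tokens through a 16-expert layer with top-1 gating.
layer = SparseMoELayer(d_model=32, d_ffn=64, num_experts=16, top_k=1)
tokens = torch.randn(8, 32)
print(layer(tokens).shape)  # torch.Size([8, 32])
```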
## Method

In this work, our goal is to give experts access to more diverse training samples, thus mitigating overfitting and sparse data allocation. We encourage the static clustered structure by minimizing a clustering loss associated with the routing probability among neighbor experts, implemented as a variance-based constraint. Besides, we further propose a cluster-level expert dropout strategy. In our work, we use top-1 gating, i.e., only the expert with the largest routing score is activated, and we use softmax gating. Experts in a cluster are placed on the same device to reduce communication costs.

### Mixture of Expert Clusters

We illustrate our MoEC (Mixture of Expert Clusters) in Figure 2.

Figure 2: Illustration of a classic MoE layer and our proposed static MoEC layer. The similarity between hidden states $H_i$ is represented by the color.

For classic MoE, the routing probability of tokens is not constrained. The router always dispatches input tokens to their best-matched experts, while other similar experts have little chance of being selected. When scaling up the number of experts, the sparse data distribution causes each expert to get less diverse tokens, so expert layers cannot get adequately trained. Also, the amount of data is insufficient to match the growing number of parameters, which is the main reason for overfitting. In order to solve the problems of classic MoE, our MoEC allows each expert to get richer and more diverse tokens. We impose variance-based constraints on the routing stage, aiming to make neighbor experts have similar routing probabilities for input tokens, thus forming expert clusters that are prone to tokens with similar hidden states. In MoEC, experts get a more diverse set of similar input tokens by sharing input tokens with the other experts in the cluster.

Compared with previous MoE-related work, our training objective adds an extra term, the clustering loss. The overall training objective is to minimize:

$$
\mathcal{L} = \mathcal{L}_{\text{task}} + \mathcal{L}_{\text{balance}} + \mathcal{L}_{\text{cluster}}
\tag{3}
$$

$\mathcal{L}_{\text{task}}$ is determined by the specific task. In our work, we employ the label-smoothed cross-entropy loss for neural machine translation, the masked language modeling loss for pre-training the language model, and the negative log-likelihood loss (NLL loss) or mean-squared error loss (MSE loss) for GLUE tasks. In the following, we introduce $\mathcal{L}_{\text{balance}}$ and $\mathcal{L}_{\text{cluster}}$.

Load Balancing Loss. During training, there exists a load imbalance issue between experts (Shazeer et al. 2017; Lepikhin et al. 2020): most tokens are dispatched to a small number of experts, while many other experts do not get sufficiently trained at all. Besides, imbalanced assignments result in a computational bottleneck in the MoE layer and thus limit computational efficiency. We follow Fedus, Zoph, and Shazeer (2021) and add a balance loss to the training objective to encourage a balanced load across experts. Given $N$ experts indexed by $i = 1$ to $N$, the balance loss is computed as follows:

$$
\mathcal{L}_{\text{balance}} = \alpha \, N \sum_{i=1}^{N} f_i \, p_i
\tag{4}
$$

where $f_i$ is the fraction of tokens dispatched to expert $i$. We denote the number of tokens dispatched to the $i$-th expert as $\mathrm{Count}_i$. Given a batch $B$ with $T$ tokens, $f_i = \mathrm{Count}_i / T$. $p_i$ is the fraction of the routing probability allocated to expert $i$ in the batch $B$. It is calculated by averaging the probability of routing token $x$ to expert $i$ over the batch $B$:

$$
p_i = \frac{1}{T} \sum_{x \in B} \alpha_i(x)
\tag{5}
$$

where $\alpha_i(x)$ is the gating function defined in Equation 1, which represents the probability of dispatching token $x$ to expert $i$. The balance loss in Equation 4 encourages uniform routing since it is minimized under a uniform distribution. To control the impact of the balance loss during training, a hyper-parameter $\alpha$ is applied as a multiplicative coefficient. Throughout this work, we use $\alpha = 10^{-2}$, which is sufficiently large to ensure load balancing while small enough not to overwhelm the primary cross-entropy objective.
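Below is a minimal sketch of how the balance loss of Equations 4 and 5 could be computed from a batch of routing scores, assuming top-1 dispatch and softmax gating; the helper name `load_balance_loss` and its exact signature are ours.

```python
import torch
import torch.nn.functional as F

def load_balance_loss(scores: torch.Tensor, top1_idx: torch.Tensor, alpha: float = 1e-2) -> torch.Tensor:
    """Balance loss of Eq. 4: alpha * N * sum_i f_i * p_i.

    scores:   (T, N) routing scores for T tokens over N experts.
    top1_idx: (T,)   index of the expert each token was dispatched to.
    """
    num_tokens, num_experts = scores.shape
    gates = F.softmax(scores, dim=-1)                        # alpha_i(x), Eq. 1
    # f_i: fraction of tokens dispatched to expert i (Count_i / T).
    counts = torch.bincount(top1_idx, minlength=num_experts).float()
    f = counts / num_tokens
    # p_i: mean routing probability allocated to expert i over the batch (Eq. 5).
    p = gates.mean(dim=0)
    return alpha * num_experts * torch.sum(f * p)

# Example: 128 tokens routed over 64 experts with top-1 dispatch.
# The overall objective of Eq. 3 would add this to the task and clustering losses.
scores = torch.randn(128, 64)
print(load_balance_loss(scores, scores.argmax(dim=-1)).item())
```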
Clustering Loss. In our work, we find that the sparse allocation of data severely hinders the adequate training of MoE layers and exacerbates overfitting. In order to allow experts to get rich and diverse tokens and mitigate the impact of sparse allocation, we design the clustering loss. The loss constrains certain neighbor experts so that they share similar routing probabilities for tokens, thus forming a static cluster-like distribution. For input tokens originally dispatched to their best-matched experts, the clustering loss gives them more opportunities to access the other experts in the cluster. As a result, experts are assigned a more diverse set of similar tokens, thus alleviating the problem of sparse allocation.

In an MoE model with $N$ experts, the clustering loss guides the experts to form $m$ clusters ($N$ should be divisible by $m$), and each cluster contains $L = N/m$ experts. We use $E_i^j$ to represent the $j$-th expert in the $i$-th cluster, and $p_i^j$ to represent the routing probability allocated to $E_i^j$ ($i = 0, 1, \dots, m-1$; $j = 0, 1, \dots, L-1$). According to the size and number of clusters, $p_i^0, p_i^1, \dots, p_i^{L-1}$ compose a vector $P_i \in \mathbb{R}^L$ representing the routing probabilities of the $L$ experts in the $i$-th cluster, and we denote their mean value as $\bar{p}_i$. We define the clustering loss as follows:

$$
\mathcal{L}_{\text{cluster}} = \beta N \left( C_{\text{intra}} + C_{\text{inter}} \right)
= \beta N \left( \frac{\sum_{i=0}^{m-1} \delta(P_i)}{m} + e^{-\mu \left( \max\{\bar{p}_i\} - \max_2\{\bar{p}_i\} \right)} \right)
\tag{6}
$$

As can be seen from Equation 6, the clustering loss is mainly composed of two parts: the variance-based intra-cluster constraint $C_{\text{intra}}$ and the difference-based inter-cluster constraint $C_{\text{inter}}$. Here

$$
\delta(P_i) = \frac{(p_i^0 - \bar{p}_i)^2 + (p_i^1 - \bar{p}_i)^2 + \dots + (p_i^{L-1} - \bar{p}_i)^2}{L}
$$

represents the variance of the routing probabilities in the $i$-th cluster. We compute the mean variance of the $m$ clusters as the intra-cluster constraint $C_{\text{intra}}$, which is minimized when the routing probabilities of the experts within the same cluster are balanced. Besides, we use $C_{\text{inter}}$ to measure the probability difference between the dispatched cluster and the sub-optimal cluster; $\max\{\cdot\}$ denotes the maximum value of $\bar{p}_i$ ($i = 0, 1, \dots, m-1$) and $\max_2\{\cdot\}$ denotes the second-largest value. $C_{\text{inter}}$ is minimized when the probability of a token being dispatched to a sub-optimal cluster is low. $\mu$ is the coefficient used to control the value of $C_{\text{inter}}$. When we set $\mu = 0$, the probability difference between clusters is not considered; we can also set $\mu$ to a non-zero value to activate $C_{\text{inter}}$. We conduct in-depth experiments and analysis on this in Section 5.6.

To minimize the clustering loss, the probability distribution within a cluster should be uniform, and the probability difference between clusters should be more apparent (optional). In the initial training steps, the variance among experts is very high, so the clustering loss dominates the optimization and guides the rapid formation of expert clusters. When the intra-cluster variance becomes stable, the clustering loss becomes relatively small and serves to maintain the expert clusters. Similar to the practice in the balance loss, a hyper-parameter $\beta$ is applied. The value of $\beta$ should be relatively small, because a large $\beta$ means a strong clustering constraint, making the experts in a cluster too similar. This would cause these experts to lose their characteristics, and the contributions of multiple similar experts would only be approximately equal to one expert. In our work, we set the value of $\beta$ to $10^{-2}$ by default. Experiments on the selection of $\beta$ values can be found in Appendix A.
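As a concrete illustration, here is a small sketch of a variance-based clustering constraint in the spirit of Equation 6, assuming experts are grouped into consecutive clusters of equal size. How the intra- and inter-cluster terms are combined (we simply add them and use an exponential penalty on the gap between the two largest cluster means, controlled by µ) is our reading of the equation rather than the paper's exact formula, and the function name `clustering_loss` is ours.

```python
import torch
import torch.nn.functional as F

def clustering_loss(scores: torch.Tensor, num_clusters: int,
                    beta: float = 1e-2, mu: float = 0.0) -> torch.Tensor:
    """Variance-based clustering constraint in the spirit of Eq. 6.

    scores: (T, N) routing scores; experts are grouped into `num_clusters`
            consecutive clusters of size L = N / num_clusters.
    """
    _, num_experts = scores.shape
    cluster_size = num_experts // num_clusters
    gates = F.softmax(scores, dim=-1)                  # alpha_i(x)
    p = gates.mean(dim=0)                              # batch-averaged routing probability per expert
    p = p.view(num_clusters, cluster_size)             # P_i for each cluster i

    # C_intra: mean of the per-cluster variances delta(P_i); minimized when the
    # probabilities inside each cluster are balanced.
    c_intra = p.var(dim=1, unbiased=False).mean()

    # C_inter (active when mu > 0): penalize a small gap between the best and the
    # second-best cluster mean, encouraging clearer inter-cluster separation.
    cluster_mean = p.mean(dim=1)                       # \bar{p}_i per cluster
    top2 = cluster_mean.topk(2).values
    c_inter = torch.exp(-mu * (top2[0] - top2[1]))

    return beta * num_experts * (c_intra + c_inter)

# Example: 64 experts grouped into 8 clusters of 8, intra-cluster term only (mu = 0).
scores = torch.randn(128, 64)
print(clustering_loss(scores, num_clusters=8).item())
```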
### Cluster-level Expert Dropout

When applying large-scale MoE models to tasks with limited data, overfitting issues naturally arise. Previous MoE-related work (Raffel et al. 2019; Fedus, Zoph, and Shazeer 2021) used dropout (Srivastava et al. 2014) at each layer to prevent overfitting. Here, cluster-level expert dropout acts as a regularization technique that is completely different from traditional dropout: it does not drop parameters, but drops some experts in the cluster, which makes the dispatching of tokens more random.

Implementation in Clusters. First, our cluster-level expert dropout works at the routing stage, so it is only implemented at expert layers. For the experts in a cluster, we randomly drop some of them by deleting their expert ids from the candidate expert list when calculating the routing probability. Thus, the corresponding experts are ignored in the routing stage. With a dropout rate $\gamma$, only the remaining $N(1-\gamma)$ experts participate in the calculation of the routing probability during training, and the dimension of the routing probability vector $P$ decreases from $\mathbb{R}^N$ to $\mathbb{R}^{N(1-\gamma)}$. All clusters implement the dropout simultaneously. It allows tokens to have more opportunities to be dispatched to other experts in the same cluster, instead of being repeatedly dispatched to the expert with the highest probability. From another perspective, each expert receives more diverse tokens without adding training data.

Cluster-Level Expert Dropout vs Traditional Expert Dropout. Traditional expert dropout is recommended in Fedus, Zoph, and Shazeer (2021). It is a dropout technique (Srivastava et al. 2014) to regularize MoE models, which acts on the feed-forward layer to reduce overfitting caused by too many parameters. While setting a relatively small dropout rate at non-expert layers (0.1), expert dropout increases the dropout rate at the interim feed-forward computation of each expert layer (0.4). Our expert dropout acts completely differently: we perform random dropout on the candidate list of experts during the routing stage. It does not reduce the number of parameters during training but allocates tokens more diversely and flexibly. While traditional expert dropout is usually used for fine-tuning on downstream tasks, our cluster-level expert dropout is a general regularization mechanism. In addition, our dropout can be applied together with Fedus' expert dropout, and the two can work together to improve the performance of MoE.
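The sketch below illustrates one way cluster-level expert dropout could be realized at the routing stage. Instead of literally shrinking the candidate list from $\mathbb{R}^N$ to $\mathbb{R}^{N(1-\gamma)}$, we mask the routing scores of the dropped experts so that they receive zero routing probability, which has the same effect for top-1 dispatch; the function name and the equal keep count per cluster are our assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def route_with_cluster_dropout(scores: torch.Tensor, num_clusters: int,
                               drop_rate: float = 0.5, training: bool = True):
    """Cluster-level expert dropout at the routing stage (a sketch under our
    assumptions): within every cluster the same number of randomly chosen
    experts is removed from the candidate list, so each token is still routed
    to some expert of its best-matching cluster.
    """
    _, num_experts = scores.shape
    cluster_size = num_experts // num_clusters

    if training and drop_rate > 0:
        keep_per_cluster = max(1, int(round(cluster_size * (1.0 - drop_rate))))
        mask = torch.zeros(num_clusters, cluster_size, dtype=torch.bool)
        for c in range(num_clusters):
            # Randomly keep `keep_per_cluster` experts in every cluster.
            keep = torch.randperm(cluster_size)[:keep_per_cluster]
            mask[c, keep] = True
        mask = mask.view(-1)                             # (N,) candidate-expert mask
        scores = scores.masked_fill(~mask, float("-inf"))

    gates = F.softmax(scores, dim=-1)                    # dropped experts get zero probability
    top1 = gates.argmax(dim=-1)                          # top-1 dispatch among remaining experts
    return top1, gates

# Example: 64 experts in 8 clusters of 8, dropout rate 0.5 -> 4 candidates per cluster.
scores = torch.randn(16, 64)
expert_ids, gates = route_with_cluster_dropout(scores, num_clusters=8, drop_rate=0.5)
print(expert_ids)
```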
Why Cluster-Level Is Better? It is natural to think that expert dropout could be implemented at the global level, which would also provide more opportunities for tokens to access other sub-optimal experts. But for global-level expert dropout, as shown in Figure 3, if a random dropout happens to drop the suitable experts, tokens may be dispatched to less relevant experts, and inappropriate dispatching may negatively impact the learning of experts. In MoEC, we address this problem by exploiting the cluster-like structure and designing a cluster-level expert dropout. Cluster-level dropout gives tokens the option to be randomly re-dispatched while confining the routing results to a reasonable range: no matter how random the dropout is, tokens will always be dispatched to experts with similar routing probabilities. We conduct in-depth experiments and analysis in Section 5.5.

Figure 3: Illustration of global-level expert dropout and cluster-level expert dropout. The similarity between hidden states $H_i$ is represented by the color.

## Experiments

We name our model MoEC (Mixture of Expert Clusters) and evaluate its performance on bilingual machine translation and natural language understanding tasks. We use the X-MoE model from Chi et al. (2022) as our backbone architecture, which has shown better performance than prior MoE models such as Switch Transformers (Fedus, Zoph, and Shazeer 2021) on widely-used cross-lingual understanding benchmarks.

### Evaluation Dataset

WMT 2014 English-to-German. The Ninth Workshop on Statistical Machine Translation (WMT 2014) released a collection of datasets used in shared tasks including machine translation. We add additional news-commentary-v12 data from WMT-17 for training and validation. The total training data contains 3.96M English-to-German sentence pairs.

GLUE. The General Language Understanding Evaluation (Wang et al. 2018) benchmark is a collection of tools for evaluating the performance of models across a diverse set of existing NLU tasks, including MNLI (Williams, Nangia, and Bowman 2017), CoLA (Warstadt, Singh, and Bowman 2019), SST-2 (Socher et al. 2013), QQP, QNLI (Rajpurkar et al. 2016), MRPC (Dolan and Brockett 2005) and STS-B (Cer et al. 2017). We do not perform experiments on RTE because previous work (Chen et al. 2022) demonstrated that MoE is not suitable for this task. It is worth mentioning that we pre-train our model on the BooksCorpus (Zhu et al. 2015) and English Wikipedia corpus for 120k steps before fine-tuning on GLUE tasks.

### Experiments Setup

Model Architecture. For our MoEC and all baseline models, we follow the recommended settings in Vaswani et al. (2017) and use Transformer-big as the unified backbone architecture on the WMT 2014 English-German translation task. For GLUE tasks, we use Transformer-base as the backbone architecture. For MoE layers, we apply the 64-expert MoE model with 3 FFN sub-layers in the 3rd encoder block and 3rd decoder block (same as the setting in Lewis et al. (2021)). More detailed model hyper-parameters can be found in Appendix B.

Baselines. We use two baselines in our experiments. The first is the dense transformer (Vaswani et al. 2017). For the other, we follow the work in Chi et al. (2022) and apply X-MoE as our MoE baseline. It serves as a strong baseline that shows better performance than Switch Transformer (Fedus, Zoph, and Shazeer 2021) on widely-used cross-lingual understanding benchmarks.
The MoE baseline estimates routing scores between tokens and experts on a low-dimensional hypersphere and adds a learnable temperature scalar to the gating function. For a fair comparison, the two baseline methods are built with the same settings as MoEC, which can be found in Appendix B.

MoEC Hyper-Parameters. For MoEC, several unique hyper-parameters are introduced. For the clustering loss, we set $\beta$ to $10^{-2}$ according to the experiment results (see Appendix A) and set $\mu = 0$ by default. For the cluster size (the number of experts in a cluster) and the expert dropout rate, we present detailed experiments in the following sections.

Training Hyper-Parameters. For a fair comparison, the dense model, the MoE baseline model, and the MoEC model share the same training hyper-parameters. All models are trained with the Adam optimizer (Kingma and Ba 2014) ($\beta_1 = 0.9$, $\beta_2 = 0.98$). The learning rate is set to 5e-4 with 4000 warm-up steps and an inverse square root scheduler (Raffel et al. 2019). Batch size, training steps, and dropout rate are set per task and are recorded in Appendix C.

Figure 4: Loss curves on the WMT-14 validation set. All experiments are conducted with 64 experts for a fair comparison. The numbers in boxes indicate the lowest validation loss. Our MoEC shows an excellent ability to mitigate overfitting.

### Experiments Results

We train dense models, baseline MoE models, and MoEC models on several widely-used evaluation tasks, and the results are shown in Table 1.

| Model | WMT14 En-De (NMT) | MNLI | CoLA | SST-2 | QQP | QNLI | MRPC | STS-B | GLUE Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Dense | 27.10 | 85.97 | 57.10 | 92.87 | 91.20 | 92.23 | 87.50 | 89.18 | 85.16 |
| MoE Baseline | 30.59 | 87.27 | 75.60 | 93.30 | 91.37 | 92.33 | 86.30 | 88.28 | 87.78 |
| MoEC (w/o expert dropout) | 32.21 | 87.37 | 75.93 | 93.43 | 91.45 | 92.40 | 88.07 | 89.11 | 88.25 |
| MoEC | 32.50 | 87.37 | 76.80 | 93.37 | 91.40 | 92.45 | 88.23 | 89.24 | 88.41 |

Table 1: The performance on machine translation and GLUE tasks for baselines and MoEC models. WMT-14 is measured on the test set, while GLUE tasks are measured on the development sets. We report the average results over a set of seeds (see Appendix C). All experiments are conducted with 64 experts.

Compared with dense models, MoE models exhibit significant performance improvements, which benefit from the larger model capacity. Our MoEC brings a notable improvement over the MoE baseline even without applying the dropout strategy to experts: on WMT-14, it gives a 1.62 BLEU score boost. The advantage can be attributed to the clustered distribution of experts, which endows experts with more diverse training samples. Moreover, with the application of the cluster-level expert dropout strategy, the performance of MoEC is further improved.

As shown in Figure 4, the MoE baseline severely suffers from overfitting on WMT-14, while our MoEC shows an excellent ability to mitigate overfitting. The overfitting phenomenon on the validation set is almost eliminated, and the validation loss is relatively lower. It shows that when MoEC resolves the sparse allocation of data, each expert gets more abundant and diverse training samples. In this way, the training data of each expert is kept sufficient, thereby alleviating overfitting. Furthermore, we found that MoEC converges slightly more slowly. This is because each expert needs to learn from more diverse training samples, which takes more steps for the expert to get sufficiently trained.

### Detailed Analysis of Expert Clusters

Next, we conduct a detailed analysis of expert clusters. Figure 5 shows the fraction of tokens dispatched to cluster0 (including experts 0-3) during training and inference.

Figure 5: Fraction of tokens dispatched to cluster0 (including Experts 0-3) of the 64-expert MoEC (cluster size = 4) during training and inference. The graph on the left shows the fraction of tokens dispatched to cluster0 during training, while the right shows the fraction during inference.

During training, experts in cluster0 get similar input tokens, which reveals that clustering does not affect load balancing. During inference, the routing probabilities of the experts in the cluster vary, which indicates that the experts still retain their own characteristics. Experts can learn more fine-grained knowledge, which is the advantage of multiple similar experts compared to a single expert.
For WMT-14, the BLEU score of MoE with 16 experts is 30.49 (see Table 4), while the BLEU score of MoEC with 16 clusters (cluster size = 4) is 32.16. It proves that multiple similar experts have an obvious advantage over a single expert.

The cluster size also has a critical impact on the learning of MoEC, so we conduct experiments on different cluster sizes.

| Cluster size | Number of clusters | BLEU |
| --- | --- | --- |
| 1 | 64 | 30.59 |
| 4 | 16 | 32.16 |
| 8 | 8 | 32.21 |
| 16 | 4 | 29.98 |

Table 2: The performance of MoEC with different cluster sizes on WMT-14. All experiments were conducted with 64 experts. For a fair comparison, none of the methods employs dropout on experts.

As depicted in Table 2, the best performance is obtained when the cluster size is 8. Compared to the MoE baseline with 64 experts, expert clusters bring about a 1.62 BLEU score improvement. When the cluster size is relatively small, less data is shared among experts, and the improvement brought by MoEC is not fully exploited. As a special case, when the cluster size is 1, a single expert cannot be called a cluster, and MoEC is equivalent to the MoE baseline. When the cluster size is large, the data shared among experts increases, but the similarity and correlation of these data become lower, which adversely impacts the specialization of each expert. When we expand the cluster size to 16, the performance of MoEC is even lower than that of the MoE baseline, which means that an excessively large cluster size suppresses the advantages of the MoE structure and hurts performance.

### Expert Dropout: Cluster-Level vs Global-Level

In Table 3, we experiment on WMT-14 with the cluster-level expert dropout rate. We find that cluster-level dropout enhances the generalization performance of MoEC: such a regularization method brings a 0.29 BLEU score improvement. Experimental results show that 0.5 is a good choice for the dropout rate. Besides, it is obvious that global-level expert dropout hurts performance.

| Dropout rate | Cluster-level | Global-level |
| --- | --- | --- |
| 0 | 32.21 | 32.21 |
| 0.25 | 32.32 | 31.88 |
| 0.5 | 32.50 | 31.53 |
| 0.75 | 32.02 | 29.73 |

Table 3: Cluster-level vs global-level expert dropout on WMT-14. All experiments are conducted on the 64-expert MoEC with cluster size = 8.

For cluster-level expert dropout, when the best-matched expert for an input token is dropped, the routing decision is still made among the remaining experts in the cluster. Regardless of how the dropped experts are selected, there will always be experts left in each cluster, which ensures that suitable experts are always available. But for the global-level variant, due to the random distribution of experts, if all matched experts are dropped, the token will be routed to an inappropriate expert. This could cause experts to be distracted by low-relevance data, thus negatively impacting the learning of knowledge. Take Figure 3 as a simple example (with the dropout rate set to 0.5): for global-level expert dropout, when both expert1 and expert2 are dropped, $H_n$ can only be dispatched to expert3 or expert4. This inappropriate allocation could hurt the performance of the model.
### Role of the Inter-Cluster Constraint Coefficient

We further explore whether the inter-cluster constraint $C_{\text{inter}}$ (in Equation 6) helps improve performance.

Figure 6: Two sets of experiments on the inter-cluster constraint $C_{\text{inter}}$. All experiments are performed on WMT-14 En-De. The figure on the left shows experiments with different expert dropout rates (cluster size = 8), and the figure on the right shows experiments with different cluster sizes (without expert dropout).

As depicted in Figure 6, when the dropout rate is 0.75 or the cluster size is 4, setting $\mu = 1$ to activate the inter-cluster constraint brings better results. In these cases (small cluster size or high expert dropout rate), the number of available experts in the cluster is small, and the intra-cluster constraint alone is not enough to form a globally reasonable routing probability distribution, so the assistance of constraints between clusters is needed. When there are sufficient experts in the cluster, it is better not to apply the inter-cluster constraint, i.e., to set $\mu$ to 0. The intra-cluster constraint has already made the other experts in the cluster get higher routing probabilities, while the inter-cluster constraint would further widen the routing probability gap between clusters. This causes the entropy of the routing probability distribution to become too small, which is not conducive to the learning of the gated network.

### Raising the Upper Bound of MoE

In general, a higher number of experts means higher model capacity and better performance. However, for tasks with limited data, there exists a performance upper bound on scaling up MoE models. We take a deep dive into the ability of MoEC to raise this upper bound.

| Expert num | MoE baseline | MoEC | Benefits |
| --- | --- | --- | --- |
| 16 | 30.49 | 30.50 | +0.01 |
| 32 | 30.81 | 30.84 | +0.03 |
| 64 | 30.59 | 32.50 | +1.91 |
| 128 | 30.21 | 32.40 | +2.19 |

Table 4: Results of scaling up MoEC (BLEU scores on WMT-14).

As shown in Table 4, the MoE baseline reaches its performance upper bound when the number of experts is 32, meaning that continuing to increase the number of experts does not bring any gain to the model. Our MoEC not only has a performance advantage over the MoE baseline with the same number of experts, but also raises the upper bound from 32 to 64 experts. With more experts, MoEC brings larger gains, because it can fully exercise its ability to solve the severe overfitting and sparse allocation problems. With the mitigation of these two problems, the superiority of the large-scale MoE model is better exerted, thereby raising the upper bound of MoE models. With the help of MoEC, we could try to build sparse models with more experts.

## Conclusion

In our work, we point out the overfitting and sparse data allocation problems of large-scale MoE models and propose a novel training strategy, MoEC, to convert experts into clusters. Each expert gets more abundant and diverse training samples; in this way, the training data of each expert is kept sufficient, thereby alleviating overfitting. We also propose cluster-level expert dropout as a regularization method. We conduct experiments on machine translation and natural language understanding tasks. The experiment results show that MoEC improves performance and alleviates the problems caused by scaling up experts without changing the model structure and routing strategy. The superiority of the large-scale MoE model will be better exerted by MoEC, thereby raising the upper bound of MoE models.
With the help of MoEC, we could try to build sparse models with more experts.

## Appendix A: Selection of the Value of the Clustering Coefficient

| Value of β | MoEC |
| --- | --- |
| 1e-3 | 32.21 |
| 5e-3 | 32.17 |
| 1e-2 | 32.32 |
| 5e-2 | 31.21 |

Table 5: The performance of MoEC with different β coefficients on WMT-14. All experiments are conducted with 64 experts, cluster size = 8, and expert dropout rate = 0.25.

Table 5 presents the experiments on selecting the best value of β. MoEC works best when β is set to 1e-2. When the β value is too large, the performance of MoEC drops significantly, which confirms our analysis in the main text. Based on these results, we uniformly set the value of β to 10^-2 as the default in all experiments above.

## Appendix B: Architecture Parameters

Table 6 presents the architecture parameters for the different tasks. For the WMT-14 task, we apply a larger Transformer model with more attention heads and larger embedding dimensions. As our backbone architecture, X-MoE (Chi et al. 2022) uses half of the number of experts as the routing dimension.

| | WMT-14 | Pretrain & GLUE |
| --- | --- | --- |
| Transformer blocks | 12 | 12 |
| Attention heads | 16 | 12 |
| Encoder ebd | 1024 | 768 |
| Decoder ebd | 1024 | 768 |
| FFN ebd | 4096 | 3072 |
| Experts | 16, 32, 64, 128 | 16, 32, 64, 128 |
| Routing dimension | 8, 16, 32, 64 | 8, 16, 32, 64 |
| MoE layers | 2 | 1 |
| Sub-layers | 3 | 3 |

Table 6: Architecture parameters for all tasks. "ebd" is short for "embedding".

## Appendix C: Training Hyper-Parameters

Table 7 presents the training hyper-parameters for WMT-14 and pre-training. To avoid the two regularization methods (MoE dropout and our cluster-level dropout) affecting each other, we do not use MoE expert dropout, for a fair comparison.

| | WMT-14 En-De | Pre-train |
| --- | --- | --- |
| Optimizer | Adam | Adam |
| Adam ϵ | 1e-6 | 1e-6 |
| Adam β | (0.9, 0.98) | (0.9, 0.98) |
| Training steps | 32k | 125k |
| Batch size | 8k | 2k |
| Maximum LR | 5e-4 | 5e-4 |
| LR scheduler | inverse sqrt | inverse sqrt |
| Warmup steps | 4k | 4k |
| Weight decay | 0 | 0.01 |
| Dropout | 0.3 | 0.1 |
| Attention dropout | 0.1 | 0 |
| Gradient clip norm | 0.1 | 0.1 |
| Label smoothing | 0.1 | - |
| Capacity factor | 2 | 2 |
| MoE dropout | 0 | 0 |
| Coefficient α | 0.01 | 0.01 |

Table 7: Training hyper-parameters for WMT-14 and pre-training. "LR" represents "learning rate".

Table 8 presents the training hyper-parameters for the downstream GLUE tasks. For GLUE tasks with a relatively large amount of training data (MNLI, SST-2, QQP, QNLI), we set smaller numbers of training epochs. For those with limited data (CoLA, STS-B, MRPC, RTE), we try to increase the number of training epochs to 10 or more.

| | CoLA, RTE | STS-B | MRPC | Else |
| --- | --- | --- | --- | --- |
| BSZ | 32 | 32 | 32 | 32 |
| Epochs | 3, 5, 10 | 10, 15, 20 | 5, 10, 15, 20 | 3, 5 |
| LR | [1, 2, 4]e-5 | [1, 2, 4]e-5 | [1, 2, 4]e-5 | [1, 2, 4]e-5 |
| Warm | 16 | 16 | 16 | 16 |
| Seed | 1, 2, 3 | 2, 42, 123 | 2, 42, 123 | 1, 2, 3 |

Table 8: Training hyper-parameters for GLUE. "BSZ" represents "batch size", "LR" represents "learning rate", and "Warm" represents "warmup steps".

## References

Bao, H.; Dong, L.; and Wei, F. 2021. BEiT: BERT pre-training of image transformers. arXiv preprint arXiv:2106.08254.
Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J. D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33: 1877-1901.
Cer, D.; Diab, M.; Agirre, E.; Lopez-Gazpio, I.; and Specia, L. 2017. SemEval-2017 Task 1: Semantic textual similarity multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055.
Chen, T.; Huang, S.; Xie, Y.; Jiao, B.; Jiang, D.; Zhou, H.; Li, J.; and Wei, F. 2022. Task-Specific Expert Pruning for Sparse Mixture-of-Experts. arXiv preprint arXiv:2206.00277.
Chi, Z.; Dong, L.; Huang, S.; Dai, D.; Ma, S.; Patra, B.; Singhal, S.; Bajaj, P.; Song, X.; and Wei, F. 2022. On the Representation Collapse of Sparse Mixture of Experts. arXiv preprint arXiv:2204.09179.
Conneau, A.; and Lample, G. 2019. Cross-lingual language model pretraining. Advances in Neural Information Processing Systems, 32.
Dai, D.; Dong, L.; Ma, S.; Zheng, B.; Sui, Z.; Chang, B.; and Wei, F. 2022. StableMoE: Stable routing strategy for mixture of experts. arXiv preprint arXiv:2204.08396.
Dolan, B.; and Brockett, C. 2005. Automatically constructing a corpus of sentential paraphrases. In Third International Workshop on Paraphrasing (IWP2005).
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
Dua, D.; Bhosale, S.; Goswami, V.; Cross, J.; Lewis, M.; and Fan, A. 2021. Tricks for Training Sparse Translation Models. arXiv preprint arXiv:2110.08246.
Fedus, W.; Zoph, B.; and Shazeer, N. 2021. Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity. arXiv preprint arXiv:2101.03961.
Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kumatani, K.; Gmyr, R.; Salinas, F. C.; Liu, L.; Zuo, W.; Patel, D.; Sun, E.; and Shi, Y. 2021. Building a great multi-lingual teacher with sparsely-gated mixture of experts for speech recognition. arXiv preprint arXiv:2112.05820.
Lepikhin, D.; Lee, H.; Xu, Y.; Chen, D.; Firat, O.; Huang, Y.; Krikun, M.; Shazeer, N.; and Chen, Z. 2020. GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668.
Lewis, M.; Bhosale, S.; Dettmers, T.; Goyal, N.; and Zettlemoyer, L. 2021. BASE Layers: Simplifying training of large, sparse models. In International Conference on Machine Learning, 6265-6274. PMLR.
Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; and Zettlemoyer, L. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
Liu, R.; Kim, Y. J.; Muzio, A.; Mozafari, B.; and Awadalla, H. H. 2022. Gating Dropout: Communication-efficient regularization for sparsely activated transformers. arXiv preprint arXiv:2205.14336.
Lou, Y.; Xue, F.; Zheng, Z.; and You, Y. 2021. Sparse-MLP: A fully-MLP architecture with conditional computation. arXiv preprint arXiv:2109.02008.
Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; and Liu, P. J. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P. 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
Riquelme, C.; Puigcerver, J.; Mustafa, B.; Neumann, M.; Jenatton, R.; Susano Pinto, A.; Keysers, D.; and Houlsby, N. 2021. Scaling vision with sparse mixture of experts. Advances in Neural Information Processing Systems, 34.
Roller, S.; Sukhbaatar, S.; Weston, J.; et al. 2021. Hash layers for large sparse models. Advances in Neural Information Processing Systems, 34.
Shazeer, N.; Mirhoseini, A.; Maziarz, K.; Davis, A.; Le, Q.; Hinton, G.; and Dean, J. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.
Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C. D.; Ng, A. Y.; and Potts, C. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 1631-1642.
Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1): 1929-1958.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.
Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; and Bowman, S. R. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
Warstadt, A.; Singh, A.; and Bowman, S. R. 2019. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7: 625-641.
Williams, A.; Nangia, N.; and Bowman, S. R. 2017. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426.
Wu, L.; Liu, M.; Chen, Y.; Chen, D.; Dai, X.; and Yuan, L. 2022. Residual Mixture of Experts. arXiv preprint arXiv:2204.09636.
Xue, F.; He, X.; Ren, X.; Lou, Y.; and You, Y. 2022. One Student Knows All Experts Know: From Sparse to Dense. arXiv preprint arXiv:2201.10890.
Xue, F.; Shi, Z.; Wei, F.; Lou, Y.; Liu, Y.; and You, Y. 2021. Go wider instead of deeper. arXiv preprint arXiv:2107.11817.
Zhu, Y.; Kiros, R.; Zemel, R.; Salakhutdinov, R.; Urtasun, R.; Torralba, A.; and Fidler, S. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, 19-27.