Published as a conference paper at ICLR 2025

TIGHT CLUSTERS MAKE SPECIALIZED EXPERTS

Stefan K. Nielsen, FPT Software AI Center, stefannvkp@fpt.com
Rachel S.Y. Teo, Department of Mathematics, National University of Singapore, rachel.tsy@u.nus.edu
Laziz U. Abdullaev, Department of Mathematics, National University of Singapore, laziz.abdullaev@u.nus.edu
Tan M. Nguyen, Department of Mathematics, National University of Singapore, tanmn@nus.edu.sg

(Co-first authors. Please correspond to: stefannvkp@fpt.com and tanmn@nus.edu.sg.)

ABSTRACT

Sparse Mixture-of-Experts (MoE) architectures have emerged as a promising approach to decoupling model capacity from computational cost. At the core of the MoE model is the router, which learns the underlying clustering structure of the input distribution in order to send input tokens to appropriate experts. However, latent clusters may be unidentifiable in high dimension, which causes slow convergence, susceptibility to data contamination, and overall degraded representations as the router is unable to perform appropriate token-expert matching. We examine the router through the lens of clustering optimization and derive optimal feature weights that maximally identify the latent clusters. We use these weights to compute the token-expert routing assignments in an adaptively transformed space that promotes well-separated clusters, which helps identify the best-matched expert for each token. In particular, for each expert cluster, we compute a set of weights that scales features according to whether that expert clusters tightly along that feature. We term this novel router the Adaptive Clustering (AC) router. Our AC router enables the MoE model to obtain three connected benefits: 1) faster convergence, 2) better robustness to data corruption, and 3) overall performance improvement, as experts are specialized in semantically distinct regions of the input space. We empirically demonstrate the advantages of our AC router over baseline routing methods when applied on a variety of MoE backbones for language modeling and image recognition tasks in both clean and corrupted settings. The code is publicly available at https://github.com/stefvk/ACMoE.

1 INTRODUCTION

Scaling up model capacity continues to deliver substantial performance gains across a wide range of tasks, with particularly impressive results in visual representation learning and language modeling (Alexey, 2020; Bao et al., 2021; Radford et al., 2019; Raffel et al., 2020; Nguyen et al., 2023). However, larger models incur growing computational costs, prompting increasing research into Sparse Mixture-of-Experts (MoE) models, which offer a promising avenue to balancing model scale with efficiency by activating only sub-modules of the network, termed experts, during training and inference (Shazeer et al., 2017; Fedus et al., 2022; Lepikhin et al., 2020; Nguyen et al., 2025). This approach has been shown to achieve better performance than dense models with nearly constant computational overhead on tasks spanning speech recognition, image recognition, machine translation, and language modeling (Riquelme et al., 2021; Kumatani et al., 2021; Lepikhin et al., 2020; Teo & Nguyen, 2025a).

At the core of the MoE layer is the learned router which assigns inputs to the relevant experts. The router must learn to segment the input space appropriately such that inputs and experts are well matched, enabling the experts to be trained on semantically similar data.
Figure 1: ACMoE discovers semantically distinct regions. We show 14x14 image reconstructions where patches are colored by assigned experts. Top row: Swin assigns large chunks of foreground and background to one expert (red), while ACMoE accurately discovers the bird and relevant foreground. Bottom row: When the background and foreground are hard to distinguish, Swin's router fails to register the stingray (left) or shark (right) and allocates one expert for virtually the entire image. ACMoE, however, discovers the semantically distinct regions, using one expert (green) to specialize on the foreground and different experts for the background.

This expert specialization allows MoE models to produce better representations than their dense counterparts while activating only a fraction of the total parameters. Recently, various methods have been proposed to find optimal expert-token matches, including linear programs (Lewis et al., 2021), cosine similarity-based rules (Chi et al., 2022), soft assignments via convex combinations of inputs (Puigcerver et al., 2023), and both top-k experts per token (Shazeer et al., 2017) and top-k tokens per expert (Zhou et al., 2022b).

We note that the above approaches fundamentally rely on dot-products between inputs and experts to learn the corresponding assignment, which might be suboptimal in cases where the semantic regions are not easily discoverable in the high-dimensional feature space. Typically, we expect that the true underlying clusters present in the data will cluster on different, potentially disjoint, subsets of features, and may not be discoverable when using the full feature set. This phenomenon can lead to slow convergence as the experts are unable to specialize on semantically similar regions of the data, poor robustness as data contamination can spuriously assign inputs to unsuitable experts, and degraded overall downstream performance due to suboptimal input-expert matching.

Contribution. In this work, we propose the Adaptive Clustering (AC) router and corresponding Adaptive Clustering Mixture-of-Experts (ACMoE), a novel MoE method in which the router computes token-expert assignments in a transformed space that maximally identifies latent clusters in the data and more easily discovers the best-matched expert for each token. More specifically, we adaptively learn for each input which features best determine its cluster assignment and scale its features accordingly such that features that promote tight expert clusters are upweighted, and features that produce dispersed expert clusters are downweighted. This transformation accentuates the relevant characteristics of each input according to the specialization of the experts, thereby allowing the router to more easily discover the optimal input-expert allocation. Computing the routing assignments following this scheme produces three benefits: 1) faster convergence as experts are able to specialize more quickly by being allocated semantically similar inputs, 2) better robustness as latent clusters are better separated, thereby minimizing the risk that data corruption erroneously assigns tokens to unsuitable experts, and 3) better overall representations and downstream performance due to improved expert specialization.
In order to discover the key features per token and their corresponding weights, we present a feature-weighted clustering optimization perspective on the MoE framework and demonstrate how the clustering solution obtains the required feature weights. We show how these weights can be integrated into the routing mechanism such that routing takes place in a cluster-adaptive transformed space. We theoretically prove that our proposed routing mechanism learns the latent clustering structure of the data faster than standard routing mechanisms and that our mechanism is more robust to data contamination. Furthermore, our proposed method involves no learnable parameters and can be computed highly efficiently. In summary, our contributions are three-fold:

1. We develop the novel Adaptive Clustering router, a routing method in MoE architectures that computes token-expert assignments in a transformed space that promotes separation of latent clusters in the data and more easily identifies the best-matched expert for each token.
2. We propose a feature-weighted clustering optimization perspective on token-expert assignment and derive the optimal feature weights for adaptively transforming the input data for routing.
3. We derive a theoretical framework demonstrating how MoE robustness and convergence depend on the shape of latent clusters and the clustering geometry of the input space.

Figure 2: Fast Convergence of ACMoE. Left: Convergence speed on WikiText-103 pretraining using the Generalist Language Model (Du et al., 2022) backbone. Right: Convergence speed on Banking-77 finetuning using the Switch Transformer (Fedus et al., 2022) backbone. Across both backbones and tasks, we observe substantially faster convergence. We display final test perplexity (PPL) and accuracy (Acc.), showing better overall performance as well.

We empirically demonstrate that 1) the Adaptive Clustering router outperforms baseline routing methods in MoE architectures on large-scale tasks such as WikiText-103 language modeling and downstream finetuning, and ImageNet-1k object classification in both clean and contaminated settings, 2) the Adaptive Clustering router exhibits faster convergence than baseline methods, and 3) the Adaptive Clustering router attains these performance improvements for free, that is, with no learnable parameters and negligible computational overhead.

Preliminaries. We consider Transformer-based (Vaswani, 2017) MoE architectures and follow the approach of previous work where the MoE layer is inserted after the self-attention layer within the Transformer, replacing the traditional feed-forward network (Fedus et al., 2022; Du et al., 2022; Liu et al., 2021). Let x be an input token with hidden representation h \in \mathbb{R}^d and let e_1, e_2, \dots, e_N \in \mathbb{R}^d be the N learnable expert embeddings for model hidden dimension d. The MoE layer selecting the top k experts is described by the following equations:

K = \mathrm{topk}(s_k) = \mathrm{topk}(h^\top e_k),    (1)
f_{\mathrm{SMoE}}(h) = h + \sum_{k \in K} g(h^\top e_k)\, f^{\mathrm{FFN}}_k(h),    (2)

where f^{\mathrm{FFN}}_k is the kth expert feed-forward network, s_k = h^\top e_k is the similarity score between token representation h and the kth expert embedding e_k, and g(\cdot) is a gating function often chosen as the softmax, g(s_k) = \exp(s_k) / \sum_{j \in K} \exp(s_j). We refer to Eqn. 1 as the router, which learns the top k best-matched experts per token, and Eqn. 2 as the overall standard MoE layer.
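For concreteness, the following is a minimal PyTorch sketch of the standard SMoE layer in Eqns. (1)-(2). It is illustrative rather than the authors' implementation: the class name, the expert width d_ff, and the dense per-expert loop are assumptions made for readability (efficient implementations dispatch tokens to experts in parallel).

```python
# Minimal sketch of a standard SMoE layer (Eqns. 1-2), assuming PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    def __init__(self, d_model: int, num_experts: int = 16, top_k: int = 2, d_ff: int = 2048):
        super().__init__()
        # Learnable expert embeddings e_1, ..., e_N used by the router.
        self.expert_embeddings = nn.Parameter(torch.randn(num_experts, d_model) / d_model**0.5)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        self.top_k = top_k

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (num_tokens, d_model). Routing scores s_k = h^T e_k (Eqn. 1).
        scores = h @ self.expert_embeddings.t()                # (num_tokens, num_experts)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)  # K = topk(s_k)
        gates = F.softmax(top_scores, dim=-1)                  # softmax gating g(.) over selected experts
        out = h.clone()                                        # residual term in Eqn. (2)
        for slot in range(self.top_k):
            for k in range(len(self.experts)):
                mask = top_idx[:, slot] == k
                if mask.any():
                    out[mask] = out[mask] + gates[mask, slot:slot + 1] * self.experts[k](h[mask])
        return out
```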
Organization. We structure this paper as follows: In Section 2, we present a clustering optimization problem and show that its solution adaptively scales the feature space according to which dimensions promote tight clustering. In Section 3, we present how the solution to our clustering optimization problem can be built into our proposed AC router, and we provide the full technical formulation of AC routing and Adaptive Clustering Mixture-of-Experts (ACMoE). We then present theoretical propositions on faster convergence and robustness. We empirically validate the advantages of ACMoE in Section 4 and discuss related work in Section 5. We end with concluding remarks and future work in Section 6. Proofs, technical details, and further experiments are provided in the Appendix.

2 A CLUSTERING OPTIMIZATION PERSPECTIVE

We begin by examining the MoE router through the lens of feature-weighted clustering (Witten & Tibshirani, 2010; Friedman & Meulman, 2004; Brusco & Cradit, 2001; Gnanadesikan et al., 1995). We explicitly model the router's task as learning a token assignment that groups together similar tokens. We consider the role of learnable feature weights in solving a clustering optimization problem to optimally reveal latent clusters and present an analytical solution for the optimal weights for any given routing assignment. We finally discuss how this solution improves the MoE router before providing the full formulation of our AC router and ACMoE in the next section.

2.1 CLUSTERING OPTIMIZATION

Let h_i = [h_{i1}, \dots, h_{id}] be the ith hidden representation and let D_{ij} denote the distance between h_i and h_j. Given a distance metric \rho_{ijq} between h_{iq} and h_{jq} over the qth dimension, the distance between h_i and h_j can be defined as D_{ij}(w) = \sum_{q \in [d]} w_q \rho_{ijq} for weights w = [w_1, \dots, w_d] with \sum_{q \in [d]} w_q = 1 and w_q \geq 0 for all q \in [d]. The weights determine the global importance of the qth feature to the overall distance among representations.

Cluster analysis aims to divide the input set of N objects into groups, where objects within the same group are more similar to each other than to those in other groups. This is formalized using a classifier r(i) = k, assigning the ith object to a group k. Then the optimal classifier r^* minimizes a criterion Q(r) that evaluates clustering quality:

r^* = \arg\min_{r} Q(r) = \arg\min_{r} \sum_{k \in [E]} \frac{1}{N_k^2} \sum_{r(i)=k} \sum_{r(j)=k} D_{ij}(w).    (3)

We expect that different groupings will cluster on different subsets of features. In particular, we wish to model the scenario that groupings exist in different latent subspaces with varying dependence on possibly disjoint subsets of features. We therefore replace the global feature weight w in Eqn. 3 with cluster-dependent feature weights \{w_k\}_{k=1}^{E} for E groups, which allows us to capture the differing feature dependencies of each cluster. Then, we can adapt the optimization problem with these cluster-dependent feature weights as follows:

(r^*, \{w_k^*\}_{k=1}^{E}) = \arg\min_{r, \{w_k\}} \sum_{k \in [E]} \frac{1}{N_k^2} \sum_{r(i)=k} \sum_{r(j)=k} D^{J}_{ij}(w_k), \quad \text{such that } \sum_{q \in [d]} w_{qk} = 1, \ \forall k \in [E],    (4)

where D^{J}_{ij}(w_k) = \sum_{q=1}^{d} w_{qk}\rho_{ijq} + \lambda J(w_k) denotes the weighted distance between i and j combined with some regularization J and regularization strength \lambda. To avoid point-mass solutions in which we assign all weight to the single best-clustering feature, we set the regularizer to the Kullback-Leibler divergence between the feature weights w_k and the uniform distribution u = (1/d, \dots, 1/d) \in \mathbb{R}^d, denoted by J(w_k) = D_{KL}(u \,\|\, w_k). The regularization parameter \lambda reflects our preference to maintain more or fewer features in the solution set.
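To make the objective in Eqn. (4) concrete, the sketch below evaluates the feature-weighted criterion for a fixed assignment. It assumes the squared coordinate-wise difference as the choice of \rho_{ijq} (the paper leaves \rho generic), and the function name and dense pairwise computation are illustrative only.

```python
# Sketch of the feature-weighted clustering criterion in Eqn. (4), assuming
# rho_ijq = (h_iq - h_jq)^2 and a KL regularizer toward the uniform distribution.
import numpy as np

def weighted_criterion(H: np.ndarray, assign: np.ndarray, W: np.ndarray, lam: float) -> float:
    """H: (N, d) tokens; assign: (N,) cluster ids in [E]; W: (E, d) per-cluster weights
    whose rows sum to 1; lam: regularization strength lambda."""
    E, d = W.shape
    total = 0.0
    for k in range(E):
        Hk = H[assign == k]
        Nk = len(Hk)
        if Nk == 0:
            continue
        # Sum of rho_ijq over all pairs (i, j) assigned to cluster k, per feature q.
        pair_sum = ((Hk[:, None, :] - Hk[None, :, :]) ** 2).sum(axis=(0, 1))  # (d,)
        # The per-pair lambda*J(w_k) term, summed over N_k^2 pairs and divided by N_k^2,
        # contributes lambda * D_KL(u || w_k) once per cluster.
        kl_to_uniform = np.sum((1.0 / d) * np.log((1.0 / d) / W[k]))
        total += (W[k] * pair_sum).sum() / Nk**2 + lam * kl_to_uniform
    return total
```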
2.2 MOE AS CLUSTERING OPTIMIZATION

Within the MoE framework with learnable routing, the router performs the role of the classifier r : \mathbb{R}^d \to [E], which is learned via gradient descent to optimize the final output loss.[1] Therefore, we modify Eqn. 4 by fixing r and focusing just on optimizing the criterion with respect to the cluster-wise feature weights w_k. Under this interpretation, the router learns via backpropagation to optimally allocate representations to experts, with representations adaptively transformed to maximally reveal the clustering structure of the input data. Eqn. 4 then becomes

\{w_k^*\}_{k=1}^{E} = \arg\min_{\{w_k\}} \sum_{k \in [E]} \frac{1}{N_k^2} \sum_{r(i)=k} \sum_{r(j)=k} D^{J}_{ij}(w_k), \quad \text{such that } \sum_{q \in [d]} w_{qk} = 1, \ \forall k \in [E].    (5)

[1] A top-k router can straightforwardly be cast as the classifier in Eqn. 4 as r : \mathbb{R}^d \to [E]^k.

The following theorem presents the optimal weights per feature q and cluster k:

Theorem 1 (Optimal feature weights). Let s_{qk} = N_k^{-2} \sum_{r(i)=k} \sum_{r(j)=k} \rho_{ijq} be a measure of dispersion on the qth feature for the representations assigned to cluster k. Then, for a given router function r : \mathbb{R}^d \to [E], the corresponding optimal weights \{w_k\}_{k \in [E]} that minimize the feature-weighted clustering optimization problem in Eqn. 5 are given by

w_{qk} = \frac{\lambda / d}{s_{qk} + \alpha_k}    (6)

for (q, k) \in [d] \times [E], where \{\alpha_k\}_{k \in [E]} are constants that for any \lambda > 0 satisfy

\sum_{q \in [d]} \frac{1}{s_{qk} + \alpha_k} = \frac{d}{\lambda}.    (7)

The existence of \alpha_k satisfying Eqn. 7 and the proof of Theorem 1 are provided in Appendix A.1.

The optimal weights for a cluster k given in Eqn. 6 take an intuitive form in that they are inversely proportional to the measure of dispersion in cluster k along each dimension, w_k \propto [1/s_{1k}, \dots, 1/s_{dk}]. Hence, the optimal cluster-wise feature weights scale features according to their contribution to forming tight clusters. Specifically, the solution upweights a feature q if cluster k clusters tightly (has small dispersion s_{qk}) along feature q and downweights a feature p if cluster k clusters loosely (has large dispersion s_{pk}) along feature p.

This method enables the MoE router to perform better token-expert matching. The cluster-wise feature weights w_k capture the features on which the kth expert is specialized, as large weights indicate those features are highly important to the identification of that expert cluster and small weights indicate those features are unimportant to identification of that expert cluster. Then, we can use w_k to scale the tokens to accentuate their features according to the specialization of the experts, thereby allowing the router to best identify the most suitable expert for each token. Note that this solution is local in that we learn the optimal weights adaptively per cluster, obtaining w_k for all k \in [E], and so we compute a unique scaling of the feature space adaptively per cluster as well. Integrating these cluster-dependent weights, which scale the feature space according to the identification of each expert, into the MoE router obtains our AC routing method and corresponding ACMoE. We detail the AC router and ACMoE fully in the next section.
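Theorem 1 gives the weights in closed form up to the constant \alpha_k, which is the root of Eqn. (7). The sketch below solves for \alpha_k numerically; we take the root on the interval to the right of -\min_q s_{qk}, which is the one yielding strictly positive weights. The function name and the bisection scheme are illustrative assumptions, not the paper's procedure.

```python
# Illustrative computation of the Theorem 1 weights for one cluster k:
# solve Eqn. (7) for alpha_k by bisection, then apply Eqn. (6).
import numpy as np

def optimal_weights(s_k: np.ndarray, lam: float) -> np.ndarray:
    d = len(s_k)
    f = lambda alpha: np.sum(1.0 / (s_k + alpha)) - d / lam  # residual of Eqn. (7)
    lo = -s_k.min() + 1e-12          # just right of the rightmost pole, where f -> +inf
    hi = lo + 1.0
    while f(hi) > 0:                 # expand until the residual changes sign
        hi = lo + 2.0 * (hi - lo)
    for _ in range(100):             # bisection: f(lo) > 0 >= f(hi)
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    alpha = 0.5 * (lo + hi)
    w = (lam / d) / (s_k + alpha)    # Eqn. (6)
    return w / w.sum()               # numerical cleanup; the sum is already ~1

s_k = np.array([0.2, 0.5, 1.0, 4.0])   # tight along the first feature, loose along the last
print(optimal_weights(s_k, lam=1.0))   # larger weight on features with smaller dispersion
```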
3 A TIGHT CLUSTER IS A SPECIALIZED EXPERT

In this section, we demonstrate how we implement the solution weights from the clustering optimization problem in Eqn. 6 in the MoE routing mechanism, thereby obtaining the Adaptive Clustering router. We then provide the full technical formulation of our proposed routing method and corresponding ACMoE model. We also present theoretical results on how computing the routing assignments according to our framework promotes faster convergence and robustness.

3.1 FULL TECHNICAL FORMULATION

We integrate the weights from Eqn. 6 into the Adaptive Clustering router transformation in Definition 1 which, for a cluster k, scales the dimensions of the feature space according to the kth expert's specialization on those features. Formally this is:

Definition 1 (Adaptive Clustering Router Transformation M_k). Let C^{\ell}_k = \{h^{\ell}_1, \dots, h^{\ell}_{N_k}\} be the representations assigned to expert k at layer \ell. Let s^{\ell}_{qk} \in \mathbb{R} be a measure of spread in the qth dimension for cluster k, such as the mean absolute deviation s^{\ell}_{qk} = \frac{1}{N_k} \sum_{i \in C^{\ell}_k} |h^{\ell}_{iq} - \bar{h}^{\ell}_q|, where \bar{h}^{\ell}_q denotes the mean of the qth feature over C^{\ell}_k. Then, the cluster-dependent router transformation for expert k at layer \ell is given by the diagonal matrix M^{\ell}_k = \mathrm{diag}(1/s^{\ell}_{1k}, \dots, 1/s^{\ell}_{dk}).

We use the transformation M_k in Definition 1 to adaptively scale the feature space in which we perform token-expert matching. This obtains our Adaptive Clustering router and corresponding ACMoE layer, described in the following definition.

Definition 2 (Adaptive Clustering Router and MoE Layer). Let h^{\ell} \in \mathbb{R}^d be the hidden representation of an input and e^{\ell}_1, \dots, e^{\ell}_N \in \mathbb{R}^d be the expert embeddings at layer \ell. Let h^{\ell-1} \in C^{\ell-1}_k have been assigned to expert k in the previous layer, and let M^{\ell-1}_k \in \mathbb{R}^{d \times d} be the Adaptive Clustering transformation (Definition 1) for input h at layer \ell-1. Let g(\cdot) be the softmax function. Then the following equations describe the Adaptive Clustering router (Eqn. 8) and overall ACMoE layer (Eqn. 9):

K = \mathrm{topk}(s_k) = \mathrm{topk}\big(h^{\ell\top} M^{\ell-1}_k e^{\ell}_k\big),    (8)
f_{\mathrm{ACMoE}}(h^{\ell}) = h^{\ell} + \sum_{k \in K} g\big(h^{\ell\top} M^{\ell-1}_k e^{\ell}_k\big)\, f^{\mathrm{FFN},\ell}_k(h^{\ell}).    (9)

Remark 1. We see from Eqns. 8 and 9 that the standard MoE layer is recovered by setting the AC router transformation to the identity matrix, M_k = I_d for all k \in [E]. Within our framework then, standard routing schemes implicitly assume all experts k \in [E] depend equally on all dimensions.

Remark 2. The Adaptive Clustering router computes a dot-product between h and experts e_k with the dimensions scaled by the weights in M_k and so is proportional to a Mahalanobis distance. Under this interpretation, we soft project the tokens and expert embeddings onto the axes of the feature space that best identify the expert cluster k.

Implementation details. Given that ACMoE requires the expert assignment from the previous layer to compute the routing assignment (Eqn. 8), ACMoE is only implementable after the first layer. Furthermore, we scale the measures of dispersion in M^{\ell}_k = \mathrm{diag}(1/s^{\ell}_{1k}, \dots, 1/s^{\ell}_{dk}) to have mean 1. This is to remove the effect of different clusters or features having different absolute magnitudes. Our method is concerned with identifying the key sets of features that contribute more or less to identification of the expert clusters, and so we wish to compute our scaling in a relative sense.
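The sketch below illustrates one way to realize Definitions 1 and 2 in PyTorch. It reads Eqn. (8) as applying the transformation attached to each token's previous-layer expert assignment, and it scales the dispersions to mean 1 as in the implementation details above; the function names and this reading are our assumptions, not the authors' released code.

```python
# Sketch of the AC router transformation (Definition 1) and AC routing scores (Eqn. 8), assuming PyTorch.
import torch

def ac_transforms(h_prev: torch.Tensor, prev_assign: torch.Tensor, num_experts: int,
                  eps: float = 1e-6) -> torch.Tensor:
    """h_prev: (n, d) representations at layer l-1; prev_assign: (n,) expert ids at layer l-1.
    Returns the diagonal entries of M_k for each expert, shape (num_experts, d)."""
    d = h_prev.shape[-1]
    m = torch.ones(num_experts, d, device=h_prev.device)
    for k in range(num_experts):
        Ck = h_prev[prev_assign == k]                      # tokens routed to expert k at layer l-1
        if len(Ck) > 1:
            s_k = (Ck - Ck.mean(dim=0)).abs().mean(dim=0)  # mean absolute deviation per feature
            s_k = s_k / s_k.mean().clamp_min(eps)          # rescale dispersions to have mean 1
            m[k] = 1.0 / (s_k + eps)                       # diagonal of M_k = diag(1/s_1k, ..., 1/s_dk)
    return m

def ac_routing_scores(h: torch.Tensor, expert_emb: torch.Tensor, m: torch.Tensor,
                      prev_assign: torch.Tensor) -> torch.Tensor:
    # s_k = h^T M e_k, with M chosen per token from its previous-layer assignment.
    Mh = h * m[prev_assign]                                # (n, d): soft projection of tokens
    return Mh @ expert_emb.t()                             # (n, num_experts) routing scores
```

Setting `m` to all ones recovers the standard routing scores of Eqn. (1), matching Remark 1.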
3.2 ADAPTIVE CLUSTERING PROMOTES ROBUSTNESS AND FAST CONVERGENCE

We now present theoretical propositions on the improved robustness and convergence speed of our method. The robustness of our method follows from better separation of expert clusters. This produces a more stable assignment in which the probability of erroneously sending a token to unsuitable nearby experts decays exponentially with increased inter-cluster distance. Faster convergence follows from our AC routing method improving the conditioning of the Hessian of the loss with respect to the expert embeddings, enabling faster and more stable convergence of the router.

Promoting robustness. We begin with Lemma 1 stating that our AC transformation (Definition 1) increases the separation between clusters in the transformed space, followed by Lemma 2, which provides an explicit expression for the probability of incorrect expert assignment. To give the probability bound an exact form, we assume the cluster structure can be modeled as a Gaussian mixture model (GMM). We note that GMMs are a highly expressive and general framework, so this assumption does not place significant restrictions on our robustness analysis. We further assume that though clusters may overlap, they are well-separated along the features for which they cluster tightly.[2]

[2] Intuitively, this assumption captures the natural property that the semantic regions of the input space are distinct along the dimensions that best identify them.

Lemma 1 (Adaptive Clustering Router Transformation Increases Cluster Separation). Let the data be generated from a Gaussian mixture model with components g_c = \mathcal{N}(\mu_c, \Sigma_c) for c \in [E]. Without loss of generality, consider two expert clusters c \in \{a, b\}, where a token representation h \sim g_a belongs to cluster a. Let M_a = \mathrm{diag}(1/s_{1a}, \dots, 1/s_{da}) be the router transformation constructed from the feature-wise dispersions s_{qa} of cluster g_a for each feature q \in [d], as given by Definition 1. Then the distance between cluster means in the M_a-transformed space, defined as \|\mu_b - \mu_a\|^2_{M_a} = (\mu_b - \mu_a)^\top M_a (\mu_b - \mu_a), is larger than in the original Euclidean space:

\|\mu_b - \mu_a\|^2_{M_a} \geq \|\mu_b - \mu_a\|^2.

The proof is provided in Appendix A.2. In Lemma 2, we derive the probability of mis-assignment as a function of inter-cluster distance, showing how separation mitigates the effect of perturbations.

Lemma 2 (Incorrect Assignment Probability). Let h \sim \mathcal{N}(\mu_{k^*}, \Sigma_{k^*}) be a representation belonging to cluster k^*. Let h' = h + \epsilon be contaminated by some zero-mean noise \epsilon \sim (0, \Sigma_\epsilon). Let k be the nearest incorrect cluster to k^*, and let the inter-cluster mean distance between k and k^* be given by \delta\mu = \mu_k - \mu_{k^*}. Let the routing assignment be given by r : \mathbb{R}^d \to [E] and denote the cumulative density function of a standard normal distribution by \Phi. Then the probability of incorrect assignment is:

\Pr\big(r(h') \neq k^*\big) = 1 - \Phi\left( \frac{\|\delta\mu\|^2}{2\sqrt{\delta\mu^\top (\Sigma_{k^*} + \Sigma_\epsilon)\,\delta\mu}} \right).    (10)

Remark 3. It is worth noting that since 1 - \Phi(x) \leq (\sqrt{2\pi}\,x)^{-1} e^{-x^2/2} for large x and \sqrt{\delta\mu^\top (\Sigma_{k^*} + \Sigma_\epsilon)\,\delta\mu} = O(\|\delta\mu\|), we find that the probability of incorrect cluster assignment as given by Eqn. 10, \Pr(r(h') \neq k^*) = e^{-O(\|\delta\mu\|^2)}, is an exponentially decreasing function in \|\delta\mu\|.

The proof is provided in Appendix A.2. Combining Lemmas 1 and 2, we directly obtain that the probability of erroneous assignment using the AC router is exponentially smaller than under a standard routing scheme. This is formalized in Proposition 1, given by:

Proposition 1 (Robustness of ACMoE). Consider an expert assignment setting for the representation h \sim \mathcal{N}(\mu_{k^*}, \Sigma_{k^*}) as in Lemma 2 with two routers given by r : \mathbb{R}^d \to [E] and r_{AC} : \mathbb{R}^d \to [E] for the standard (Eqn. 2) and AC routers (Definition 2), respectively. Then the probabilities of incorrect assignment of routers r and r_{AC} satisfy \Pr(r_{AC}(h') \neq k^*) \leq \Pr(r(h') \neq k^*).
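The exponential decay in Remark 3 is easy to see numerically. The sketch below evaluates the mis-assignment probability of Eqn. (10) for increasing cluster separation; the isotropic covariances and specific values are illustrative assumptions, not quantities from the paper.

```python
# Numeric illustration of Lemma 2 / Remark 3: Pr(incorrect) decays rapidly with separation.
import numpy as np
from math import erf, sqrt

def std_normal_cdf(x: float) -> float:
    # Standard normal CDF Phi(x) via the error function.
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

d = 64
Sigma = np.eye(d)             # assumed cluster covariance Sigma_k*
Sigma_eps = 0.5 * np.eye(d)   # assumed noise covariance Sigma_eps

for dist in [1.0, 2.0, 4.0, 8.0]:
    dmu = np.zeros(d)
    dmu[0] = dist             # separation concentrated along a single feature
    denom = 2.0 * np.sqrt(dmu @ (Sigma + Sigma_eps) @ dmu)
    p_wrong = 1.0 - std_normal_cdf(float(dmu @ dmu) / denom)  # Eqn. (10)
    print(f"||dmu|| = {dist:4.1f}  ->  Pr(incorrect) ~ {p_wrong:.2e}")
```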
Promoting faster convergence. For an expert embedding e_k \in \mathbb{R}^d and associated cluster C_k, our AC router in Definition 2 adaptively spheres C_k by stretching the feature space with weights inversely proportional to the coordinate-wise dispersion in C_k. This reduces the condition number of the Hessian of the loss with respect to the expert e_k, improving the loss landscape and enabling faster and more stable convergence of the router. This notion is formalized in Proposition 2:

Proposition 2 (Faster convergence of ACMoE). Let L_{\mathrm{MoE}} : \Theta \to \mathbb{R}_+ and L_{\mathrm{ACMoE}} : \Theta \to \mathbb{R}_+ be the network loss functions defined on the whole parameter set \Theta when employing the standard (Eqn. 2) and AC routers (Definition 2), respectively. Let \kappa(A) = \lambda_{\max}/\lambda_{\min} denote the condition number of a matrix A with largest and smallest eigenvalues \lambda_{\max} and \lambda_{\min}, respectively. Let the Hessian of the loss with respect to the ith expert embedding be denoted \nabla^2_{e_i}. Then for each i \in [E] the following holds with high probability:

\kappa\big(\nabla^2_{e_i} L_{\mathrm{ACMoE}}\big) \leq \kappa\big(\nabla^2_{e_i} L_{\mathrm{MoE}}\big).    (11)

Remark 4. Faster convergence of ACMoE can also be argued from the perspective of learning Gaussian mixture models with Expectation Maximization (Dempster et al., 1977). The classic result of Ma et al. (2000) shows that the convergence rate to the true parameters depends on the overlap between component Gaussians. Our AC method adaptively transforms the input space by M_k (Definition 1), which decreases component overlap by increasing inter-cluster distances.

The proof is provided in Appendix A.3. We find this result empirically supported, as shown by the rapid convergence in Fig. 2.
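The sphering intuition behind Proposition 2 can be checked on synthetic data: rescaling an anisotropic cluster by the inverse per-feature dispersion (Definition 1) drives its covariance toward the identity and hence shrinks its condition number. The data and magnitudes below are assumptions for illustration.

```python
# Small numeric check of the sphering effect behind Proposition 2.
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 5000
stds = np.logspace(0, 2, d)                 # feature-wise spreads ranging from 1 to 100
X = rng.normal(size=(n, d)) * stds          # one synthetic, highly anisotropic expert cluster

mad = np.mean(np.abs(X - X.mean(axis=0)), axis=0)  # per-feature dispersion s_q (mean absolute deviation)
M = np.diag(1.0 / mad)                             # AC transformation M = diag(1/s_1, ..., 1/s_d)

cond_before = np.linalg.cond(np.cov(X, rowvar=False))
cond_after = np.linalg.cond(np.cov(X @ M, rowvar=False))
print(f"condition number before: {cond_before:.1f}, after AC scaling: {cond_after:.1f}")
```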
4 EXPERIMENTAL RESULTS

In this section, we empirically justify the advantage of ACMoE over baseline MoE models. We evaluate our method on large-scale tasks including WikiText-103 (Merity et al., 2016) language modeling and ImageNet (Deng et al., 2009) object classification. We implement our AC router in Switch Transformer (Fedus et al., 2022), Generalist Language Model (GLaM) (Du et al., 2022), and Swin Transformer (Liu et al., 2021) backbones and compare our router against the standard Sparse Mixture-of-Experts (SMoE) router (Shazeer et al., 2017) and the XMoE router (Chi et al., 2022). We show that i) ACMoE obtains substantive improvements over baseline models across both language and vision tasks; ii) ACMoE offers robust improvements on contaminated and out-of-distribution samples; and iii) ACMoE attains these gains without introducing any learnable parameters and with negligible additional computational overhead. Results are averaged over 5 runs with different seeds.

4.1 LANGUAGE MODELING

Experimental Setup. We adopt the experimental setup of Pham et al. (2024). We compare ACMoE with Switch Transformer and GLaM baselines with 16 total experts in small (70M parameters) and medium (220M parameters) configurations with top-2 expert routing. We present pretraining test perplexity (PPL) results for WikiText-103 and test bytes-per-character (BPC) for character-level EnWik-8. We report top-1 accuracy for finetuning classification tasks on the 2-class Stanford Sentiment Treebank-2 (SST2) (Socher et al., 2013), 5-class Stanford Sentiment Treebank-5 (SST5) (Socher et al., 2013), and 77-class Banking-77 (B77) (Casanueva et al., 2020). Full experimental details are provided in Appendix C.

Table 1: Test perplexity (PPL) and bytes-per-character (BPC) for pretraining, and top-1 test accuracy on Stanford Sentiment Treebank 2 and 5 (SST2, SST5) and Banking-77 (B77) finetuning classification.

EnWik-8 Pretrain
Model                                    | Test BPC (↓) | SST2 (↑) | SST5 (↑) | B77 (↑)
Switch Transformer (Fedus et al., 2022)  | 1.153        | 63.27    | 32.21    | 53.48
Switch-ACMoE (Ours)                      | 1.137        | 64.45    | 33.79    | 54.26

WikiText-103 Pretrain
Model                                    | Test PPL (↓) | SST2 (↑) | SST5 (↑) | B77 (↑)
Switch Transformer (Fedus et al., 2022)  | 35.48        | 76.27    | 39.13    | 83.82
Switch-ACMoE (Ours)                      | 34.42        | 77.32    | 40.04    | 86.01
GLaM (Du et al., 2022)                   | 38.27        | 69.97    | 33.69    | 80.89
GLaM-ACMoE (Ours)                        | 36.26        | 71.90    | 34.24    | 82.33

Table 2: Perplexity (PPL) on WikiText-103 contaminated by TextAttack.

Model                                    | Clean Test PPL (↓) | Contaminated Test PPL (↓)
Switch Transformer (Fedus et al., 2022)  | 35.48              | 48.12
Switch-ACMoE (Ours)                      | 34.42              | 47.61
GLaM (Du et al., 2022)                   | 38.27              | 50.84
GLaM-ACMoE (Ours)                        | 36.26              | 47.91

Pretraining and Finetuning. Table 3 shows that ACMoE attains the best test PPL on WikiText-103 language modeling in both the Switch and GLaM backbones at small and medium configurations, outperforming the baseline SMoE and XMoE routers. The improvement in the GLaM-medium architecture is a particularly substantive 4.8% over the next best baseline. Table 1 shows that ACMoE models pretrained on both WikiText-103 and EnWik-8 surpass the performance of baselines in finetuning tasks, with strong, consistent improvements of approximately 3%, showing that ACMoE's strong performance carries over to finetuning.

Table 3: WikiText-103 test PPL of ACMoE and baseline GLaM and Switch.

Switch Transformer (Fedus et al., 2022)
Router                               | Test PPL (↓)
SMoE-small (Shazeer et al., 2017)    | 87.94
XMoE-small (Chi et al., 2022)        | 87.21
ACMoE-small (Ours)                   | 85.07
SMoE-medium (Shazeer et al., 2017)   | 35.48
XMoE-medium (Chi et al., 2022)       | 35.88
StableMoE-medium (Dai et al., 2022)  | 35.33
ACMoE-medium (Ours)                  | 34.42

GLaM (Du et al., 2022)
Router                               | Test PPL (↓)
SMoE-small (Shazeer et al., 2017)    | 58.27
XMoE-small (Chi et al., 2022)        | 54.80
ACMoE-small (Ours)                   | 54.55
SMoE-medium (Shazeer et al., 2017)   | 38.27
XMoE-medium (Chi et al., 2022)       | 38.10
StableMoE-medium (Dai et al., 2022)  | 38.04
ACMoE-medium (Ours)                  | 36.26

Robust Language Modeling. Table 2 shows test PPL on WikiText-103 contaminated by TextAttack, where words are randomly swapped with a generic token "AAA". We follow the setup of Han et al. (2024); Teo & Nguyen (2025b); Abdullaev & Nguyen (2025) and assess models by training them on clean data before attacking the test data using an attack rate of 2.5%. ACMoE outperforms baseline Switch and GLaM, with particularly robust performance in the GLaM backbone, surpassing GLaM by 5.8%.
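For intuition about the contamination used in Table 2, the sketch below implements the style of corruption described above: each whitespace-separated word is independently replaced by the generic token "AAA" at a fixed attack rate. It is a self-contained illustration; the actual evaluation uses the TextAttack framework.

```python
# Sketch of the word-swap contamination described above (attack rate 2.5% in the paper).
import random

def contaminate(text: str, attack_rate: float = 0.025, token: str = "AAA", seed: int = 0) -> str:
    rng = random.Random(seed)
    words = text.split()
    corrupted = [token if rng.random() < attack_rate else w for w in words]
    return " ".join(corrupted)

print(contaminate("the quick brown fox jumps over the lazy dog", attack_rate=0.3))
```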
4.2 IMAGE CLASSIFICATION

Experimental Setup. We adopt the experimental setup of Liu et al. (2021) for pretraining and evaluation on ImageNet. We evaluate ACMoE against the Swin Transformer baseline with 16 total experts in both top-1 and top-2 expert routing settings. The Swin backbone has 280M parameters. We additionally conduct experiments on ImageNet under the white-box adversarial attacks fast gradient sign method (FGSM) (Goodfellow et al., 2014) and projected gradient descent (PGD) (Madry et al., 2017), and the black-box attack simultaneous perturbation stochastic approximation (SPSA) (Uesato et al., 2018). We also present results on out-of-distribution (OOD) image classification using ImageNet-A/O/R (Hendrycks et al., 2021a;b). In all robust image classification tasks, we adopt the conventional setup of pretraining on ImageNet and evaluating the trained models on the contaminated/OOD datasets (Han et al., 2024; Zhou et al., 2022a; Puigcerver et al., 2022; Nguyen et al., 2024; Nielsen et al., 2025). Full experimental details are provided in Appendix C.

Table 4: Test accuracy on ImageNet corrupted by PGD, FGSM, and SPSA.

Model                          | Clean Top 1 | Clean Top 5 | PGD Top 1 | PGD Top 5 | FGSM Top 1 | FGSM Top 5 | SPSA Top 1 | SPSA Top 5
Swin-Top 1 (Liu et al., 2021)  | 75.22       | 92.51       | 39.69     | 74.59     | 52.84      | 83.86      | 59.92      | 82.63
Swin-ACMoE-Top 1 (Ours)        | 75.39       | 92.56       | 40.66     | 73.46     | 53.43      | 82.80      | 59.97      | 82.47
Swin-Top 2 (Liu et al., 2021)  | 76.10       | 92.99       | 40.85     | 75.51     | 54.70      | 85.22      | 60.57      | 82.75
Swin-ACMoE-Top 2 (Ours)        | 76.31       | 93.14       | 43.74     | 78.55     | 55.78      | 85.80      | 63.47      | 86.05

Table 5: Test accuracy on image classification on ImageNet-A/O/R.

Model                                     | Im-A Top-1 Acc. (↑) | Im-R Top-1 Acc. (↑) | Im-O AUPR (↑)
Swin Transformer-Top 1 (Liu et al., 2021) | 6.83                | 30.60               | 17.89
Swin-ACMoE-Top 1 (Ours)                   | 7.13                | 30.85               | 18.45
Swin Transformer-Top 2 (Liu et al., 2021) | 9.38                | 32.07               | 18.51
Swin-ACMoE-Top 2 (Ours)                   | 9.42                | 32.35               | 19.55

Image Classification under Adversarial Attack. Table 4 shows performance on ImageNet classification against FGSM, PGD, and SPSA. Compared with the Swin baseline, ACMoE-Top 2 attains noteworthy relative improvements of 7% and 5% in top-1 accuracy against PGD and SPSA, respectively.

Out-of-distribution Image Classification. Table 5 shows that ACMoE improves over the baseline Swin Transformer in image classification on hard OOD and real-world adversarially filtered images. Evaluation on ImageNet-A/O/R shows consistent improvements over the baseline in top-1 and top-2 expert choice, with particularly strong improvements on ImageNet-O under top-2 routing, with a performance gain in area under the precision-recall curve (AUPR) of almost 6%.

4.3 EMPIRICAL ANALYSIS

Load Balancing. We analyze in Table 6 the effect of ACMoE on expert load balancing. Load balance is calculated as the percentage of tokens assigned to each expert. The load balance score is then taken as the standard deviation over these percentages. A standard deviation of 0, where all experts are activated in exactly equal proportions, is therefore a perfect load balance. We compute this statistic per MoE layer and present the overall load balance averaged over all layers. ACMoE attains better overall load balancing compared to Switch and Swin transformers. Across all backbones, ACMoE achieves a smaller spread in the load balances over layers, shown by a smaller standard deviation. Visually, we see how better expert specialization can aid load balance in Fig. 1, where better identification of the semantic regions of the input space leads to more experts being activated.

Table 6: Load balance analysis of ACMoE and baseline MoE models.

Model                                    | Layer-Averaged Load Balance (↓)
Switch Transformer (Fedus et al., 2022)  | 5.577 ± 4.131
Switch-ACMoE (Ours)                      | 5.317 ± 2.622
GLaM (Du et al., 2022)                   | 2.901 ± 1.434
GLaM-ACMoE (Ours)                        | 2.938 ± 1.221
Swin Transformer (Liu et al., 2021)      | 2.134 ± 1.110
Swin-ACMoE (Ours)                        | 2.127 ± 0.968

Efficiency Analysis. Computing the cluster-wise feature weights \{w_k\}_{k \in [E]} requires no learnable parameters and is obtained by computing the mean absolute deviation for each set of tokens assigned to the kth expert. This can be computed using just two computations of the mean, one for the mean per cluster and one for the mean of the absolute deviations per cluster, done in parallel over all clusters. This is of order O(2nd) = O(n) for n tokens (treating the hidden dimension d as constant), hence the upper-bound time complexity of the MoE layer is unaffected. Table 7 provides an empirical efficiency analysis in terms of compute speed, memory allocation, and parameters, which shows changes in speed and memory are within a margin of approximately 1% or less, implying there is no significant efficiency loss.

Table 7: Efficiency comparison between ACMoE and baseline MoE models.

Model                                    | Compute Speed (ms/it) | Max Memory (GB) | #Params (M)
GLaM (Du et al., 2022)                   | 422.62                | 25.69           | 220
GLaM-ACMoE (Ours)                        | 425.15                | 25.72           | 220
Switch Transformer (Fedus et al., 2022)  | 391.93                | 34.64           | 216
Switch-ACMoE (Ours)                      | 393.29                | 34.68           | 216
Swin Transformer (Liu et al., 2021)      | 403.36                | 22.00           | 280
Swin-ACMoE (Ours)                        | 408.56                | 22.19           | 280
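The two-mean computation described in the efficiency analysis can be expressed with scatter-style reductions; Appendix B mentions torch.index_reduce for this purpose. The sketch below is one way to realize it (PyTorch >= 1.12 assumed; names and the empty-cluster handling are ours).

```python
# Sketch of the per-expert mean-absolute-deviation computation using two scatter-mean reductions.
import torch

def per_expert_mad(h: torch.Tensor, assign: torch.Tensor, num_experts: int) -> torch.Tensor:
    """h: (n, d) token representations; assign: (n,) expert ids. Returns (num_experts, d) dispersions."""
    d = h.shape[1]
    # First mean: per-cluster feature means.
    means = torch.zeros(num_experts, d, dtype=h.dtype, device=h.device)
    means.index_reduce_(0, assign, h, reduce="mean", include_self=False)
    # Second mean: per-cluster mean absolute deviation from the cluster mean.
    abs_dev = (h - means[assign]).abs()
    mad = torch.zeros(num_experts, d, dtype=h.dtype, device=h.device)
    mad.index_reduce_(0, assign, abs_dev, reduce="mean", include_self=False)
    return mad  # experts that receive no tokens keep their zero-initialized rows here

h = torch.randn(1024, 64)
assign = torch.randint(0, 16, (1024,))
print(per_expert_mad(h, assign, 16).shape)  # torch.Size([16, 64])
```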
5 RELATED WORK

Routing Methods. Recent studies have proposed token-expert assignment algorithms based on reinforcement learning (Bengio et al., 2015), deterministic hashing (Roller et al., 2021), optimal transport (Liu et al., 2022), linear programs (Lewis et al., 2021), cosine similarity (Chi et al., 2022), soft token mixing (Puigcerver et al., 2023), greedy top-k experts per token (Shazeer et al., 2017), and greedy top-k tokens per expert (Zhou et al., 2022b). Existing work has predominantly considered dot-products between inputs and experts as a suitable metric for similarity (Lewis et al., 2021; Puigcerver et al., 2023; Shazeer et al., 2017; Zhou et al., 2022b; Chi et al., 2022). This work continues with dot-product based learnable routing but computes the routing assignments in an adaptively transformed space to maximally identify the latent expert clusters.

MoE and Cluster Analysis. The MoE framework traces its roots back to Gaussian mixture models, where the input space is assumed divisible into separate regions with an expert specializing in each region (Jacobs et al., 1991). Recent studies show that the router can recover the clustering structure of the input space and that each expert specializes in a specific cluster (Dikkala et al., 2023; Chen et al., 2022). Our work leverages the clustering perspective on MoE to consider adaptive transformations of the input space to more easily distinguish latent clusters. We learn these transformations via feature-weighted cluster analysis, which has been studied in the clustering literature (Brusco & Cradit, 2001; Witten & Tibshirani, 2010; Gnanadesikan et al., 1995; Van Buuren & Heiser, 1989; Friedman & Meulman, 2004). Friedman & Meulman (2004) consider cluster-dependent feature weights to augment iterative clustering algorithms. Our approach similarly uses cluster-dependent feature weights but uses a different optimization problem to derive the optimal weights.

Robust MoE. The robustness of MoE architectures is a newly emerging research area. Puigcerver et al. (2022) provide the first study in this direction from the perspective of model capacity and the Lipschitz constant, finding conditions under which MoE models are provably more robust than their dense counterparts. Zhang et al. (2023) examine the effect of adversarial training and propose an alternating-optimization adversarial defence. Teo & Nguyen (2024) integrate heavy-ball momentum into SMoE to improve the model's stability and robustness. Our work differs from these approaches by examining the robustness of MoE models purely through the lens of the latent clustering structure of the input space. To the best of our knowledge, this is a novel lens on robustness in MoE models.
6 CONCLUSION AND FUTURE WORK

In this paper, we present the Adaptive Clustering (AC) router and ACMoE layer, a novel MoE routing method that computes token-expert assignments in a transformed space that maximally identifies latent clusters in the data and more easily discovers the best-matched expert for each token. We adaptively learn for each input which features are relevant to determining its latent cluster assignment and scale its features accordingly such that features that promote tight clustering are upweighted and features that produce dispersed clusters are downweighted. This transformation accentuates the relevant characteristics of each input according to the specialization of the experts, thereby allowing the router to more easily discover the optimal input-expert allocation. Our AC routing method enables faster convergence by improving the Hessian conditioning of the router and better robustness by increasing the separation of latent clusters in the transformed space. This approach makes no assumptions on the downstream task, requires no learnable parameters, and can be applied within any MoE architecture to boost performance on clean and contaminated data. A limitation of our method is that the AC router requires estimates of each token's cluster assignment. We obtain these by using the expert assignments in previous layers, which means we require the embedding size to remain the same between adjacent MoE layers. For ongoing work, we are investigating improved methods for estimating the latent cluster memberships without reliance on previous layers and with provable consistency guarantees.

ACKNOWLEDGMENTS AND DISCLOSURE OF FUNDING

This research / project is supported by the National Research Foundation Singapore under the AI Singapore Programme (AISG Award No: AISG2-TC-2023-012-SGIL). This research / project is supported by the Ministry of Education, Singapore, under the Academic Research Fund Tier 1 (FY2023) (A-8002040-00-00, A-8002039-00-00). This research / project is also supported by the NUS Presidential Young Professorship Award (A-0009807-01-00). Thanks to our anonymous reviewers, who provided valuable feedback which improved the paper substantially. Thanks also to Loi Xuan Ly for lending his eye for design.

Reproducibility Statement. Source code for our experiments is provided in the supplementary material. We provide the full details of our experimental setup, including datasets, model specification, training regime, and evaluation protocol for all experiments, in Appendix C. All datasets are publicly available.

Ethics Statement. Our work considers fundamental architectures, and in particular their robustness and convergence properties. Given this, we foresee no issues regarding fairness, privacy, or security, or any other harmful societal or ethical implications in general.

REFERENCES

Laziz Abdullaev and Tan Minh Nguyen. Transformer meets twicing: Harnessing unattended residual information. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=16kG5aNleS.

Dosovitskiy Alexey. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. BEiT: BERT pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021.

Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, and Doina Precup. Conditional computation in neural networks for faster models. arXiv preprint arXiv:1511.06297, 2015.
Michael J Brusco and J Dennis Cradit. A variable-selection heuristic for k-means clustering. Psychometrika, 66:249-270, 2001.

Iñigo Casanueva, Tadas Temčinas, Daniela Gerz, Matthew Henderson, and Ivan Vulić. Efficient intent detection with dual sentence encoders. arXiv preprint arXiv:2003.04807, 2020.

Zixiang Chen, Yihe Deng, Yue Wu, Quanquan Gu, and Yuanzhi Li. Towards understanding the mixture-of-experts layer in deep learning. Advances in Neural Information Processing Systems, 35:23049-23062, 2022.

Zewen Chi, Li Dong, Shaohan Huang, Damai Dai, Shuming Ma, Barun Patra, Saksham Singhal, Payal Bajaj, Xia Song, Xian-Ling Mao, et al. On the representation collapse of sparse mixture of experts. Advances in Neural Information Processing Systems, 35:34600-34613, 2022.

Damai Dai, Li Dong, Shuming Ma, Bo Zheng, Zhifang Sui, Baobao Chang, and Furu Wei. StableMoE: Stable routing strategy for mixture of experts. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 7085-7095, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.489. URL https://aclanthology.org/2022.acl-long.489.

Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1-22, 1977.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255. IEEE, 2009.

Nishanth Dikkala, Nikhil Ghosh, Raghu Meka, Rina Panigrahy, Nikhil Vyas, and Xin Wang. On the benefits of learning to route in mixture-of-experts models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 9376-9396, 2023.

Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. GLaM: Efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning, pp. 5547-5569. PMLR, 2022.

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1-39, 2022.

Jerome H Friedman and Jacqueline J Meulman. Clustering objects on subsets of attributes (with discussion). Journal of the Royal Statistical Society Series B: Statistical Methodology, 66(4):815-849, 2004.

Ram Gnanadesikan, Jon R Kettenring, and Shiao Li Tsao. Weighting and selection of variables for cluster analysis. Journal of Classification, 12:113-136, 1995.

Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

Yongxin Guo, Zhenglin Cheng, Xiaoying Tang, Zhaopeng Tu, and Tao Lin. Dynamic mixture of experts: An auto-tuning approach for efficient transformer models. arXiv preprint arXiv:2405.14297, 2024.

Xing Han, Tongzheng Ren, Tan Nguyen, Khai Nguyen, Joydeep Ghosh, and Nhat Ho. Designing robust transformers using robust kernel density estimation. Advances in Neural Information Processing Systems, 36, 2024.
Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8340-8349, 2021a.

Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15262-15271, 2021b.

Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79-87, 1991.

Kenichi Kumatani, Robert Gmyr, Felipe Cruz Salinas, Linquan Liu, Wei Zuo, Devang Patel, Eric Sun, and Yu Shi. Building a great multi-lingual teacher with sparsely-gated mixture of experts for speech recognition. arXiv preprint arXiv:2112.05820, 2021.

D Lepikhin, H Lee, Y Xu, D Chen, O Firat, Y Huang, M Krikun, and N Shazeer. GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020.

Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, and Luke Zettlemoyer. BASE layers: Simplifying training of large, sparse models. In International Conference on Machine Learning, pp. 6265-6274. PMLR, 2021.

Tianlin Liu, Joan Puigcerver, and Mathieu Blondel. Sparsity-constrained optimal transport. arXiv preprint arXiv:2209.15466, 2022.

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012-10022, 2021.

Jinwen Ma, Lei Xu, and Michael I Jordan. Asymptotic convergence rate of the EM algorithm for Gaussian mixtures. Neural Computation, 12(12):2881-2907, 2000.

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. stat, 1050(9), 2017.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.

Tam Minh Nguyen, César A Uribe, Tan Minh Nguyen, and Richard Baraniuk. PIDformer: Transformer meets control theory. In Forty-first International Conference on Machine Learning, 2024.

Tan Minh Nguyen, Tam Minh Nguyen, Nhat Ho, Andrea L. Bertozzi, Richard Baraniuk, and Stanley Osher. A primal-dual framework for transformers and neural networks. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=U_T8-5hClV.

Viet Dung Nguyen, Minh Nguyen Hoang, Luc Nguyen, Rachel Teo, Tan Minh Nguyen, and Linh Duy Tran. CAMEx: Curvature-aware merging of experts. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=nT2u0M0nf8.

Stefan Nielsen, Laziz Abdullaev, Rachel SY Teo, and Tan Nguyen. Elliptical attention. Advances in Neural Information Processing Systems, 37:109748-109789, 2025.

Quang Pham, Giang Do, Huy Nguyen, Trung Tin Nguyen, Chenghao Liu, Mina Sartipi, Binh T Nguyen, Savitha Ramasamy, Xiaoli Li, Steven Hoi, et al. CompeteSMoE: Effective training of sparse mixture of experts via competition. arXiv preprint arXiv:2402.02526, 2024.

Joan Puigcerver, Rodolphe Jenatton, Carlos Riquelme, Pranjal Awasthi, and Srinadh Bhojanapalli. On the adversarial robustness of mixture of experts. Advances in Neural Information Processing Systems, 35:9660-9671, 2022.
Joan Puigcerver, Carlos Riquelme, Basil Mustafa, and Neil Houlsby. From sparse to soft mixtures of experts. arXiv preprint arXiv:2308.00951, 2023.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1-67, 2020.

Carlos Riquelme, Joan Puigcerver, Basil Mustafa, Maxim Neumann, Rodolphe Jenatton, André Susano Pinto, Daniel Keysers, and Neil Houlsby. Scaling vision with sparse mixture of experts. Advances in Neural Information Processing Systems, 34:8583-8595, 2021.

Stephen Roller, Sainbayar Sukhbaatar, Jason Weston, et al. Hash layers for large sparse models. Advances in Neural Information Processing Systems, 34:17555-17566, 2021.

N Shazeer, A Mirhoseini, K Maziarz, A Davis, Q Le, G Hinton, and J Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. 2017.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631-1642, 2013.

Rachel Teo and Tan Minh Nguyen. MomentumSMoE: Integrating momentum into sparse mixture of experts. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=y929esCZNJ.

Rachel Teo and Tan Minh Nguyen. MoLEx: Mixture of layer experts for fine-tuning with sparse upcycling. In The Thirteenth International Conference on Learning Representations, 2025a. URL https://openreview.net/forum?id=rWui9vLhOc.

Rachel SY Teo and Tan Nguyen. Unveiling the hidden structure of self-attention via kernel principal component analysis. Advances in Neural Information Processing Systems, 37:101393-101427, 2025b.

Jonathan Uesato, Brendan O'Donoghue, Pushmeet Kohli, and Aaron Oord. Adversarial risk and the dangers of evaluating against weak attacks. In International Conference on Machine Learning, pp. 5025-5034. PMLR, 2018.

Stef Van Buuren and Willem J Heiser. Clustering n objects into k groups under optimal scaling of variables. Psychometrika, 54:699-706, 1989.

A Vaswani. Attention is all you need. Advances in Neural Information Processing Systems, 2017.

Daniela M Witten and Robert Tibshirani. A framework for feature selection in clustering. Journal of the American Statistical Association, 105(490):713-726, 2010.

Yihua Zhang, Ruisi Cai, Tianlong Chen, Guanhua Zhang, Huan Zhang, Pin-Yu Chen, Shiyu Chang, Zhangyang Wang, and Sijia Liu. Robust mixture-of-expert training for convolutional neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 90-101, 2023.

Daquan Zhou, Zhiding Yu, Enze Xie, Chaowei Xiao, Animashree Anandkumar, Jiashi Feng, and Jose M Alvarez. Understanding the robustness in vision transformers. In International Conference on Machine Learning, pp. 27378-27394. PMLR, 2022a.
Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M Dai, Quoc V Le, James Laudon, et al. Mixture-of-experts with expert choice routing. Advances in Neural Information Processing Systems, 35:7103-7114, 2022b.

Supplement to "Tight Clusters Make Specialized Experts"

Table of Contents

A Technical Proofs
  A.1 Proof of Theorem 1
  A.2 Proof of Proposition 1
    A.2.1 Proof of Lemma 1
    A.2.2 Proof of Lemma 2
  A.3 Proof of Proposition 2
B Implementation Procedure and Computational Efficiency
C Experimental Details and Additional Experiments
  C.1 Language Modeling
    C.1.1 Datasets
    C.1.2 Model, Optimizer, & Train Specification
  C.2 Image Classification
    C.2.1 Datasets and Attacks
    C.2.2 Model, Optimizer, & Train Specification
  C.3 Adversarial Attack at Higher Perturbation Budget
  C.4 Cluster Visualization
  C.5 Ablation Studies
    C.5.1 Measures of Dispersion
    C.5.2 Layer Placement
    C.5.3 Random Ablation
  C.6 Cluster Weight Mixing
  C.7 Adaptive Clustering Integration into Soft Mixture of Experts
  C.8 Image Classification in Swin Transformer Base Configuration
  C.9 Router Stability
  C.10 Dynamic Routing
D Broader Impact

A TECHNICAL PROOFS

A.1 PROOF OF THEOREM 1

To begin with, we present the following lemma to show the existence of constants \alpha_k for k \in [E] that satisfy Eqn. 7:

Lemma 3. For any \lambda > 0, Eqn. 7 has exactly d real solutions with respect to \alpha_k.

Proof of Lemma 3. Without loss of generality, assume that s_{1k} \geq s_{2k} \geq \dots \geq s_{dk}. Denote

\varphi(\alpha) = \sum_{q \in [d]} \frac{1}{s_{qk} + \alpha} - \frac{d}{\lambda}.    (12)

Then, the existence of solutions to Eqn. 7 is equivalent to the condition \varphi(\alpha) = 0. Note that \varphi(\alpha) is a strictly decreasing function in its connected continuity domains since

\varphi'(\alpha) = -\sum_{q \in [d]} \frac{1}{(s_{qk} + \alpha)^2} < 0    (13)

for all \alpha \in \mathbb{R} \setminus \{-s_{1k}, \dots, -s_{dk}\}. Further, we observe that

\lim_{\alpha \to -s_{qk}^{-}} \varphi(\alpha) = -\infty, \qquad \lim_{\alpha \to -s_{qk}^{+}} \varphi(\alpha) = +\infty    (14)

for all q \in [d], and

\lim_{\alpha \to \pm\infty} \varphi(\alpha) = -\frac{d}{\lambda} < 0.    (15)

Now consider the domain of continuity of \varphi(\alpha), namely (-\infty, -s_{1k}) \cup (-s_{1k}, -s_{2k}) \cup \dots \cup (-s_{dk}, \infty).
Due to the monotonicity and limits (14) and (15), there exists a unique solution in each of the intervals except for (-\infty, -s_{1k}), where the function is always strictly negative, thus yielding d roots in total.

Now we follow up with the main proof of this section.

Proof of Theorem 1. First, let I_k = \{i : r(i) = k\} for convenience. Now let us restate the clustering optimization problem (4) here once again:

\min_{\{w_k\}} Q(c, \{w_k\}_{k \in [E]}) = \sum_{k \in [E]} \frac{1}{N_k^2} \sum_{i,j \in I_k} \sum_{q \in [d]} \left( w_{qk}\rho_{ijq} + \frac{\lambda}{d}\log\frac{1/d}{w_{qk}} \right), \quad \text{such that } \sum_{q \in [d]} w_{qk} = 1, \ \forall k \in [E],    (16)

where we have immediately used the fact that

D_{KL}(u \,\|\, w_k) = \sum_{q \in [d]} \frac{1}{d}\log\frac{1/d}{w_{qk}}.    (17)

Also, note that

\sum_{q \in [d]} \left( w_{qk}\rho_{ijq} + \lambda\frac{1}{d}\log\frac{1/d}{w_{qk}} \right) = \sum_{q \in [d]} \left( w_{qk}\rho_{ijq} - \lambda\frac{1}{d}\log(d\,w_{qk}) \right) = \sum_{q \in [d]} \left( w_{qk}\rho_{ijq} - \frac{\lambda}{d}\log w_{qk} \right) - \lambda\log d.    (18)

We can ignore the term -\lambda\log d since it does not depend on the optimization variable. The method of Lagrange multipliers turns this constrained optimization problem into the following unconstrained counterpart:

\min_{\{w_k\}, \alpha} L(c, \{w_k\}_{k \in [E]}, \alpha) = \sum_{k \in [E]} \frac{1}{N_k^2} \sum_{i,j \in I_k} \sum_{q \in [d]} \left( w_{qk}\rho_{ijq} - \frac{\lambda}{d}\log w_{qk} \right) + \sum_{k \in [E]} \alpha_k \left( \sum_{q \in [d]} w_{qk} - 1 \right),

where \alpha = [\alpha_1, \dots, \alpha_E] is the vector of Lagrange multipliers. Note that the last optimization problem can be separated into the following E independent optimization subproblems:

\min_{w_k, \alpha_k} L_k(c, w_k, \alpha) = \frac{1}{N_k^2} \sum_{i,j \in I_k} \sum_{q \in [d]} \left( w_{qk}\rho_{ijq} - \frac{\lambda}{d}\log w_{qk} \right) + \alpha_k \left( \sum_{q \in [d]} w_{qk} - 1 \right),

for k \in [E]. Since the objective function is a positive combination of convex functions, the optimization problem is also convex. By setting the derivatives of L_k with respect to both optimization variables to 0, we obtain the following system of equations:

\frac{\partial L_k}{\partial w_{qk}} = s_{qk} - \frac{\lambda}{d}\frac{1}{w_{qk}} + \alpha_k = 0, \qquad \frac{\partial L_k}{\partial \alpha_k} = \sum_{q \in [d]} w_{qk} - 1 = 0

for all k \in [E], where s_{qk} is the data dispersion measure defined in the theorem statement. The first equation yields

w_{qk} = \frac{\lambda}{d}\,\frac{1}{s_{qk} + \alpha_k},    (19)

where \alpha_k is found from \sum_{q \in [d]} w_{qk} = 1, which in fact gives \sum_{q \in [d]} \frac{1}{s_{qk} + \alpha_k} = \frac{d}{\lambda} for all k \in [E], as desired.

A.2 PROOF OF PROPOSITION 1

Since Proposition 1 is a composition of Lemma 1 and Lemma 2, we proceed by providing their proofs.

A.2.1 PROOF OF LEMMA 1

Proof of Lemma 1. Notice that we can expand the inequality in Lemma 1 as

\sum_{i \in [d]} m_i\,\delta\mu_i^2 \geq \sum_{i \in [d]} \delta\mu_i^2,    (20)

where we let \delta\mu = \mu_b - \mu_a. Since the entries of M_a are mean-scaled, we can rewrite them as

m_i = \frac{d\, m'_i}{\sum_{j \in [d]} m'_j}    (21)

for some initial dispersion estimates \{m'_j\}_{j \in [d]}. Without loss of generality, assume that [d'] is the set of dimension indices for which the dispersions are relatively much smaller than those in the rest of the dimensions, in the sense that m'_i \gg m'_j for any i \in [d'] and j \in [d] \setminus [d']. Then, there exists a positive \alpha \leq 1/2 such that \sum_{i \in [d']} m_i > d - \alpha and \sum_{i \in [d] \setminus [d']} m_i < \alpha. By the assumption that clusters are best-separated along the features for which they cluster tightly, this means that the weight matrix M_a maximizes the contribution of the largest d' terms in \sum_{i \in [d]} m_i\,\delta\mu_i^2, corresponding to individual feature-wise distances in the dimensions where the feature dispersions are the smallest, instead of giving uniform weights to all dimensions, which leads to the inequality in Eqn. 20.

A.2.2 PROOF OF LEMMA 2

Proof of Lemma 2. Since we use the L2 distance between the token h and \mu_c as a similarity metric, we assign cluster g_{k^*} to the token h iff \|h - \mu_{k^*}\| \leq \|h - \mu_k\|. Assume that the token h' is a noisy observation of an underlying true token h which actually originates from cluster g_{k^*}. Then, the token h' can be decomposed as h' = h + \epsilon for a random noise \epsilon \sim \mathcal{N}(0, \Sigma_\epsilon). Now define the decision variable D(h') = \|h' - \mu_{k^*}\|^2 - \|h' - \mu_k\|^2, which turns the clustering condition into D(h') \leq 0 for the cluster g_{k^*}.
A.2 PROOF OF PROPOSITION 1

Since Proposition 1 is a composition of Lemma 1 and Lemma 2, we proceed by providing their proofs.

A.2.1 PROOF OF LEMMA 1

Proof of Lemma 1. Notice that we can expand inequality (1) as
$$\sum_{i \in [d]} m_i\, \delta\mu_i^2 \;\geq\; \sum_{i \in [d]} \delta\mu_i^2,$$
where we let $\delta\mu = \mu_b - \mu_a$. Since the entries of $M_a$ are mean-scaled, we can rewrite them as
$$m_i = \frac{d\, m_i'}{\sum_{j \in [d]} m_j'} \qquad (21)$$
for some initial dispersion-based estimates $\{m_j'\}_{j \in [d]}$. Without loss of generality, assume that $[d']$ is the set of dimension indices for which the dispersions are relatively much smaller than those in the rest of the dimensions, in the sense that $m_i' \gg m_j'$ for any $i \in [d']$ and $j \in [d] \setminus [d']$. Then, there exists a positive $\alpha \leq 1/2$ such that $\sum_{i \in [d']} m_i > d - \alpha$ and $\sum_{i \in [d] \setminus [d']} m_i < \alpha$. By the assumption that clusters are best-separated along the features for which they cluster tightly, this means that the weight matrix $M_a$ maximizes the contribution of the largest $d'$ terms in $\sum_{i \in [d]} m_i \delta\mu_i^2$, corresponding to the individual feature-wise distances in the dimensions where the feature dispersions are smallest, instead of giving uniform weights to all dimensions, which leads to inequality (1).

A.2.2 PROOF OF LEMMA 2

Proof of Lemma 2. Since we use the $L_2$ distance between a token and the cluster centroid $\mu_c$ as a similarity metric, we assign cluster $g_k$ to the token $h'$ iff $\|h' - \mu_k\| \leq \|h' - \mu_{k'}\|$. Assume that the token $h'$ is a noisy observation of an underlying true token $h$ which actually originates from cluster $g_k$. Then, the token $h'$ can be decomposed as $h' = h + \epsilon$ for a random noise $\epsilon \sim \mathcal{N}(0, \Sigma_\epsilon)$. Now define the decision variable $D(h') = \|h' - \mu_{k'}\|^2 - \|h' - \mu_k\|^2$, which turns the clustering condition into $D(h') \geq 0$ for the cluster $g_k$.

Let us analyze the decision variable $D$ as a random variable whose randomness may come from the underlying sampling strategy and the noise. Note that
$$D(h') = \|h + \epsilon - \mu_{k'}\|^2 - \|h + \epsilon - \mu_k\|^2 = \|h - \mu_{k'}\|^2 - \|h - \mu_k\|^2 + 2(\mu_k - \mu_{k'})^\top \epsilon = D(h) + 2\,\delta\mu^\top \epsilon, \qquad (22)$$
where $\delta\mu = \mu_k - \mu_{k'}$. Due to the assumption that $h$ is drawn from the distribution of $g_k$, it can be rewritten as $h = \mu_k + \nu$ with $\nu \sim \mathcal{N}(0, \Sigma_k)$. Then for the first term in Eqn. 22, we have
$$D(h) = \|h - \mu_{k'}\|^2 - \|h - \mu_k\|^2 = \delta\mu^\top (2h - \mu_{k'} - \mu_k) = \delta\mu^\top (2\nu + \delta\mu) = 2\,\delta\mu^\top \nu + \|\delta\mu\|^2. \qquad (23)$$
Substituting this back into Eqn. 22, we get
$$D(h') = 2\,\delta\mu^\top (\nu + \epsilon) + \|\delta\mu\|^2. \qquad (24)$$
This shows that $D(h') \sim \mathcal{N}\left( \|\delta\mu\|^2,\; 4\,\delta\mu^\top (\Sigma_k + \Sigma_\epsilon)\,\delta\mu \right)$. Since $D(h')$ follows a normal distribution with the derived parameters, the probability that $h'$ is assigned to the correct cluster is given by
$$\Pr(\text{correct cluster}) = \Pr(D(h') \geq 0) = \Phi\left( \frac{\|\delta\mu\|^2}{2\sqrt{\delta\mu^\top (\Sigma_k + \Sigma_\epsilon)\,\delta\mu}} \right),$$
where $\Phi$ denotes the CDF of the standard normal distribution as usual. Since $\Phi$ is an increasing function, the probability that the noisy token $h'$ is assigned to the correct cluster increases with the distance between the cluster centroids and decreases with the covariance of the cluster and of the additive noise. On the other hand, for the incorrect clustering probability, we have
$$\Pr(\text{incorrect cluster}) = 1 - \Phi\left( \frac{\|\delta\mu\|^2}{2\sqrt{\delta\mu^\top (\Sigma_k + \Sigma_\epsilon)\,\delta\mu}} \right),$$
as claimed.

A.3 PROOF OF PROPOSITION 2

Proof of Proposition 2. Let the router be given by $g$, with the softmax routing function $g_\theta : \mathbb{R}^d \to \mathbb{R}^d$ parameterized by the expert embeddings $\{e_i\}_{i \in [E]}$. The network loss depends on the expert embeddings only through the router function $g$. We shall explore the exclusive contribution of each expert embedding in minimizing $\mathcal{L}_{\mathrm{ACMoE}}$. In order to do this, we look at the network loss as a scalar function of the $i$-th expert embedding vector while treating all other network parameters as fixed. Then, we can write $\mathcal{L}_{\mathrm{ACMoE}} : \mathbb{R}^d \to \mathbb{R}$ such that $\mathcal{L}_{\mathrm{ACMoE}} = \mathcal{L}_{\mathrm{ACMoE}}(g_\theta(e_i))$. For simplicity, we shall omit the subscript $\theta$. The gradient that comes from back-propagation is then given by
$$\nabla_{e_i} \mathcal{L}_{\mathrm{ACMoE}} = (\nabla_g \mathcal{L}_{\mathrm{ACMoE}})^\top \nabla_{e_i} g, \qquad (27)$$
where $\nabla_{e_i} g \in \mathbb{R}^{d \times d}$ denotes the Jacobian matrix of $g$, since for $g_k = (g_\theta(e_i))_k$ we can write
$$\frac{\partial \mathcal{L}_{\mathrm{ACMoE}}(g_1, \dots, g_d)}{\partial e_{is}} = \sum_k \frac{\partial \mathcal{L}_{\mathrm{ACMoE}}}{\partial g_k} \frac{\partial g_k}{\partial e_{is}}. \qquad (28)$$
Note that for $g_k = \mathrm{softmax}(h^\top M e_k)$, we have
$$\frac{\partial g_k}{\partial e_{is}} = m_s h_s\, g_k(\delta_{ki} - g_i) = m_s h_s\, b_{ki}. \qquad (29)$$
Then, the element of the Hessian matrix of the network loss at index $(s,t) \in [d] \times [d]$ can be written as
$$H^{(i)}_{st}(\mathcal{L}_{\mathrm{ACMoE}}) = \frac{\partial^2 \mathcal{L}_{\mathrm{ACMoE}}}{\partial e_{is}\,\partial e_{it}} = \sum_k\left(\sum_j \frac{\partial^2 \mathcal{L}_{\mathrm{ACMoE}}}{\partial g_k\,\partial g_j}\,\frac{\partial g_j}{\partial e_{it}}\,\frac{\partial g_k}{\partial e_{is}} + \frac{\partial \mathcal{L}_{\mathrm{ACMoE}}}{\partial g_k}\,\frac{\partial^2 g_k}{\partial e_{is}\,\partial e_{it}}\right) = m_s h_s\, m_t h_t\, B_i, \qquad (30)$$
where, by Eqn. 29, every term in the sum carries the common factor $m_s h_s m_t h_t$ (the second derivative of the softmax contributes the same factor), and $B_i$ is some constant that depends only on the index $i$. Due to Eqn. 30, the Hessian takes the following matrix form:
$$H^{(i)} = B_i\,(Mh)(Mh)^\top. \qquad (31)$$
Taking the expectation of both sides, we obtain
$$\mathbb{E}_{h \sim (\mu, \Sigma)}[H^{(i)}] = B_i\, \mathbb{E}_{h \sim (\mu,\Sigma)}[M (h h^\top) M] = B_i\, M \Sigma M, \qquad (32)$$
where we assume $h$ is centered. Now recall that $M = \mathrm{diag}(m_1, \dots, m_d)$, where for each $i$, $m_i \propto 1/\sqrt{\Sigma_{ii}}$ holds. Assume that the covariance matrix $\Sigma$ is symmetric positive definite. Then, it is diagonalizable as $\Sigma = U \Lambda U^\top$ with $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_d)$, a diagonal matrix of the eigenvalues of $\Sigma$. With the transformation $M$, we get
$$M \Sigma M = M U \Lambda U^\top M \approx U (M \Lambda M) U^\top = U\, \mathrm{diag}(m_1^2 \lambda_1, \dots, m_d^2 \lambda_d)\, U^\top, \qquad (33)$$
where the approximation holds when the eigenvectors of $\Sigma$ are close to the coordinate basis vectors, as assumed in Lemma 4 below. Since the eigenvalues capture the variances along the principal components of the covariance matrix, $m_i^2$, as a reciprocal of a measure of dimension-wise dispersion, is reasonably correlated with $1/\lambda_i$, as demonstrated by Lemma 4, implying $\lambda_j \geq \lambda_i \Rightarrow m_j \leq m_i$ with high probability. Therefore, we obtain that
$$\kappa(M \Sigma M) = \frac{\lambda_{\max}(M\Sigma M)}{\lambda_{\min}(M\Sigma M)} \approx \frac{m_{\min}^2\, \lambda_{\max}(\Sigma)}{m_{\max}^2\, \lambda_{\min}(\Sigma)} \leq \kappa(\Sigma), \qquad (35)$$
which implies the claim.
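As a quick numerical sanity check on this conclusion, and not part of the proof, the following sketch builds a synthetic covariance whose eigenvectors are small perturbations of the coordinate axes, sets $m_i \propto 1/\sqrt{\Sigma_{ii}}$ with mean scaling as above, and compares $\kappa(M\Sigma M)$ with $\kappa(\Sigma)$. All numbers and the perturbation strength are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# covariance with strongly varying per-feature scales and a small off-axis rotation
scales = np.linspace(0.1, 5.0, d)
A = np.eye(d) + 0.05 * rng.standard_normal((d, d))  # near-identity mixing
Q, _ = np.linalg.qr(A)                               # eigenvectors close to the coordinate basis
Sigma = Q @ np.diag(scales**2) @ Q.T

# M = diag(m_1, ..., m_d) with m_i proportional to 1 / sqrt(Sigma_ii), mean-scaled
m = 1.0 / np.sqrt(np.diag(Sigma))
m = d * m / m.sum()
M = np.diag(m)

def cond(S):
    w = np.linalg.eigvalsh(S)          # eigenvalues of a symmetric matrix
    return w.max() / w.min()

print(f"kappa(Sigma)     = {cond(Sigma):.1f}")
print(f"kappa(M Sigma M) = {cond(M @ Sigma @ M):.1f}")  # expected to be much smaller
```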
Lemma 4 (Correlation between dimension-wise variances and covariance eigenvalues). Let $\{b_i\}_{i \in [d]}$ be the set of normalized basis vectors of $\mathbb{R}^d$. Consider a symmetric positive definite covariance matrix $\Sigma$ and its unit eigenvectors $\{v_i\}_{i \in [d]}$. Assume that each eigenvector $v_i$ is a reasonably small perturbation of the basis vector $b_i$, in the sense that $(v_i^\top b_i)^2 \geq 1 - \epsilon$ for all $i \in [d]$ and a small constant $\epsilon > 0$. Then, for all $i \in [d]$, we have
$$|\lambda_i - \Sigma_{ii}| \leq \epsilon \max_{j \neq i} |\lambda_i - \lambda_j|, \qquad (36)$$
where $\{\lambda_i\}_{i \in [d]}$ is the set of ordered eigenvalues of $\Sigma$ corresponding to the eigenvectors $\{v_i\}_{i \in [d]}$.

Proof of Lemma 4. Note that each diagonal element of the SPD covariance matrix $\Sigma$ can be written as
$$\Sigma_{ii} = b_i^\top \Sigma b_i = b_i^\top \left( \sum_{j \in [d]} \lambda_j v_j v_j^\top \right) b_i = \sum_{j \in [d]} \lambda_j (v_j^\top b_i)^2. \qquad (37)$$
Then, the difference on the left-hand side of Eqn. 36 can be bounded as
$$|\lambda_i - \Sigma_{ii}| = \left| \lambda_i - \sum_{j \in [d]} \lambda_j (v_j^\top b_i)^2 \right| = \left| \lambda_i \left( 1 - (v_i^\top b_i)^2 \right) - \sum_{j \neq i} \lambda_j (v_j^\top b_i)^2 \right| = \left| \lambda_i \sum_{j \neq i} (v_j^\top b_i)^2 - \sum_{j \neq i} \lambda_j (v_j^\top b_i)^2 \right| \qquad (38)$$
$$= \left| \sum_{j \neq i} (\lambda_i - \lambda_j)(v_j^\top b_i)^2 \right| \leq \max_{j \neq i} |\lambda_i - \lambda_j| \sum_{j \neq i} (v_j^\top b_i)^2 = \max_{j \neq i} |\lambda_i - \lambda_j| \left( 1 - (v_i^\top b_i)^2 \right) \leq \epsilon \max_{j \neq i} |\lambda_i - \lambda_j|, \qquad (39)$$
where we used the fact that $\sum_{j \in [d]} (v_j^\top b_i)^2 = \left( \sum_{j=1}^d (v_j^\top b_i) v_j \right)^\top \left( \sum_{k=1}^d (v_k^\top b_i) v_k \right) = b_i^\top b_i = 1$ to obtain Eqn. 38 and Eqn. 39, since the eigenvectors of $\Sigma$ are orthonormal.

B IMPLEMENTATION PROCEDURE AND COMPUTATIONAL EFFICIENCY

Training and Inference. Since the AC routing scheme requires the expert assignment per token from the previous layer, we can only implement AC routing from the second layer onwards. We incorporate AC routing into both the training and inference stages. Firstly, AC routing is designed to offer improvements on both clean and contaminated data, so even when train and test data are completely clean, it is advantageous to incorporate the AC method into both stages. Secondly, data contamination is commonly encountered at test time and may well be present during training too. Therefore, in the interest of robustness as well, AC routing is incorporated into both stages.

Computational Efficiency. Computing the required $\{w_k\}_{k \in [E]}$ for $E$ experts requires no learnable parameters and is obtained simply by computing the mean absolute deviation for each set of tokens assigned to the $k$-th expert. This can be computed using just two computations of the mean (once for the per-cluster mean and once again for the per-cluster mean of the absolute deviations), done in parallel over all clusters using torch.index_reduce(), and is of the order $O(2nd) = O(nd)$ for $n$ tokens, i.e., linear in the number of tokens. Hence the upper-bound time complexity of the MoE layer is unaffected. We provide in Table 7 additional efficiency analysis in terms of throughput, max GPU memory allocated, and parameter count, which shows no significant efficiency loss compared to baseline MoE architectures. A sketch of this computation is given below.
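The per-expert dispersion computation described above can be sketched as follows. This is a minimal illustration rather than the released implementation: the function name, token tensor, assignment vector, and the final mean-scaled inverse-MAD weighting are placeholder choices, and the exact transformation used by the AC router follows Definition 1 in the main text.

```python
import torch

def expert_mad_weights(tokens: torch.Tensor, expert_ids: torch.Tensor,
                       num_experts: int, eps: float = 1e-6) -> torch.Tensor:
    """Per-expert mean absolute deviation (MAD) via two index_reduce_ passes.

    tokens:     (n, d) token representations
    expert_ids: (n,)   expert assignment of each token (from the previous layer)
    Returns:    (E, d) feature weights, one diagonal transformation per expert.
    """
    n, d = tokens.shape
    # 1) per-cluster mean
    means = torch.zeros(num_experts, d, dtype=tokens.dtype)
    means.index_reduce_(0, expert_ids, tokens, "mean", include_self=False)
    # 2) per-cluster mean of absolute deviations (the MAD dispersion s_{qk})
    abs_dev = (tokens - means[expert_ids]).abs()
    mad = torch.zeros(num_experts, d, dtype=tokens.dtype)
    mad.index_reduce_(0, expert_ids, abs_dev, "mean", include_self=False)
    # 3) illustrative weighting: mean-scaled inverse dispersion (tight features upweighted)
    inv = 1.0 / (mad + eps)
    return d * inv / inv.sum(dim=1, keepdim=True)

# toy usage
tokens = torch.randn(32, 8)
expert_ids = torch.randint(0, 4, (32,))
w = expert_mad_weights(tokens, expert_ids, num_experts=4)
print(w.shape, w.sum(dim=1))  # (4, 8); each row sums to d
```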
C EXPERIMENTAL DETAILS AND ADDITIONAL EXPERIMENTS

C.1 LANGUAGE MODELING

C.1.1 DATASETS

WikiText-103. The WikiText-103 dataset (www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/) contains around 268K words, and its training set consists of about 28K articles with 103M tokens. This corresponds to text blocks of about 3600 words. The validation and test sets consist of 60 articles with 218K and 246K tokens respectively.

EnWik-8. The EnWik-8 dataset is a byte-level dataset of 100 million bytes derived from Wikipedia that, in addition to English text, also includes markup, special characters, and text in other languages. EnWik-8 contains 90M characters for training, 5M for validation, and 5M for testing.

Stanford Sentiment Treebank-2. The Stanford Sentiment Treebank-2 (SST2) (Socher et al., 2013) is a 2-class corpus with fully labeled parse trees for analysis of the compositional effects of sentiment in language. The dataset consists of 11,855 single sentences extracted from movie reviews. It was parsed with the Stanford parser and includes 215,154 unique phrases from the parse trees, each annotated by 3 human judges.

Stanford Sentiment Treebank-5. The Stanford Sentiment Treebank-5 (SST5) (Socher et al., 2013) is a 5-class dataset used for sentiment analysis. It consists of 11,855 single sentences extracted from movie reviews. It includes 215,154 unique phrases from parse trees, each annotated by 3 human judges. Phrases are classified as negative, somewhat negative, neutral, somewhat positive, or positive.

Banking-77. Banking-77 (B77) (Casanueva et al., 2020) is a highly fine-grained 77-class classification dataset comprising 13,083 customer service queries labelled with 77 intents.

C.1.2 MODEL, OPTIMIZER, & TRAIN SPECIFICATION

Models. We use as backbones the Switch Transformer (Fedus et al., 2022) and the Generalist Language Model (GLaM) (Du et al., 2022). Table 8 contains the specification of self-attention (SA) layers, feed-forward network (FFN) layers, Mixture-of-Experts (MoE) layers, attention span (Att. Span), embedding size, and parameter count for both backbones at small and medium configurations for each pretraining task. All backbones use 16 experts with top-2 expert routing.

Table 8: Language Modeling Backbone Specifications
Model            SA Layers   FFN Layers   MoE Layers   Att. Span   Embed Size   Params
WikiText-103 Pretrain
Switch-small     3           -            3            256         128          70M
Switch-medium    6           -            6            1024        352          216M
GLaM-small       6           3            3            2048        144          79M
GLaM-medium      12          6            6            2048        352          220M
EnWik-8 Pretrain
Switch           8           -            8            2048        352          36M

Optimizer. All experiments use Adam with a base learning rate of 0.0007. Small configurations use 3000 iterations of learning rate warmup, while medium configurations use 4000 iterations (see the sketch at the end of this subsection).

Pretrain Specification. For WikiText-103 pretraining, small Switch backbones are trained for 40 epochs with a batch size of 96, and medium Switch backbones are trained for 80 epochs with a batch size of 48. Small GLaM backbones are trained for 60 epochs with a batch size of 48, and medium GLaM backbones are trained for 120 epochs with a batch size of 48. We use a 0.01 auxiliary load balancing loss. For EnWik-8 pretraining, both Switch and GLaM backbones are trained for 80 epochs with batch size 48. We use a 0.01 auxiliary load balancing loss.

Finetune Specification. For SST2 and SST5 finetuning, we finetune for 5 epochs using Adam with a base learning rate of 0.001, no warmup, and a batch size of 16. For B77, we finetune for 50 epochs using Adam with a base learning rate of 0.00001, no warmup, and a batch size of 16.

Compute Resources. All models are trained, evaluated, and finetuned on four NVIDIA A100 SXM4 40GB GPUs.
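For concreteness, here is a minimal sketch of the pretraining optimizer setup described above. The decay behavior after warmup is not specified in this appendix, so the constant post-warmup schedule below is an illustrative assumption, as are the placeholder model and the small-configuration warmup length.

```python
import torch

model = torch.nn.Linear(352, 352)    # placeholder for a Switch/GLaM backbone
base_lr, warmup_iters = 7e-4, 3000   # small configurations use 3000 warmup iterations

optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)

# linear warmup to the base learning rate, then constant (assumed; decay not specified here)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda it: min(1.0, (it + 1) / warmup_iters)
)

for it in range(5):                                        # training loop skeleton
    optimizer.zero_grad()
    loss = model(torch.randn(8, 352)).pow(2).mean()        # dummy loss
    loss.backward()
    optimizer.step()
    scheduler.step()
```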
C.2 IMAGE CLASSIFICATION

C.2.1 DATASETS AND ATTACKS

ImageNet-1K. We use the full ImageNet dataset, which contains 1.28M training images and 50K validation images. The model learns to predict the class of the input image among 1000 categories. We report the top-1 and top-5 accuracy on all experiments.

ImageNet-A/O/R. ImageNet-A (Hendrycks et al., 2021b) contains real-world adversarially filtered images that fool current ImageNet classifiers. A 200-class subset of the original ImageNet-1K's 1000 classes is selected so that errors among these 200 classes would be considered egregious, and these classes cover most broad categories spanned by ImageNet-1K. ImageNet-O (Hendrycks et al., 2021b) contains adversarially filtered examples for ImageNet out-of-distribution detectors. The dataset contains samples from ImageNet-22K but not from ImageNet-1K, where samples that are wrongly classified as an ImageNet-1K class with high confidence by a ResNet-50 are selected. ImageNet-R (Hendrycks et al., 2021a) contains various artistic renditions of object classes from the original ImageNet dataset, a style of image that the original ImageNet collection deliberately discouraged. ImageNet-R contains 30,000 image renditions for 200 ImageNet classes, where a subset of the ImageNet-1K classes is chosen.

Adversarial Attacks. We produce corrupted ImageNet samples using the white-box attacks fast gradient sign method (FGSM) (Goodfellow et al., 2014) and projected gradient descent (PGD) (Madry et al., 2017), and the black-box simultaneous perturbation stochastic approximation (SPSA) (Uesato et al., 2018). FGSM and PGD use a perturbation budget of 1/255, while SPSA uses a perturbation budget of 1. All attacks perturb under the $\ell_\infty$ norm. PGD uses 20 steps with a step size of 0.15, and SPSA uses 20 iterations.

C.2.2 MODEL, OPTIMIZER, & TRAIN SPECIFICATION

Models. Our results are based on the Swin Transformer (Liu et al., 2021) architecture. This backbone uses 4 base layers of depth 2, 2, 18, and 2. The first two base layers each contain 2 self-attention layers and 2 feed-forward layers. The third base layer contains 18 self-attention layers with alternating feed-forward and MoE layers. The final base layer contains 2 self-attention layers with one feed-forward and one MoE layer. The embedding dimension is 96 and the heads per base layer are 3, 6, 12, and 24. We use 16 total experts and present results for both top-1 and top-2 expert routing. The total parameter count is 280M.

Optimizer. We use AdamW with a base learning rate of 1.25e-4, a minimum learning rate of 1.25e-7, 0.1 weight decay, and cosine scheduling.

Train Specification. We train for 60 epochs with a batch size of 128 and a 0.1 auxiliary balancing loss.

Compute Resources. All models are trained and evaluated on four NVIDIA A100 SXM4 40GB GPUs.

C.3 ADVERSARIAL ATTACK AT HIGHER PERTURBATION BUDGET

Figure 3: ACMoE and Swin Transformer under PGD attack at increasing perturbation budgets. ACMoE widens its performance gain over Swin under increasingly severe attacks in both top-1 test accuracy (left) and top-5 test accuracy (right), starting at approximately 7% improvement at 1/255 and ending at just over 10% at 5/255.

Figure 3 shows that for PGD perturbation budgets from 1/255 through 5/255, ACMoE widens its already substantial robust performance gain over Swin, with top-1 and top-5 test accuracy improvements increasing from 7% to approximately 10%.

C.4 CLUSTER VISUALIZATION

We pass random ImageNet batches through Swin and ACMoE and plot the token representations along with their assigned experts, using t-SNE to project the high-dimensional data into 2 dimensions (see the sketch after this paragraph). The result is shown in Fig. 4: Swin learns overlapping and largely indistinguishable expert clusters, whereas ACMoE produces much clearer, better-separated clusters.
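A minimal sketch of how such a visualization can be produced follows. The feature extraction step is a placeholder, since collecting token representations and their expert assignments is model-specific (e.g., via a forward hook on the MoE layer), and the perplexity setting is an arbitrary illustrative choice.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# placeholders: token representations entering an MoE layer and their assigned experts
tokens = np.random.randn(2048, 96)            # (n_tokens, embed_dim), e.g. collected via a forward hook
expert_ids = np.random.randint(0, 16, 2048)   # router's top-1 expert per token

# project to 2D and color each token by its expert assignment
emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(tokens)
plt.scatter(emb[:, 0], emb[:, 1], c=expert_ids, cmap="tab20", s=4)
plt.title("Token representations colored by assigned expert")
plt.savefig("expert_clusters_tsne.png", dpi=200)
```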
Figure 4: Cluster visualization on ImageNet. Each token is represented as a point and colored by its assigned expert. Left: Swin identifies one cluster clearly (yellow/gold) but otherwise fails to distinguish the remaining clusters. Right: ACMoE learns better-defined expert clusters.

Table 9: Ablation on Measure of Spread in Switch Transformer (Fedus et al., 2022)
Measure of Spread   Test PPL (↓)
Variance            34.87
MAD                 34.42

Table 10: Ablation on Layer Placement in Switch Transformer (Fedus et al., 2022)
Layer Placement   Test PPL (↓)
Back Half         34.95
Alternating       34.80
Skip 1            34.42
Full              34.88

C.5 ABLATION STUDIES

C.5.1 MEASURES OF DISPERSION

We present in Tables 9 and 11 results for Switch-ACMoE and Swin-ACMoE when changing the measure of dispersion used in the AC routing transformation (Definition 1) from mean absolute deviation (MAD) to variance. We see that mean absolute deviation outperforms variance as a measure of spread. This is an intuitive finding, given that squared distances, as used in variance computations, are highly sensitive to outliers. Using mean absolute deviation as an alternative measure of spread reduces this issue and produces a more robust estimate of dispersion. We note that MAD is not the only robust measure of spread. We conjecture that the interquartile range, as another robust measure of spread, may also produce good results on both clean and contaminated data. We leave this interesting direction to future research, however, since the interquartile range poses implementation challenges: it requires designing concurrent linear scans over the expert clusters. MAD, by contrast, requires just two computations of the mean, which is easily parallelizable using torch.index_reduce().

C.5.2 LAYER PLACEMENT

We consider the effect of layer placement in the Switch-medium configuration and in the Swin Transformer (see Sections C.1.2 and C.2.2 for the full model specifications). In particular, Switch is a 6-layer model and Swin is a 24-layer model. With regard to Swin, we implement our ACMoE layers within the deepest block of depth 18. This is due to the change in embedding size between base layers, which restricts us to this base layer of depth 18. Note further that Swin only uses MoE layers in an alternating pattern, with feed-forward networks between each MoE layer. For example, for Switch, a full ACMoE specification means placing ACMoE on layers 2, 3, 4, 5, 6. For Swin, a full specification means placing ACMoE on layers 4, 6, 8, 10, 12, 14, 16, 18.

Table 11: Ablation on Measure of Spread in Swin Transformer
Model                          Measure of Spread   Top 1 Acc.   Top 5 Acc.
Swin-Top1 (Liu et al., 2021)   Variance            75.06        92.49
Swin-Top1 (Liu et al., 2021)   MAD                 75.39        92.56
Swin-Top2 (Liu et al., 2021)   Variance            76.11        93.08
Swin-Top2 (Liu et al., 2021)   MAD                 76.31        93.14

Table 12: Ablation on Layer Placement in Swin Transformer
Model                          Layer Placement   Top 1 Acc.   Top 5 Acc.
Swin-Top1 (Liu et al., 2021)   Back Half         75.16        92.46
Swin-Top1 (Liu et al., 2021)   Skip 2            75.34        92.42
Swin-Top1 (Liu et al., 2021)   Skip 1            75.35        92.45
Swin-Top1 (Liu et al., 2021)   Full              75.39        92.56
Swin-Top2 (Liu et al., 2021)   Back Half         76.16        93.02
Swin-Top2 (Liu et al., 2021)   Skip 2            76.10        92.93
Swin-Top2 (Liu et al., 2021)   Skip 1            76.29        92.98
Swin-Top2 (Liu et al., 2021)   Full              76.31        93.14

To examine the effect of layer placement, we consider the following models:

Alternating: For Switch, we place ACMoE on layers 2, 4, 6. For Swin, we place ACMoE on layers 4, 8, 12, 16.
Back Half: For Switch, we place ACMoE on just the last 3 layers of the network. For Swin, we place ACMoE on just the last 5 layers of the network.
Skip 2: For Swin, we place ACMoE on layers 8, 10, 12, 14, 16, 18.
Skip 1: For Switch, we place ACMoE on layers 3, 4, 5, 6.
For Swin, we place ACMoE on layers 6, 8, 10, 12, 14, 16, 18.
Full: We place ACMoE on every possible layer.

We present in Tables 10 and 12 results for Switch and Swin ACMoE models when changing the positions of the ACMoE layers throughout the network. The results agree with our expectation that, generally speaking, more ACMoE layers improve performance, but in some circumstances a threshold is met when ACMoE layers are used too early in the network, before the model has been able to learn reasonably good approximations of the cluster membership of the tokens. We find that in the Switch backbone, performance improves the more ACMoE layers we add, which agrees with our expectation. However, top performance is attained when allowing two standard MoE layers before the first ACMoE layer, as opposed to the minimum of one standard MoE layer. We conjecture this is because the model needs a few layers before the first ACMoE layer to learn decent representations, so that the estimated cluster assignments used in the ACMoE layer are good enough. Encouragingly, we find just one additional standard MoE layer is sufficient for the benefits of ACMoE to be obtained.

We find in Table 12 that with Swin, best performance is obtained using ACMoE on every possible layer, again agreeing with our expectation that more ACMoE layers improve performance. With Swin, however, we do not face any drop in performance from placing ACMoE early in the network, and indeed we see Full attaining top performance. We conjecture that Swin does not encounter this issue since it uses four layers of feed-forward networks before the first MoE layer, so by the first MoE layer the representations are of reasonably good quality and produce good estimates of the cluster membership.

C.5.3 RANDOM ABLATION

We show the efficacy of the adaptive clustering transformation M (Definition 1) in our AC router at capturing meaningful feature-wise information by ablating it against an alternate d × d diagonal matrix made up of normal random variables with mean 1 and standard deviation 0.5 (where we clip any negative values to prevent negative weights). We present in Tables 13 and 14 results for language modeling (using Switch) and image classification (using Swin), which show fairly substantial drops in performance in both backbones. This offers evidence for the claim that our AC routing transformation meaningfully weights features to improve routing, and that the performance gains of our proposed method do not flow from a kind of implicit regularization introduced by adding noise to the router.

Table 13: Random Ablation in Switch (Fedus et al., 2022)
Model                                Test PPL (↓)
Switch-Random (Fedus et al., 2022)   38.17
Switch-ACMoE                         34.42

Table 14: Random Ablation in Swin (Liu et al., 2021)
Model         Top 1 Acc.   Top 5 Acc.
Swin-Random   74.22        91.87
Swin-ACMoE    76.31        93.14

C.6 CLUSTER WEIGHT MIXING

The AC routing scheme estimates the cluster membership of each token based on its highest affinity cluster assigned in the previous layer. We could also further leverage the top-k structure of MoE models by mixing the cluster-wise feature weights according to the affinities in the top-k routing. For example, if h has affinity scores $\alpha$ and $1-\alpha$ to clusters k and k′ respectively, then we could obtain the required AC routing transformation for h as the mixture $\alpha M_k + (1-\alpha) M_{k'}$.
This approach factors in the confidence with which we believe h belongs to cluster k or k′, and can be used for integrating ACMoE into backbones with higher expert granularity (i.e., higher top-k settings). Tables 15 and 16 show results for computing the transformation by mixing the top-affinity cluster weights (Mix 2) in Switch and GLaM with top-2 routing, versus our presented results, which use just the highest affinity cluster (Mix 1). We see that GLaM-ACMoE benefits substantially from cluster weight mixing, whereas Switch-ACMoE prefers just using its top affinity cluster weights. For consistency across models, we present the Mix 1 results in our main body: GLaM-ACMoE already performs extremely strongly using Mix 1, so we opt for the added performance gain in the Switch backbone.

Table 15: Results on Cluster Weight Mixing in Switch (Fedus et al., 2022)
Clusters Mixed   Test PPL (↓)
Mix 2            34.66
Mix 1            34.42

Table 16: Results on Cluster Weight Mixing in GLaM (Du et al., 2022)
Clusters Mixed   Test PPL (↓)
Mix 2            35.29
Mix 1            36.26

C.7 ADAPTIVE CLUSTERING INTEGRATION INTO SOFT MIXTURE OF EXPERTS

We present here results for integrating ACMoE into Soft MoE (Puigcerver et al., 2023). To use ACMoE in the Soft MoE setting, which can be understood as a top-E routing setting where all experts are active for every token, we compute the routing transformation using cluster weight mixing (Section C.6) over the top-8 highest affinity clusters. We present the performance of Soft-ACMoE on clean data, adversarially attacked data, and ImageNet-A/O/R in Tables 17 and 18.

Table 17: Test Accuracy on ImageNet corrupted by PGD, FGSM, and SPSA using the Soft MoE (Puigcerver et al., 2023) backbone
Model                                Clean Top 1   Clean Top 5   PGD Top 1   PGD Top 5   FGSM Top 1   FGSM Top 5   SPSA Top 1   SPSA Top 5
Soft MoE (Puigcerver et al., 2023)   72.86         90.92         45.29       78.91       56.95        85.60        66.59        88.70
Soft-ACMoE (Ours)                    73.21         91.23         48.25       80.49       59.01        86.69        70.63        93.22

We see in Tables 17 and 18 the efficacy of ACMoE in the Soft MoE backbone, offering evidence of the adaptability of our framework to further MoE setups. In particular, the Soft MoE framework models a setting in which expert clusters are highly overlapping, as each token is soft-assigned to all experts. The performance gains of Soft-ACMoE on clean and contaminated data therefore demonstrate that our AC router is well-suited to modeling such a clustering structure.

Table 18: Test Accuracy on Image Classification on ImageNet-A/O/R using the Soft MoE (Puigcerver et al., 2023) backbone
Model                                Im-A Top-1 Acc. (↑)   Im-R Top-1 Acc. (↑)   Im-O AUPR (↑)
Soft MoE (Puigcerver et al., 2023)   6.69                  31.63                 17.97
Soft-ACMoE (Ours)                    6.93                  32.18                 18.35

C.8 IMAGE CLASSIFICATION IN SWIN TRANSFORMER BASE CONFIGURATION

We further evaluate the performance of ACMoE when scaling up model size in Table 19. We integrate ACMoE into the Base configuration of Swin (0.5B parameters) and evaluate on clean ImageNet-1K as well as under adversarial attacks.

Table 19: Test Accuracy on ImageNet corrupted by PGD, FGSM, and SPSA using the Swin Base (Liu et al., 2021) backbone
Model                          Clean Top 1   Clean Top 5   PGD Top 1   PGD Top 5   FGSM Top 1   FGSM Top 5   SPSA Top 1   SPSA Top 5
Swin-Base (Liu et al., 2021)   79.06         94.37         44.61       79.20       59.91        87.72        68.94        89.00
Swin-ACMoE-Base (Ours)         79.25         94.42         46.28       80.24       61.78        87.55        70.18        89.33

C.9 ROUTER STABILITY

We present in Fig. 5 the routing stability of ACMoE, SMoE, XMoE, and StableMoE in the Switch backbone evaluated on WikiText-103.
Routing instability computes, over adjacent layers, the proportion of tokens that are assigned to different experts across the two layers. Specifically, for n tokens $[h_1, \dots, h_n]$, we compute at layer $\ell$ the matrix $S^\ell \in \mathbb{R}^{n \times n}$ such that $S^\ell_{ij} = 1$ if the $i$-th and $j$-th tokens are assigned to the same expert in layer $\ell$, and 0 otherwise. The router instability at layer $\ell$ can then be calculated as $r^\ell = \mathrm{mean}(|S^{\ell-1} - S^\ell|)$. This metric therefore captures the degree to which tokens that are assigned to the same experts remain together through the model (see the sketch at the end of this subsection). A high $r^\ell$ indicates the router does not maintain consistent expert assignments, as tokens it considers semantically similar at one layer it considers different at the next.

Figure 5: Router Instability of ACMoE, SMoE, XMoE, and StableMoE. ACMoE maintains consistent routing, while baseline routers more frequently change the expert assignments of tokens.

In Fig. 5, we see that baseline routers reach high levels of instability: in the case of SMoE and StableMoE, at the last layer over 60% of tokens are assigned to a different expert. ACMoE, by contrast, maintains a more consistent, stable assignment through the model, with no more than 20% of tokens changing expert assignment across any layer.
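A minimal sketch of the instability metric $r^\ell$ described above follows, assuming a list of per-layer top-1 expert assignment vectors is available; the function name and the toy inputs are illustrative.

```python
import torch

def routing_instability(expert_ids_per_layer):
    """expert_ids_per_layer: list of (n,) tensors, the top-1 expert of each token at each layer.
    Returns r^l = mean(|S^{l-1} - S^l|) for each pair of adjacent layers."""
    def coassignment(ids):
        # S_ij = 1 if tokens i and j share an expert at this layer, else 0
        return (ids.unsqueeze(0) == ids.unsqueeze(1)).float()

    S = [coassignment(ids) for ids in expert_ids_per_layer]
    return [(S[l - 1] - S[l]).abs().mean().item() for l in range(1, len(S))]

# toy usage: 3 MoE layers, 64 tokens, 16 experts
layers = [torch.randint(0, 16, (64,)) for _ in range(3)]
print(routing_instability(layers))  # one instability value per adjacent-layer pair
```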
C.10 DYNAMIC ROUTING

We further test the compatibility of our Adaptive Clustering routing scheme with dynamic top-p routing. In this setting, rather than routing each token to its top-k highest affinity experts in each MoE layer, we route each token to all experts whose affinity exceeds a certain threshold p. This setting permits activating more or fewer experts for different tokens at different layers throughout the model, therefore dynamically assigning experts to tokens. We integrate our AC routing directly into this setting using the same setup as in Section 3, where the AC routing transformation is computed based on the estimated cluster membership of each token using the top affinity assignment of the previous layer. We present the results for the Switch Transformer on WikiText-103 language modeling in Table 20.

Table 20: Results on Top-p Dynamic Routing in Switch Backbone (Fedus et al., 2022)
Model                                        Test PPL (↓)
Fixed top-k routing (Shazeer et al., 2017)
Switch-medium (Fedus et al., 2022)           35.48
ACMoE-medium (Ours)                          34.42
Dynamic top-p routing (Guo et al., 2024)
Switch-Fixed p                               35.20
Switch-ACMoE-Fixed p (Ours)                  34.14
Switch-Learnable p                           34.29
Switch-ACMoE-Learnable p (Ours)              33.49

For fixed p, we set p = 0.05. For learnable p, we initialize the parameter to 0.05. We select this initialization as it reproduces approximately similar performance to the Switch backbone under default top-2 routing, thereby aiding direct comparison between fixed top-k and dynamic top-p routing. We see that in the dynamic routing setting, ACMoE maintains the same consistent improvement over the Switch baseline of roughly 1 full PPL point. These results suggest ACMoE is well-suited to the dynamic routing setting.

D BROADER IMPACT

Our research offers benefits to Mixture-of-Experts (MoE) architectures in both clean and contaminated settings. In particular, our work offers socially beneficial outcomes with regard to defense against adversarial attacks, which we hope can be used to protect important AI systems from malicious actors. Furthermore, as large language models, many of which are built on MoE backbones, continue to proliferate and be used in important societal settings, we hope our improved robustness to data contamination can help this promising technology continue to grow and improve in realistic settings with noisy training and evaluation data. Our research also shows substantially faster convergence than comparative baselines. We believe this faster convergence can deliver significant social benefit by reducing the energy requirements of large model training, thereby helping to ease the growing environmental burden of AI training runs. We recognize there will always be a risk of misuse with AI systems; however, we hope that our work can be used to enhance and protect socially beneficial AI while also decreasing the environmental impact of this technology. We furthermore hope that our research can spur others to continue building robust and efficient AI for social good.