Published as a conference paper at ICLR 2025

ULTRA-SPARSE MEMORY NETWORK

Zihao Huang*, Qiyang Min*, Hongzhi Huang*, Defa Zhu, Yutao Zeng, Ran Guo, Xun Zhou
Seed-Foundation-Model Team, ByteDance
{huangzihao.notabot,minqiyang,huanghongzhi.51,zhudefa,yutao.zeng,guoran.94,zhouxun}@bytedance.com

*Equal contribution.

ABSTRACT

It is widely acknowledged that the performance of Transformer models is logarithmically related to their number of parameters and computational complexity. While approaches like Mixture of Experts (MoE) decouple parameter count from computational complexity, they still face challenges in inference due to high memory access costs. This work introduces UltraMem, which incorporates a large-scale, ultra-sparse memory layer to address these limitations. Our approach significantly reduces inference latency while maintaining model performance. We also investigate the scaling laws of this new architecture, demonstrating that it not only exhibits favorable scaling properties but outperforms MoE. In our experiments, the largest UltraMem we train has 20 million memory slots. The results show that our method achieves state-of-the-art inference speed and model performance within a given computational budget, paving the way for billions of slots or experts.

1 INTRODUCTION

[Figure 1: We ensured that the three models have the same computation, and that MoE and UltraMem have the same parameters. (a) Validation loss vs. consumed tokens (billions); (b) inference time vs. batch size; (c) memory access (GB) vs. batch size, where UltraMem reduces access by 13.0x-18.8x relative to MoE. The x-axis is plotted on a logarithmic scale. In (b) and (c), the sequence length is 1 because during decoding we can only predict one token at a time, and the key/value cache length is 2048. The experiments in (b) and (c) are conducted on an A100-SXM-80GB.]

Recent advancements in natural language processing (NLP), driven by Large Language Models (LLMs) (Radford et al., 2019; Brown, 2020), require exponentially more computational resources as they scale, posing challenges in resource-limited environments such as real-time applications. To address these computational issues, Mixture of Experts (MoE) (Fedus et al., 2022; Jiang et al., 2024) and Product Key Memory (PKM) (Lample et al., 2019) have been introduced. MoE selectively activates parameters, boosting training efficiency, but it slows inference due to increased memory access. PKM maintains consistent memory access with fewer value embeddings, but its performance is significantly worse than MoE's. As shown in Figure 1(b), an MoE model, despite having the same computational cost and twelve times more parameters than a dense model, runs 2 to 6 times slower in inference, depending on the batch size. This slowdown, as depicted in Figure 1(c), stems from high memory access demands, highlighting MoE's inefficiency in inference scenarios. The primary challenge is how to match or even surpass the effectiveness of the MoE model while keeping memory access at levels comparable to those of dense models. In this paper, we introduce UltraMem, an architecture that builds upon and extends the concepts from PKM.
UltraMem incorporates large-scale, ultra-sparse memory layers that significantly enhance computational efficiency and reduce inference latency while maintaining or even improving model performance across various benchmarks. This architecture not only supports the deployment of highly effective language models in resource-constrained environments but also opens up new avenues for constructing even larger models without the previously prohibitive costs. In summary, we make the following contributions:

1. UltraMem is greatly enhanced compared to PKM and outperforms MoE at the same scale. Unlike PKM, UltraMem truly possesses the prerequisites for training large-scale models on extensive computational resources and has undergone comprehensive experimental validation.
2. UltraMem has a significantly lower memory access cost during inference than MoE. Under common inference batch sizes, it can be up to 6 times faster than MoE with the same parameters and computation. The inference speed of UltraMem is almost identical to that of a dense model with equivalent computational resources.
3. We verify the scaling ability of UltraMem. Like MoE, UltraMem scales strongly, and we observe stronger scaling ability than MoE.

2 RELATED WORK

Mixture of Experts. Shazeer et al. (2017) proposed MoE, and Fedus et al. (2022) introduced MoE into large language models, where each token selects one expert at a time, thereby increasing model parameters without increasing computation. Rajbhandari et al. (2022) introduced the concept of shared experts, where each token utilizes some fixed experts along with some unique experts. Subsequent research has focused on improving the gating functions of MoE, including token choice (Chi et al., 2022), non-trainable token choice (Roller et al., 2021), and expert choice (Zhou et al., 2022), primarily to address the issue of expert imbalance. Liu et al. (2024) and Dai et al. (2024) opted to slice the experts into smaller segments while activating more experts per token, achieving significant performance improvements. A concurrent study (Krajewski et al., 2024) meticulously explored the benefits of granularity and of increasing the number of experts, alongside investigating the scaling laws associated with MoE. In this paper, we use fine-grained MoE as our baseline, with the granularity of the MoE set to 2. This means that each expert is half the size of the original Multi-Layer Perceptron (MLP), with two experts activated per token.

Large Memory Layer. Lample et al. (2019) first introduced the concept of a large memory layer, called PKM, which can be seen as slicing MoE experts down to the smallest possible configuration. Kim & Jung (2020) introduced a concept similar to shared experts in MoE, allowing PKM and the MLP to operate in parallel. Csordás et al. (2023) made a slight modification to PKM by removing the Softmax operation. PEER (He, 2024) improved the activation of values in PKM by activating a small expert with an inner dimension of 1, achieving significant performance gains. However, current research on PKM is limited to smaller models, and even the latest improved versions of PKM only outperform MoE in certain scenarios. Additionally, current PKM variants do not possess characteristics suitable for large-scale training. We address these issues in this paper.

Tensor decomposition. Tensor decomposition breaks a tensor down into a series of small matrices or tensors.
In deep learning research, such methods are commonly used to approximate a large tensor during training, saving computation and parameters. Product quantization (Jegou et al., 2010) breaks a vector into multiple sub-vectors, allowing the original vector to be reconstructed from a smaller number of sub-vectors, thereby reducing model parameters. Bershatsky et al. (2024) initialize several matrices and a core tensor, train these parameters during the fine-tuning phase, and reconstruct the original large tensor via Tucker decomposition at the end of training to reduce training costs. We borrow this insight to improve PKM's key retrieval.

[Figure 2: An overview of the multilayer perceptron (MLP) and the large memory layer (LML). For brevity, we omit the third top-m operation from the memory layer. An MLP typically consists of two linear layers and a GeLU activation. We consider the weights of the first linear layer as keys and those of the second linear layer as values. The LML uses row and column keys to determine a 2-D logical address to index memory values, whereas the MLP uses a 1-D logical address. "Fetch values" refers to retrieving values based on the indices with higher scores.]

3.1 PRELIMINARY

Here we first introduce the original large memory layer (LML) based on product keys, which serves as the foundation for our proposed approach. The concept of a product-key memory layer (PKM) was first explored in prior work (Lample et al., 2019). In that approach, the authors incorporated an external memory module into language models, with the goal of expanding the model's parameters while maintaining a similar level of computational complexity. The overall structure is depicted in Figure 2(b). A memory layer generally consists of two parts: keys $K \in \mathbb{R}^{N \times D_k}$ and values $V \in \mathbb{R}^{N \times D_v}$. To retrieve information from the memory values, a query vector $q \in \mathbb{R}^{D_k}$ finds the most relevant values by multiplying the keys to obtain scores; the higher a score, the stronger the influence of the corresponding value. This process can be formulated as:

$$s = \sigma(Kq), \qquad o = V^\top s, \tag{1}$$

where $s$ denotes the scores, $\sigma$ is a non-linear activation, and $o$ is the output. Attention layers, which memorize context, and MLP layers, which memorize world knowledge, also follow this formulation, with $\sigma$ being Softmax in attention layers and GeLU in MLP layers (Geva et al., 2020) (see Figure 2(a)).
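To make equation 1 and the top-m sparsification concrete, the following is a minimal PyTorch sketch of a single-token memory-layer lookup. All sizes, and the choice of Softmax for $\sigma$, are illustrative assumptions rather than the paper's implementation:

```python
import torch
import torch.nn.functional as F

# Minimal sketch of equation 1 with top-m sparsification (illustrative sizes).
N, D_k, D_v, m = 65_536, 128, 128, 32

K = torch.randn(N, D_k) / D_k ** 0.5   # keys, one row per memory slot
V = torch.randn(N, D_v) / D_v ** 0.5   # values, the actual "memory"
q = torch.randn(D_k)                   # query for a single token

scores = K @ q                         # s = sigma(Kq): every key is scored
top_s, top_idx = scores.topk(m)        # but only m values are activated
w = F.softmax(top_s, dim=-1)           # PKM's choice of sigma (later removed)
o = w @ V[top_idx]                     # o = V^T s, restricted to top-m rows
```

Note that even though only m value rows are read, all N key rows must still be scored; product keys remove exactly this O(N * D_k) cost.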
Product-key memory layers scale up the memory size to $N > 10^6$ while activating only the few values with top-m scores, where $m$ is a hyper-parameter controlling sparsity. Though the values are sparsely accessed, the keys, which are as numerous as the values, must be fully computed to obtain the scores before the top-m activation of equation 1. To alleviate the computational complexity of the keys, product keys are proposed. Borrowing the idea of product quantization, they use a 2-D logical address (see Figure 2(b)), typically an $n \times n$ grid where $n = \sqrt{N}$, for memory value retrieval. Specifically, a 2-D logical address $(i, j)$ indexes the memory value at physical address $n \cdot i + j$. With this strategy, the logical scores form a matrix, which is decomposed as the addition of row and column scores:

$$s_{\text{row}} = \sigma_{\text{TopM}}(K_{\text{row}} q_{\text{row}}(x)), \qquad s_{\text{col}} = \sigma_{\text{TopM}}(K_{\text{col}} q_{\text{col}}(x)), \tag{2}$$
$$S_{\text{grid}} = \sigma_{\text{TopM}}(s_{\text{row}} + s_{\text{col}}^\top), \qquad o = V^\top \text{Softmax}(\text{vec}(S_{\text{grid}})), \tag{3}$$

where $K_{\text{row}}, K_{\text{col}} \in \mathbb{R}^{n \times D_k}$, $q_{\text{row}}, q_{\text{col}} : \mathbb{R}^{D_i} \to \mathbb{R}^{D_k}$ convert the input hidden state $x \in \mathbb{R}^{D_i}$ to row and column queries, $\sigma_{\text{TopM}}(\cdot)$ preserves the top-m largest elements of its input and sets the rest to negative infinity, and the matrix addition with unmatched shapes is implemented by element broadcasting. Note that removing $\sigma_{\text{TopM}}$ from equation 2 does not change the result; the only reason for applying top-m to the row and column scores is to reduce the computation of the final top-m operation on $S_{\text{grid}}$. As $s_{\text{row}}$ and $s_{\text{col}}$ each have only $m$ activated scores, $S_{\text{grid}}$ has only $m^2$ candidates for the top-m operation rather than $N$; i.e., the top-m complexity is reduced from $O(N \log m)$ to $O((\sqrt{N} + m^2) \log m)$. Note that $S_{\text{grid}}$ undergoes a Softmax operation akin to the one employed in the self-attention mechanism. Moreover, PKM adopts the multi-head mechanism from the self-attention module, using multiple key sets to retrieve the shared values; we denote the number of PKM heads by $H$.
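Under the same illustrative assumptions as the previous sketch, the two-phase top-m of equations 2-3 can be sketched as follows: two sqrt(N)-sized key sets are scored instead of N keys, and the final top-m searches only m^2 grid candidates:

```python
import torch
import torch.nn.functional as F

# Sketch of product-key retrieval (equations 2-3); sizes are illustrative.
n, D_k, D_v, m = 256, 128, 128, 32       # memory size N = n * n
V = torch.randn(n * n, D_v)

K_row = torch.randn(n, D_k) / D_k ** 0.5
K_col = torch.randn(n, D_k) / D_k ** 0.5
q_row, q_col = torch.randn(D_k), torch.randn(D_k)

s_row, i_row = (K_row @ q_row).topk(m)   # row top-m: only n = sqrt(N) keys
s_col, i_col = (K_col @ q_col).topk(m)   # column top-m

grid = s_row[:, None] + s_col[None, :]   # S_grid: m*m candidates, not n*n
top_s, flat = grid.flatten().topk(m)
r, c = flat // m, flat % m
phys = i_row[r] * n + i_col[c]           # 2-D address (i, j) -> n*i + j

o = F.softmax(top_s, dim=-1) @ V[phys]   # equation 3, restricted to top-m
```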
3.2 STRUCTURE IMPROVEMENTS

Improving PKM with a bag of tricks. We first studied the structure of PKM and found that a series of minor adjustments steadily improves the model's performance: 1) We remove the Softmax operation in equation 3, which is well established in prior studies (Shen et al., 2023; Csordás et al., 2023). 2) We apply Layer Normalization (LN) (Ba et al., 2016) to the query and keys for training stability. 3) PKM suggests a constant learning rate of 0.001 for the values, much higher than the learning rate for the other parameters; we found that gradually decaying the value learning rate provides further benefits. 4) PKM uses a linear layer to generate the query; we add a causal depthwise convolutional layer (Howard, 2017) before this linear layer to enhance the query. 5) Similar to Grouped-Query Attention (Ainslie et al., 2023), we share the query between the two key sets, which halves the computational cost of generating the query with little performance impact. 6) By halving $D_v$, we double the number of values; keeping the activated value parameters unchanged, this increases the diversity of activated values and further improves the model. To make the output consistent with the hidden dimension, we add a linear layer on the aggregated output.

UltraMem overall structure. We then take a deeper look at the model structure and propose UltraMem. Figure 3 shows the PKM structure and our improved UltraMem structure, both based on a Pre-LayerNorm Transformer architecture. PKM replaces the MLP in one of the deeper layers with a memory layer, or operates in parallel with it (Kim & Jung, 2020). We notice three drawbacks of PKM:

[Figure 3: Overall structure of PKM and UltraMem. PKM attaches a single large memory layer to one transformer layer, while UltraMem distributes multiple smaller UltraMem blocks across the transformer layers.]

1. As the number of values $N$ increases significantly, it becomes harder for queries to find the correct values.
2. The product-key decomposition introduces a bias on the retrieval topology. For example, if $(i, j)$ is the logical address of the top-1 score, then the top-2 score must be located in row $i$ or column $j$, which significantly limits the diversity of the top-m selection.
3. Unbalanced multi-GPU computation and communication arise during large-scale parameter training, as the full model parameters cannot be placed on a single GPU.

To alleviate problems 1 and 3, we decompose the large memory layer into multiple smaller memory layers distributed at fixed intervals across the transformer layers. This skip-layer structure additionally allows us to overlap the execution of the memory layer with the transformer layers, as the memory layer is predominantly memory-bound during training.

[Figure 4: Flow of Tucker Decomposed Query-Key Retrieval (TDQKR); here r = 2. "Fetch" refers to retrieving scores based on a given index (corresponding to torch.gather). TDQKR replaces product quantization and serves as a more precise retrieval module for recalling value indices in UltraMem. Each step of the TDQKR process is referenced in the main text.]

Tucker Decomposed Query-Key Retrieval (TDQKR). We explore a more expressive multiplicative approach to alleviate problems 1 and 2, adopting a Tucker decomposition (Malik & Becker, 2018) in place of product quantization. The whole TDQKR process is illustrated in Figure 4. Specifically, the Tucker decomposition estimates the grid scores with a rank-r matrix product:

$$S_{\text{row}} = K_{\text{row}} q_{\text{row}}(x), \qquad S_{\text{col}} = K_{\text{col}} q_{\text{col}}(x), \tag{4}$$
$$S_{\text{grid}} = \sigma_{\text{TopM}}(S_{\text{row}}^\top C S_{\text{col}}), \tag{5}$$

where $S_{\text{row}}, S_{\text{col}} \in \mathbb{R}^{r \times n}$ and $C \in \mathbb{R}^{r \times r}$ is the Tucker core, a learnable parameter with random initialization. To produce the $r \times n$ shaped row and column scores, the dimensions of the query and keys are reshaped, giving $K_{\text{row}}, K_{\text{col}} \in \mathbb{R}^{r \times n \times (D_k/r)}$ and $q_{\text{row}}, q_{\text{col}} \in \mathbb{R}^{r \times (D_k/r)}$, corresponding to Figure 4, step 1.

However, equation 5 is inefficient to apply directly in practice, as its top-m operation cannot be simplified with an equivalent two-phase top-m technique the way product quantization can. We therefore propose an approximate top-m algorithm. The key is a rank-1 approximation of the Tucker core, so that the overall top-m can be approximated by:

$$C \approx u t^\top, \qquad \sigma_{\text{TopM}}(S_{\text{row}}^\top C S_{\text{col}}) \approx \sigma_{\text{TopM}}((u^\top S_{\text{row}})^\top (t^\top S_{\text{col}})), \tag{6}$$

where $u, t \in \mathbb{R}^{r \times 1}$. Since $u^\top S_{\text{row}}$ and $t^\top S_{\text{col}} \in \mathbb{R}^{1 \times n}$ are row vectors, the two-phase top-m technique applies to the approximated objective $\sigma_{\text{TopM}}((u^\top S_{\text{row}})^\top (t^\top S_{\text{col}}))$, corresponding to Figure 4, step 3. Overall, we conduct the approximate top-m on the row and column scores to filter out non-top elements, then use the exact objective in the final top-m on $S_{\text{grid}}$, keeping the indexed scores precise:

$$C \approx u t^\top, \tag{7}$$
$$\tilde{S}_{\text{row}} = I_{\text{TopM}}(u^\top S_{\text{row}}) \odot S_{\text{row}}, \tag{8}$$
$$\tilde{S}_{\text{col}} = I_{\text{TopM}}(t^\top S_{\text{col}}) \odot S_{\text{col}}, \tag{9}$$
$$S_{\text{grid}} = \sigma_{\text{TopM}}(\tilde{S}_{\text{row}}^\top C \tilde{S}_{\text{col}}), \tag{10}$$

where $I_{\text{TopM}}(\cdot)$ is a binary-valued function that maps top-m elements to 1 and all others to 0. Equations 8 and 9 correspond to Figure 4, steps 4 and 5, and equation 10 corresponds to steps 6 and 7. For the rank-1 approximation, we use the Singular Value Decomposition (SVD) (Abdi, 2007) to factorize the Tucker core, with $u$ and $t$ being the left and right singular vectors corresponding to the leading singular value (Figure 4, step 2).

Last but not least, the approximation error becomes a concern when the non-maximum singular values are as large as the maximum one. To mitigate this, an auxiliary loss that manages the approximation error is introduced during training by constraining the non-maximum singular values:

$$C = U \Lambda T^\top \quad \text{(by SVD)}, \tag{11}$$
$$\mathcal{L}_{\text{aux}} = \alpha \sum_{i=2}^{r} \left( \max(0, \lambda_i - \tau) \right)^2, \tag{12}$$

where $\Lambda$ denotes the singular values of $C$ in descending order, $\tau$ serves as a margin to prevent $C$ from degenerating into a rank-1 matrix, and $\alpha$ is the loss coefficient.
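A single-token sketch of the TDQKR pipeline (equations 4-10) may help fix ideas; it assumes r divides D_k, uses illustrative sizes, and omits batching, heads, and the auxiliary loss:

```python
import torch

# Sketch of Tucker Decomposed Query-Key Retrieval (equations 4-10), r = 2.
n, D_k, r, m = 256, 128, 2, 32

K_row = torch.randn(r, n, D_k // r)      # reshaped keys (Figure 4, step 1)
K_col = torch.randn(r, n, D_k // r)
q_row = torch.randn(r, D_k // r)         # reshaped queries
q_col = torch.randn(r, D_k // r)
C = torch.randn(r, r)                    # learnable tucker core

S_row = torch.einsum('rnd,rd->rn', K_row, q_row)   # equation 4, shape (r, n)
S_col = torch.einsum('rnd,rd->rn', K_col, q_col)

# Step 2: rank-1 approximation of the core via SVD, C ~ u t^T (up to a
# positive scale, which does not change the top-m ranking).
U, _, Vh = torch.linalg.svd(C)
u, t = U[:, 0], Vh[0, :]

# Step 3: approximate two-phase top-m on the merged 1-D scores (equation 6).
_, i_row = (u @ S_row).topk(m)
_, i_col = (t @ S_col).topk(m)

# Steps 4-7: exact rank-r scores, but only on the filtered m x m grid
# (equations 8-10): grid[a, b] = S_row[:, i_row[a]]^T C S_col[:, i_col[b]].
grid = torch.einsum('ra,rs,sb->ab', S_row[:, i_row], C, S_col[:, i_col])
top_s, flat = grid.flatten().topk(m)
phys = i_row[flat // m] * n + i_col[flat % m]      # final value indices
```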
[Figure 5: Flow of Implicit Value Expansion (IVE); here E = 4, m = 16. IVE reduces memory access and scales up memory size by expanding the memory table virtually. Each virtual block is a reparameterization of the physical memory table. Every virtual memory address corresponds to a physical memory address and a projector index. The weighted sum pooling is grouped by the virtual blocks, followed by a linear layer that produces the final output.]

Implicit Value Expansion (IVE). Although the values are sparsely used, maintaining a large memory table is still costly during training due to the large amount of memory access. To reduce memory access as well as scale up the memory size, we propose virtual memory as an implicit value expansion. Given a virtual expansion rate $E > 1$, virtual memory expands the memory table to $E$ times its size. We design the virtual memories as multiple reparameterizations of the original memory values $V$, which we call physical memory. $E$ linear projectors $\{W_p \mid p \in [1, E],\ W_p \in \mathbb{R}^{D_v \times D'_v}\}$ are used, and the virtual memory block $\tilde{V}_p$ corresponding to the $p$-th reparameterization is defined as:

$$\tilde{V}_p = V W_p. \tag{13}$$

The overall virtual memory is then the concatenation of the virtual blocks, $\tilde{V} = [\tilde{V}_1^\top, \tilde{V}_2^\top, \ldots, \tilde{V}_E^\top]^\top$. Note that the dimension of the virtual values $D'_v$ need not match the dimension of the physical values $D_v$. Applying the virtual memory is straightforward: the memory table is replaced by $\tilde{V}$, and the key size is expanded by a factor of $E$ to fit the virtual memory size. Moreover, we suggest a random shuffle of the virtual memory to eliminate an unnecessary index-topology prior introduced by the row and column scoring. Concretely, if the virtual memory tables were simply concatenated, each memory value and its expansions would share the same column in the logical address space, and could thus be chosen simultaneously more often than warranted.

A naive reparameterization of the virtual memory still introduces substantial computation, namely $E \cdot N \cdot D_v \cdot D'_v$, and $E$ times the GPU memory access. A better idea is to compute the reparameterization on demand. That is, we expand the logical address into triplets $(i, j, p)$, where $(i, j)$ is the original logical address and $p$ indexes the virtual memory block, and then conduct the sum pooling and the virtual-value computation simultaneously. Consequently, equation 3 is rewritten as:

$$\hat{s} = \text{Shuffle}(\text{vec}(S_{\text{grid}})), \tag{14}$$
$$o = \tilde{V}^\top \hat{s} = \sum_p \tilde{V}_p^\top \hat{s}_p = \sum_p W_p^\top V^\top \hat{s}_p, \tag{15}$$

where $\hat{s}_p$ denotes the scores corresponding to the $p$-th virtual memory block. With equation 15, we can first look up and pool the values according to the virtual block index, and then transform the pooled physical values directly into pooled virtual values. This trick reduces the extra computation from $E \cdot N \cdot D_v \cdot D'_v$ to $E \cdot B \cdot D_v \cdot D'_v$, where $B$ is the number of tokens in the batch, and incurs almost no extra GPU memory access beyond the linear projectors. Figure 5 shows the flow of IVE.
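The on-demand trick of equation 15 can be sketched as follows: pool in physical value space first, then project, so the E-times-larger virtual table is never materialized. Names and sizes here are illustrative assumptions, and a real implementation would replace the per-block loop with a segment reduction:

```python
import torch

# Sketch of IVE's on-demand reparameterization (equation 15).
N, D_v, D_virt, E, m = 65_536, 64, 64, 4, 32

V = torch.randn(N, D_v)               # physical memory table
W = torch.randn(E, D_v, D_virt)       # one projector W_p per virtual block

# Retrieval yields m triplets (physical index, virtual block index, score).
idx = torch.randint(N, (m,))
blk = torch.randint(E, (m,))
s_hat = torch.softmax(torch.randn(m), dim=-1)

o = torch.zeros(D_virt)
for p in range(E):                    # grouped weighted-sum pooling (Figure 5)
    sel = blk == p
    if sel.any():
        pooled = s_hat[sel] @ V[idx[sel]]   # V^T s_p: reduce physically first
        o = o + pooled @ W[p]               # W_p^T (V^T s_p): project once
```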
Multi-Core Scoring (MCS). PKM shares a single score across the $D_v$ dimensions of each value. Empirically, assigning multiple scores to a single value has been shown to enhance performance. We therefore rewrite the Tucker core $C$ as a sum of component cores, $C = \sum_{i=1}^{h} C^{(i)}$, and employ $\{C^{(i)}\}_{i=1}^{h}$ to generate individual score maps $S^{(i)}_{\text{tucker}} = S_{\text{row}}^\top C^{(i)} S_{\text{col}}$. Clearly,

$$S_{\text{tucker}} = S_{\text{row}}^\top \Big(\sum_i C^{(i)}\Big) S_{\text{col}} = \sum_i S_{\text{row}}^\top C^{(i)} S_{\text{col}} = \sum_i S^{(i)}_{\text{tucker}}. \tag{16}$$

We keep the top-m operation on the aggregated score $S_{\text{tucker}}$, while applying the individual scores $S^{(i)}_{\text{tucker}}$ to a vertically split value table $V = [V^{(1)}, \ldots, V^{(h)}]$ with $V^{(i)} \in \mathbb{R}^{N \times (D_v/h)}$, i.e.,

$$o = [\hat{s}^{(1)\top} V^{(1)}, \ldots, \hat{s}^{(h)\top} V^{(h)}]. \tag{17}$$

When this technique is combined with IVE, we split the physical memory values instead of the virtual memory values to preserve the equivalence in equation 15.

Improved initialization. PKM initializes the values from a Gaussian distribution $\mathcal{N}(0, \tfrac{1}{D_v})$. Since PKM applies Softmax to the scores, the variance of the pooled outputs is $1/D_v$. We argue that the LML should be treated as a component similar to an MLP and should therefore use an MLP-like initialization. Before training, the output of an MLP typically follows a Gaussian distribution $\mathcal{N}(0, \tfrac{1}{2L})$ (Brown, 2020), where $L$ is the total number of layers. We initialize the values from $\mathcal{N}(0, \tfrac{E}{2mHL})$, where $m$ is the number of activated values, $H$ is the number of heads, and $E$ is the value expansion rate. To ensure that the output distribution of UltraMem is $\mathcal{N}(0, \tfrac{1}{2L})$, we need the mean of the top-m scores to be 1; see Appendix A for details.

4 QUANTITATIVE ANALYSIS: WHY ULTRAMEM INSTEAD OF MOE

The most effective method for enhancing model capacity without significantly raising computational costs is MoE. This strategy employs a set of specialized sub-models, known as "experts", which work together to tackle complex problems. However, the MoE model poses challenges for inference. Let the Transformer hidden dimension be $D$, so the inner dimension of the MLP is $4D$, and let the inference batch size be $B$. Take an MoE with configuration 2-in-$N_{\text{moe}}$ (choose 2 of $N_{\text{moe}}$ experts per token) as an example, where the inner dimension of each expert is $2D$. Assuming perfectly balanced expert choice, the memory access of a single MoE layer is $\min(2B, N_{\text{moe}}) \cdot 2D^2$. For UltraMem, assuming the value dimension is $D/2$ and each token activates the top-m values, the memory access is $\min(Bm, N) \cdot D/2$. As the batch size increases, the memory access of MoE grows rapidly until it reaches an upper limit where all expert parameters must be accessed. In contrast, the memory access of UltraMem increases very slowly, only reaching parity with MoE when the batch size is in the tens of thousands. In inference scenarios, however, the batch size is typically not that large. Figure 1 shows the inference time and memory access of a 1.6-billion-parameter Transformer with a 2-in-34 MoE and a x12 UltraMem.[1] For larger batch sizes, see Figure 7 in the Appendix. Compared to MoE, UltraMem achieves a maximum speedup of 6x at a batch size of 64, and also shows significant acceleration at other batch sizes.
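As a sketch, the two per-layer access formulas above can be evaluated directly. The constants below are illustrative assumptions (not the exact Figure 1 configuration), but they reproduce the qualitative behavior: MoE access saturates quickly at its full parameter count, while UltraMem access grows slowly with batch size, with the crossover point depending on m, D, and N:

```python
# Sketch of the Section 4 memory-access estimates (values read per step).
D = 2048                    # hidden dimension
N_moe = 34                  # MoE experts (2-in-34)
N, m = 20_000_000, 64       # UltraMem slots and activated values per token

for B in (1, 64, 512, 4096):
    moe = min(2 * B, N_moe) * 2 * D**2     # min(2B, N_moe) * 2D^2
    umem = min(B * m, N) * D / 2           # min(Bm, N) * D/2
    print(f"B={B:5d}  MoE/UltraMem access ratio: {moe / umem:8.1f}x")
```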
5 EXPERIMENTS

In this section, we demonstrate the scaling capabilities of UltraMem, showing that it outperforms MoE. We additionally show how the performance of UltraMem varies with different top-m values and numbers of parameters, and perform an ablation study to measure the impact of each component of UltraMem.

[1] The number of parameters in UltraMem is 12 times the number of parameters in the dense layer. In this case, the total parameters and total computation of UltraMem are the same as those of the 2-in-34 MoE.

Datasets. Training data comes from RedPajama (Computer, 2023), containing 1 trillion tokens. RedPajama is a clean-room, fully open-source reproduction of the LLaMA (Touvron et al., 2023) dataset. Validation data is the C4 validation set (Raffel et al., 2020), derived from the Common Crawl web corpus; the C4 training set is also part of the RedPajama training data. The tokenizer is based on the GPT-NeoX (Black et al., 2022) tokenizer, which uses the Byte-Pair Encoding (BPE) (Sennrich et al., 2015) algorithm and has a vocabulary size of 50,432.

Evaluation. We conducted a comprehensive evaluation of all models across ten benchmark datasets: MMLU, TriviaQA, GPQA, and ARC for assessing the models' knowledge capabilities; BBH, BoolQ, HellaSwag, and WinoGrande for evaluating reasoning skills; DROP for testing reading comprehension; and AGIEval for measuring overall model performance. The decoding hyperparameters are aligned with those of LLaMA 3 (Dubey et al., 2024). Details are given in Appendix E.

Training details. We used a standard pre-norm Transformer (Xiong et al., 2020) with rotary embeddings (Su et al., 2024) for our dense models, which have 151M, 680M, 1.6B, and 6.5B parameters.[2] For sparse models, including UltraMem, PKM, and MoE, we expand the sparse parameters twelvefold from the 151M, 680M, and 1.6B dense models. In MoE models, two experts are activated per token (Jiang et al., 2024), using a balance loss (Fedus et al., 2022) weight of 0.01 to ensure even expert selection. We slightly increased the width of the MoE experts to match UltraMem's computational and parameter costs. In UltraMem models, the auxiliary loss weight is α = 0.001 and the margin is τ = 0.15. The learning rate for the values is ten times that of the other parameters and decays linearly. For model structure and hyperparameter details, see Appendix E; for large-scale training optimizations, see Appendices C and D.

5.2 EVALUATION ON LANGUAGE MODELING DATASETS

We evaluate models of various sizes; the results are shown in Table 1,[3] where FLOPs is the computation cost of a single token. Curves showing the changes over the course of training are provided in Figure 11 in the Appendix. We observe that as model capacity increases, UltraMem outperforms PKM and MoE at the same parameter count and computation. On top of the 1.6B dense model, an UltraMem model with 12x the parameters matches the performance of a 6.5B dense model.

Table 1: Performance metrics of various models. Param is in billions, FLOPs in GFLOPs per token, and Val. loss is the C4 validation loss.
| Model | Param (B) | FLOPs (G) | Val. loss | GPQA | TriviaQA | BBH cot | HellaSwag | WinoGrande | DROP | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Dense-151M | 0.15 | 0.30 | 2.96 | 19.98 | 12.67 | 22.57 | 35.07 | 52.49 | 13.60 | 26.06 |
| PKM-151M-x12 | 2.04 | 0.35 | 2.76 | 17.30 | 24.66 | 23.14 | 42.25 | 51.38 | 13.10 | 28.64 |
| MoE-151M-2in32 | 2.04 | 0.35 | 2.63 | 17.30 | 33.27 | 23.24 | 48.44 | 55.96 | 18.57 | 33.20 |
| UltraMem-151M-x12 | 2.03 | 0.35 | 2.67 | 19.42 | 28.97 | 22.65 | 43.96 | 50.83 | 14.08 | 29.99 |
| Dense-680M | 0.68 | 1.36 | 2.64 | 21.09 | 27.16 | 24.65 | 48.83 | 54.93 | 22.97 | 33.27 |
| PKM-680M-x12 | 8.95 | 1.50 | 2.46 | 20.65 | 46.31 | 26.97 | 57.32 | 61.72 | 25.20 | 39.70 |
| MoE-680M-2in33 | 8.95 | 1.50 | 2.39 | 20.54 | 34.19 | 26.63 | 62.71 | 59.98 | 26.54 | 38.43 |
| UltraMem-680M-x12 | 8.93 | 1.49 | 2.37 | 21.99 | 55.17 | 26.62 | 64.15 | 60.54 | 25.14 | 42.27 |
| Dense-1.6B | 1.61 | 3.21 | 2.49 | 21.76 | 39.65 | 26.41 | 58.60 | 61.72 | 22.63 | 38.46 |
| PKM-1.6B-x12 | 21.13 | 3.48 | 2.34 | 22.99 | 48.92 | 28.98 | 65.45 | 63.93 | 27.55 | 42.97 |
| MoE-1.6B-2in34 | 21.36 | 3.52 | 2.30 | 21.32 | 59.56 | 29.46 | 67.34 | 63.93 | 28.81 | 45.07 |
| UltraMem-1.6B-x12 | 21.41 | 3.50 | 2.24 | 24.66 | 66.38 | 30.63 | 71.52 | 66.38 | 29.99 | 48.26 |
| Dense-6.5B | 6.44 | 12.88 | 2.30 | 19.98 | 57.28 | 31.14 | 69.73 | 65.90 | 33.12 | 46.19 |

[2] Excludes tokenizer vocabulary embedding and prediction head parameters.
[3] This table only includes evaluation results where the metrics steadily increased with training. For all results, see Table 7 in the Appendix.

[Figure 6: (a) C4 validation loss of different models at different scales. (b) Scaling curves at different sparsity levels (20K-640K) with 151M activated parameters; each line represents the same model sparsity, e.g., 20K indicates that approximately one out of every 20,000 values is activated. The loss decreases linearly as the sparse parameters increase exponentially. (c) Inference time for UltraMem (top-m from 8 to 512) and MoE with 1.6B activated parameters; the batch size is 512, the sequence length is 1, and the key/value cache length is 2048. With fixed activated parameters, UltraMem's inference time remains nearly constant as sparse parameters increase, while MoE's inference time increases significantly.]

5.3 VALUE NUMBER AND TOP-m

In most sparse LLMs, such as MoE and UltraMem, there is a clear positive correlation between sparsity and model performance. In this section, we therefore conduct a series of scaling experiments, varying the selected top-m and the number of values, i.e., the parameters of the sparse modules, to measure how model performance changes with sparsity. The results are shown in Figure 6(b). At the same level of sparsity, the validation loss decreases steadily as the number of parameters increases. Additionally, the lower the sparsity, i.e., the larger the proportion of activated parameters, the better the model performance; however, this also incurs higher memory access overhead. There is thus a trade-off between memory access volume and scaling efficiency. In our final experiments, we selected a sparsity ratio of 80K as the default model configuration.
As sparse parameters increase, Figure 6(c) shows that UltraMem maintains a stable inference time despite exponential growth in parameters, as long as the activated parameters (top-m) stay constant. In contrast, MoE's inference time rises significantly under the same conditions. Additionally, Figure 1(b) demonstrates that with smaller batch sizes, MoE's inference speed deteriorates even further relative to UltraMem.

5.4 ABLATION

We conduct comprehensive ablation studies based on the 151M dense model. The baseline PKM operates in parallel with the MLP, making it a stronger baseline. For this group of experiments, the learning rate (LR) is 1.2e-4, training uses 500B tokens, and we evaluate the cross-entropy loss on the training and C4 validation sets. We ensure that the parameter count and computational cost of the final version of the model remain at essentially the same level as the baseline. Table 2 shows the ablation results. We identify six changes that significantly improve performance:

1. Doubling the number of values while halving their dimension, and simultaneously doubling the top-m selections to keep the activated parameters constant.
2. Splitting a single UltraMem into multiple smaller units spread evenly across the transformer layers, with outputs skipping several blocks. This arrangement keeps the total parameter count, computational cost, and sparse parameter activation at or below pre-split levels.
3. Tucker Decomposed Query-Key Retrieval, which introduces negligible additional parameters while reducing computation; here r = 2.
4. Multi-Core Scoring, which significantly reduces training loss and slightly reduces validation loss; here h = 2.
5. Implicit Value Expansion, which slightly increases both the parameter count and the computational cost, but yields a significant improvement; here E = 4.
6. Starting the LR for the value parameters at ten times that of the other parameters and linearly decaying it to match them by the end of training.

Table 2: Ablation study of model improvements.

| Change | Train Loss | Δ | Valid. Loss | Δ | Dense Param. (M) | Sparse Param. (G) | FLOPs (M) |
|---|---|---|---|---|---|---|---|
| PKM-151M-x10 | 2.604 | | 2.828 | | 173.01 | 1.534 | 346.06 |
| +rm softmax | 2.570 | -0.034 | 2.822 | -0.006 | 173.01 | 1.534 | 346.06 |
| +half vdim+proj | 2.556 | -0.014 | 2.800 | -0.022 | 178.47 | 1.529 | 356.98 |
| +share query | 2.560 | +0.004 | 2.803 | +0.003 | 173.46 | 1.529 | 346.96 |
| +split big mem&skip | 2.554 | -0.006 | 2.788 | -0.015 | 161.64 | 1.536 | 323.32 |
| +query/key LN | 2.553 | -0.001 | 2.789 | +0.001 | 161.64 | 1.536 | 323.54 |
| +IVE | 2.544 | -0.009 | 2.772 | -0.017 | 172.37 | 1.536 | 344.98 |
| +TDQKR | 2.538 | -0.006 | 2.764 | -0.008 | 172.37 | 1.536 | 344.98 |
| +MCS | 2.521 | -0.017 | 2.761 | -0.003 | 172.37 | 1.536 | 344.98 |
| +improved init | 2.518 | -0.003 | 2.758 | -0.003 | 172.37 | 1.536 | 344.98 |
| +value lr decay | 2.494 | -0.024 | 2.736 | -0.022 | 172.37 | 1.536 | 344.98 |
| +query conv | 2.493 | -0.001 | 2.736 | -0.000 | 172.38 | 1.536 | 345.02 |
| Total Diff | | -0.111 | | -0.092 | -0.64 | +0.002 | -1.04 |

Among the other changes, sharing the query helps cut computational costs with a minor trade-off in performance. Normalizing the query/key greatly reduces spikes in training perplexity and enhances training stability, as shown in Figure 10(a). The improved initialization prevents score and output variance explosions in the early to middle training stages, detailed in Figure 10(b) and (c). Additionally, the query convolution further limits the variance divergence of UltraMem outputs (Figure 10(c)). The above results are based on incremental ablation; results from independent ablation can be found in Table 8 in the Appendix, and they align with our expectations.
Besides, we conduct further ablation studies on IVE, TDQKR, and MCS with different configurations, documented in Table 3. For IVE, as E increases, model performance improves consistently alongside a notable increase in computational cost; however, the marginal gains shrink as E rises, leading us to recommend E = 4. For TDQKR and MCS, increasing r and h does not significantly change the computational load, but the effectiveness no longer shows marked improvement, hence we suggest using r = 2 and h = 2.

Table 3: Ablation of different configurations of IVE, TDQKR, and MCS.

| | IVE Baseline | E=4 | E=9 | E=16 | TDQKR Baseline | r=2 | r=3 | r=4 | MCS Baseline | h=2 | h=4 | h=8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Training loss | 2.553 | -0.009 | -0.016 | -0.019 | 2.544 | -0.006 | -0.0065 | -0.0063 | 2.538 | -0.017 | -0.017 | -0.012 |
| Validation loss | 2.789 | -0.017 | -0.025 | -0.027 | 2.772 | -0.008 | -0.0084 | -0.0082 | 2.764 | -0.003 | +0.001 | +0.006 |
| FLOPs (G) | 323.54 | +6.6% | +14.9% | +26.4% | 344.98 | +0.001% | +0.002% | +0.003% | 344.98 | +0.001% | +0.003% | +0.007% |

6 CONCLUSION

In this paper, we introduce UltraMem, which, compared to MoE, has minimal memory access and therefore achieves up to a sixfold speed advantage. Concurrently, in terms of performance, UltraMem surpasses MoE with the same parameters and computation as model capacity increases, indicating its superior scaling capability. This work presents a promising direction for developing more efficient and scalable language models.

ACKNOWLEDGMENTS

We extend our deepest gratitude to Pingshuo Ma and Wenda Liu for their invaluable assistance in optimizing the early-stage training of large-scale UltraMem. We also appreciate the inference optimization work for UltraMem carried out by Siyan Chen, as well as Fan Xia's efforts in assessing the inference speed of MoE.

REFERENCES

Hervé Abdi. Singular value decomposition (SVD) and generalized singular value decomposition. Encyclopedia of Measurement and Statistics, 907(912):44, 2007.

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023.

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

Daniel Bershatsky, Daria Cherniuk, Talgat Daulbaev, Aleksandr Mikhalev, and Ivan Oseledets. LoTR: Low tensor rank weight adaptation. arXiv preprint arXiv:2402.01376, 2024.

Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, et al. GPT-NeoX-20B: An open-source autoregressive language model. arXiv preprint arXiv:2204.06745, 2022.

Tom B Brown. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.

Zewen Chi, Li Dong, Shaohan Huang, Damai Dai, Shuming Ma, Barun Patra, Saksham Singhal, Payal Bajaj, Xia Song, Xian-Ling Mao, et al. On the representation collapse of sparse mixture of experts. Advances in Neural Information Processing Systems, 35:34600-34613, 2022.

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457, 2018.

Together Computer. RedPajama: An open source recipe to reproduce LLaMA training dataset, 2023. URL https://github.com/togethercomputer/RedPajama-Data.
Róbert Csordás, Kazuki Irie, and Jürgen Schmidhuber. Approximating two-layer feedforward networks for efficient transformers. arXiv preprint arXiv:2310.10837, 2023.

Damai Dai, Chengqi Deng, Chenggang Zhao, RX Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y Wu, et al. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066, 2024.

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. arXiv preprint arXiv:1903.00161, 2019.

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1-39, 2022.

Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. arXiv preprint arXiv:2012.14913, 2020.

Xu Owen He. Mixture of a million experts. arXiv preprint arXiv:2407.04153, 2024.

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.

AG Howard. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

Herve Jegou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):117-128, 2010.

Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.

Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017.

Gyuwan Kim and Tae-Hwan Jung. Large product key memory for pretrained language models. arXiv preprint arXiv:2010.03881, 2020.

Jakub Krajewski, Jan Ludziejewski, Kamil Adamczewski, Maciej Pióro, Michał Krutul, Szymon Antoniak, Kamil Ciebiera, Krystian Król, Tomasz Odrzygóźdź, Piotr Sankowski, et al. Scaling laws for fine-grained mixture of experts. arXiv preprint arXiv:2402.07871, 2024.

Guillaume Lample, Alexandre Sablayrolles, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. Large memory layers with product keys. Advances in Neural Information Processing Systems, 32, 2019.

Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024.

Osman Asif Malik and Stephen Becker. Low-rank Tucker decomposition of large tensors using TensorSketch. Advances in Neural Information Processing Systems, 31, 2018.
Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. Efficient large-scale language model training on GPU clusters using Megatron-LM. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-15, 2021.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1-67, 2020.

Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale. In International Conference on Machine Learning, pp. 18332-18346. PMLR, 2022.

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. arXiv preprint arXiv:2311.12022, 2023.

Stephen Roller, Sainbayar Sukhbaatar, Jason Weston, et al. Hash layers for large sparse models. Advances in Neural Information Processing Systems, 34:17555-17566, 2021.

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM, 64(9):99-106, 2021.

Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.

Kai Shen, Junliang Guo, Xu Tan, Siliang Tang, Rui Wang, and Jiang Bian. A study on ReLU and Softmax in Transformer. arXiv preprint arXiv:2302.06461, 2023.

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, et al. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tieyan Liu. On layer normalization in the transformer architecture. In International Conference on Machine Learning, pp. 10524-10533. PMLR, 2020.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.
Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. AGIEval: A human-centric benchmark for evaluating foundation models. arXiv preprint arXiv:2304.06364, 2023.

Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M Dai, Quoc V Le, James Laudon, et al. Mixture-of-experts with expert choice routing. Advances in Neural Information Processing Systems, 35:7103-7114, 2022.

A ULTRAMEM INITIALIZATION

We initialize the values from $\mathcal{N}(0, \tfrac{E}{2mHL})$, where $m$ is the number of activated values, $H$ is the number of heads, and $E$ is the value expansion rate. To ensure that the output distribution of UltraMem is $\mathcal{N}(0, \tfrac{1}{2L})$, we need the mean of the top-m scores to be 1. Assuming the candidate scores follow $\mathcal{N}(0, 1)$ and $m \ll N$, we can simplify the problem as follows: given $N$ standard Gaussian random variables $X_1, \ldots, X_N$ and the random variable $Y = \text{mean}(\text{top-}m(X_1, \ldots, X_N))$, find the expected value $E(Y)$. It is difficult to obtain an analytical solution for $E(Y)$, so we approximate it by sampling $N$ points from a Gaussian distribution $M$ times and averaging the mean of the top-m values. We then initialize the query LayerNorm weight as $1/\sqrt{E(Y)}$ and the key LayerNorm weight as $1/\sqrt{D_k}$ to ensure the expected candidate score is 1.
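A short Monte Carlo sketch of this estimate follows; the sampling sizes M, N, and m are illustrative assumptions, as the paper does not specify them:

```python
import torch

# Monte Carlo estimate of E(Y) = E[mean(top-m of N standard Gaussians)],
# used to set the query LayerNorm gain to 1/sqrt(E(Y)). Sizes illustrative;
# increase M for a tighter estimate.
M, N, m = 200, 1_000_000, 64        # trials, candidate scores, top-m

total = 0.0
for _ in range(M):
    x = torch.randn(N)
    total += x.topk(m).values.mean().item()
ey = total / M

print(f"E(Y) ~ {ey:.3f}, query LN gain 1/sqrt(E(Y)) ~ {ey ** -0.5:.3f}")
```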
B INFERENCE TIME AND MEMORY ACCESS

Figure 7 shows that UltraMem has much slower growth in memory access than MoE, only matching MoE's memory access when the batch size reaches 131,072, and it retains an advantage in inference speed throughout.

[Figure 7: Inference time and memory access of the dense Transformer, MoE, and UltraMem at larger batch sizes. We ensured that the three models have the same computation, and that MoE and UltraMem have the same parameters. Both axes are plotted on a logarithmic scale. The sequence length is 1 because during inference we can only predict one token at a time, and the key/value cache length is 2048. The models run on an A100-SXM.]

C MEGATRON SUPPORT FOR TRAINING EFFICIENCY

As the memory table scales towards billions or even trillions of parameters, model parallelism becomes essential to distribute model parameters and optimizer states across multiple devices, both so they fit into device memory and so the model is trainable within a reasonable time frame. We leverage Megatron's (Shoeybi et al., 2019; Narayanan et al., 2021) 3D parallelism (pipeline parallelism, data parallelism, and tensor parallelism) for training. However, several modifications to the parallelism are required to support the memory table effectively: pipeline parallelism cannot handle a single layer whose parameters exceed the memory capacity of one device, and tensor parallelism is typically limited to a relatively small group of GPUs, which is insufficient for the memory table's requirements. Consequently, we propose sharding the memory table across a combination of data-parallel and tensor-parallel groups, or their subgroups, to ensure efficient distribution and scalability. The memory table can be partitioned either number-wise or dimension-wise. The full process of number-wise and dimension-wise partitioning, their communication-volume analysis, and guidance on choosing between them are detailed in Appendix D. Among our structural improvements, halving the value dimension reduces the communication overhead of both number-wise and dimension-wise partitioning, whereas increasing top-m proportionally increases it. Additionally, Implicit Value Expansion, which enlarges the values after weighted sum pooling, further affects the communication volume of dimension-wise partitioning.

To further improve performance, two key modifications have been implemented:

Fused Lookup-Reduce Operator: this newly introduced operator accelerates computation and reduces memory usage by combining the lookup and weighted sum pooling operations into a single, more efficient step.

Asynchronous Execution Strategy: recognizing the benefits of cross-layer use of the memory layer, we adopt an asynchronous execution strategy that processes memory calculations concurrently with the dense network operations, substantially enhancing overall system performance.

These enhancements demonstrate the efficacy of our parallelism strategy within the Megatron framework, paving the way for more efficient training of large-scale models.

D NUMBER-WISE AND DIMENSION-WISE PARTITION DETAILS

[Figure 8: Process of (a) number-wise partitioning and (b) dimension-wise partitioning. The weighted sum pooling step is omitted from the diagram.]

Figure 8 shows the process of number-wise and dimension-wise partitioning. For number-wise partitioning, we first perform an all-to-all on the indices to distribute them to the corresponding devices; after the lookup, the results are sent back to the original devices, which then perform the weighted sum pooling. For dimension-wise partitioning, we perform an all-gather on the indices to obtain all indices across devices; the lookup is then performed, and dimension-wise partitioning allows the results to be sent back to each device after the weighted sum pooling has been completed. Assuming the memory table is distributed across P processors, the communication volume is as follows.

Number-wise partitioning (not considering index deduplication):
- All-to-all transmission of indices: sizeof(int) * bs * topm * (P - 1)/P
- All-to-all transmission of embeddings after lookup: sizeof(bfloat16) * bs * topm * v_dim * (P - 1)/P

Dimension-wise partitioning:
- All-gather of indices: sizeof(int) * bs * topm * (P - 1)
- All-gather of scores: sizeof(bfloat16) * bs * topm * (P - 1)
- All-to-all transmission of embeddings after lookup-reduction: sizeof(bfloat16) * bs * v_dim * (P - 1)/P

Here v_dim is the value dimension and bs is the batch size times the sequence length.

[Figure 9: Relationship between P and v_dim at which the communication volume ratio of number-wise to dimension-wise partitioning equals 1, for top-m in {2, 8, 256}; in the shaded region the ratio exceeds 1. This helps choose the appropriate partitioning method for a fixed configuration.]
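As a sketch, the formulas above can be evaluated directly to choose a partitioning for a given configuration; the byte sizes and the example configuration below are illustrative assumptions:

```python
# Sketch of the Appendix D communication-volume formulas (bytes per step).
INT, BF16 = 4, 2                                  # bytes per index / element

def number_wise(P, bs, topm, v_dim):
    idx = INT * bs * topm * (P - 1) / P           # all-to-all of indices
    emb = BF16 * bs * topm * v_dim * (P - 1) / P  # all-to-all of embeddings
    return idx + emb

def dimension_wise(P, bs, topm, v_dim):
    idx = INT * bs * topm * (P - 1)               # all-gather of indices
    score = BF16 * bs * topm * (P - 1)            # all-gather of scores
    emb = BF16 * bs * v_dim * (P - 1) / P         # all-to-all after reduction
    return idx + score + emb

P, bs, v_dim = 64, 2048 * 2048, 1024              # illustrative configuration
for topm in (2, 8, 256):
    ratio = number_wise(P, bs, topm, v_dim) / dimension_wise(P, bs, topm, v_dim)
    print(f"top-m={topm:3d}: number-wise / dimension-wise = {ratio:.2f}")
```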
E EXPERIMENT SETTING

Table 4 displays common hyper-parameter settings for all experiments. LR stands for learning rate; the values 6e-4, 2.5e-4, 2e-4, and 1.2e-4 correspond to dense models of size 151M, 680M, 1.6B, and 6.5B, respectively (Brown, 2020). Regarding the insertion of UltraMem and PKM: for UltraMem-151M it is 3:5/6:8/9:11, where 3:5 indicates that the UltraMem input is taken from layer 3 and inserted back into the output of layer 5, and so on. For UltraMem-680M it is 3:7/8:12/13:17/18:22; for UltraMem-1.6B it is 3:7/8:12/13:17/18:22/23:27/28:32. For PKM-151M it is 6:6, for PKM-680M 12:12, and for PKM-1.6B 16:16. The settings for UltraMem and MoE models align with their dense counterparts of the same dense parameter size. Table 6 shows the model parameter settings used in the scaling experiments, and the common settings for UltraMem are shown in Table 5.

Table 4: Training hyper-parameters.

| Configuration Key | Value |
|---|---|
| Weight decay | 0.1 |
| β1 | 0.9 |
| β2 | 0.95 |
| LR | 6e-4 / 2.5e-4 / 2e-4 / 1.2e-4 |
| LR end ratio | 0.1 |
| LR schedule | cosine |
| LR warmup ratio | 0.01 |
| Dropout | 0.1 |
| Batch size | 2048 |
| Sequence length | 2048 |
| Training steps | 238,418 |

Table 5: Common UltraMem configuration.

| Configuration Key | Value |
|---|---|
| Tucker rank r | 2 |
| Multi-core scoring h | 2 |
| Virtual memory expansion E | 4 |
| Aux loss weight α | 0.001 |
| Aux loss margin τ | 0.15 |

Evaluation datasets. We use 10 benchmarks to evaluate all models:
1. Knowledge: Massive Multitask Language Understanding (MMLU) (Hendrycks et al., 2020), TriviaQA (Joshi et al., 2017), Graduate-Level Google-Proof Q&A Benchmark (GPQA) (Rein et al., 2023), AI2 Reasoning Challenge (ARC) (Clark et al., 2018).
2. Reasoning: BIG-Bench Hard (BBH) (Suzgun et al., 2022), Boolean Questions (BoolQ) (Clark et al., 2019), HellaSwag (Hella) (Zellers et al., 2019), WinoGrande (Wino) (Sakaguchi et al., 2021).
3. Reading comprehension: Discrete Reasoning Over Paragraphs (DROP) (Dua et al., 2019).
4. Comprehensive ability: AGIEval (Zhong et al., 2023).

Table 6: Model parameter settings. Top-m means the number of chosen experts in MoE, and the number of chosen values times the number of heads in PKM and UltraMem. Kdim is the key dimension in PKM and UltraMem. Knum is the number of keys; Knum^2 is the number of values.

| Model | Hidden Dim | FFN/Expert Dim | Heads | Layers | Top-m | Expert Num | Kdim | Knum | Mem Layers | Param (B) | FLOPs (G) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Dense-151M | 1024 | 4096 | 16 | 12 | - | - | - | - | - | 0.15 | 0.30 |
| Dense-680M | 1536 | 6144 | 16 | 24 | - | - | - | - | - | 0.68 | 1.36 |
| Dense-1.6B | 2048 | 8192 | 16 | 32 | - | - | - | - | - | 1.61 | 3.21 |
| Dense-6.5B | 4096 | 16384 | 32 | 32 | - | - | - | - | - | 6.44 | 12.88 |
| MoE-151M-2in32 | 1024 | 2528 | 16 | 12 | 2 | 32 | - | - | - | 2.04 | 0.35 |
| MoE-680M-2in33 | 1536 | 3584 | 16 | 24 | 2 | 33 | - | - | - | 8.95 | 1.50 |
| MoE-1.6B-2in34 | 2048 | 4672 | 16 | 32 | 2 | 34 | - | - | - | 21.36 | 3.52 |
| PKM-151M-x12 | 1024 | 4096 | 16 | 12 | 16x6 | - | 512 | 1347 | 1 | 2.04 | 0.35 |
| PKM-680M-x12 | 1536 | 6144 | 16 | 24 | 35x8 | - | 768 | 2308 | 1 | 8.95 | 1.50 |
| PKM-1.6B-x12 | 2048 | 8192 | 16 | 32 | 42x12 | - | 896 | 1792 | 1 | 21.44 | 3.52 |
| UltraMem-151M-x10 | 1024 | 4096 | 16 | 12 | 16x2 | - | 256 | 1024 | 3 | 1.71 | 0.35 |
| UltraMem-151M-x12 | 1024 | 4096 | 16 | 12 | 16x2 | - | 256 | 1100 | 3 | 2.03 | 0.35 |
| UltraMem-680M-x12 | 1536 | 6144 | 16 | 24 | 35x2 | - | 384 | 1632 | 4 | 8.93 | 1.49 |
| UltraMem-1.6B-x12 | 2048 | 8192 | 16 | 32 | 42x2 | - | 448 | 1792 | 6 | 21.41 | 3.50 |

F MORE EXPERIMENT RESULTS
[Figure 10: Model training state details. (a) Training perplexity with and without query/key LN; (b) top-1 score for the baseline vs. improved initialization; (c) standard deviation of the UltraMem final-layer output for the baseline, + improved init., and + query conv. Top-1 score refers to the highest score among the retrieved keys. UltraMem output std represents the standard deviation of the outputs from the last UltraMem layer.]

Table 7: All performance metrics of various models.

| Model | Param (B) | FLOPs (G) | ARC-C | GPQA | TriviaQA | MMLU | BBH cot | BoolQ | HellaSwag | WinoGrande | AGIEval | DROP | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dense-151M | 0.15 | 0.30 | 25.60 | 19.98 | 12.67 | 26.50 | 22.57 | 50.15 | 35.07 | 52.49 | 9.03 | 13.60 | 26.77 |
| PKM-151M-x12 | 2.04 | 0.35 | 25.94 | 17.30 | 24.66 | 25.69 | 23.14 | 53.48 | 42.25 | 51.38 | 9.65 | 13.10 | 28.66 |
| MoE-151M-2in32 | 2.04 | 0.35 | 26.96 | 17.30 | 33.27 | 26.58 | 23.24 | 55.96 | 48.44 | 55.96 | 9.34 | 18.57 | 31.56 |
| UltraMem-151M-x12 | 2.03 | 0.35 | 25.68 | 19.42 | 28.97 | 25.62 | 22.65 | 47.74 | 43.96 | 50.83 | 10.00 | 14.08 | 28.89 |
| Dense-680M | 0.68 | 1.36 | 24.06 | 21.09 | 27.16 | 24.64 | 24.65 | 46.42 | 48.83 | 54.93 | 9.44 | 22.97 | 30.42 |
| PKM-680M-x12 | 8.95 | 1.50 | 25.51 | 20.65 | 46.31 | 25.22 | 26.98 | 41.80 | 57.32 | 61.72 | 8.94 | 25.20 | 33.97 |
| MoE-680M-2in33 | 8.95 | 1.50 | 25.17 | 20.54 | 34.19 | 24.38 | 26.63 | 43.70 | 62.71 | 59.98 | 7.39 | 26.54 | 33.13 |
| UltraMem-680M-x12 | 8.93 | 1.49 | 23.72 | 21.99 | 55.17 | 24.97 | 26.62 | 48.20 | 64.15 | 60.54 | 8.26 | 25.14 | 35.88 |
| Dense-1.6B | 1.61 | 3.21 | 26.30 | 21.76 | 39.65 | 26.19 | 26.41 | 51.50 | 58.60 | 61.72 | 9.22 | 22.63 | 34.81 |
| PKM-1.6B-x12 | 21.13 | 3.48 | 26.71 | 22.99 | 48.92 | 24.80 | 28.98 | 60.06 | 65.46 | 63.93 | 9.51 | 27.55 | 37.89 |
| MoE-1.6B-2in34 | 21.36 | 3.52 | 25.43 | 21.32 | 59.56 | 26.18 | 29.46 | 42.78 | 67.34 | 63.93 | 6.63 | 28.81 | 37.14 |
| UltraMem-1.6B-x12 | 21.41 | 3.50 | 25.94 | 24.66 | 66.38 | 24.67 | 30.63 | 59.80 | 71.52 | 66.38 | 8.77 | 29.99 | 40.88 |
| Dense-6.5B | 6.44 | 12.88 | 28.16 | 19.98 | 57.28 | 27.68 | 31.14 | 68.20 | 69.73 | 65.90 | 9.23 | 33.12 | 41.04 |

Table 8: Independent ablation study of model improvements (deltas relative to the PKM-151M-x10 baseline).

| | Train Loss | Valid. Loss |
|---|---|---|
| PKM-151M-x10 | 2.604 | 2.828 |
| + rm softmax | -0.034 | -0.006 |
| + half vdim+proj | -0.027 | -0.02 |
| + share query | -0.003 | -0.002 |
| + split big mem | -0.003 | -0.005 |
| + query/key LN | -0.002 | +0.003 |
| + IVE | -0.025 | -0.023 |
| + TDQKR | -0.003 | -0.007 |
| + TDQKR + MCS | -0.02 | -0.009 |
| + value lr decay | -0.017 | -0.007 |
| + query conv | -0.005 | -0.001 |

[Figure 11: The change in accuracy on all observable evaluations throughout training, for all dense, MoE, PKM, and UltraMem models: (a) average accuracy, (b) BBH-cot-3shot accuracy, (c) DROP accuracy, (d) HellaSwag accuracy, (e) WinoGrande 5-shot accuracy, (f) TriviaQA 5-shot accuracy, each plotted against consumed tokens (billions).]