# TA-MoE: Topology-Aware Large Scale Mixture-of-Expert Training

Chang Chen^1, Min Li^{2,3}, Zhihua Wu^4, Dianhai Yu^4, Chao Yang^{2,3,5}

^1 Center for Data Science, Peking University
^2 School of Mathematical Sciences, Peking University
^3 National Engineering Laboratory for Big Data Analysis and Applications, Peking University
^4 Baidu Inc.
^5 Institute for Computing and Digital Economy, Peking University

{charlie_chen, chao_yang}@pku.edu.cn, limin_cn@163.com, {wuzhihua02, yudianhai}@baidu.com

## Abstract

Sparsely gated Mixture-of-Expert (MoE) has demonstrated its effectiveness in scaling up deep neural networks to an extreme scale. Although numerous efforts have been made to improve the performance of MoE from the model design or system optimization perspective, existing MoE dispatch patterns are still unable to fully exploit the underlying heterogeneous network environments. In this paper, we propose TA-MoE, a topology-aware routing strategy for large-scale MoE training, from a model-system co-design perspective, which can dynamically adjust the MoE dispatch pattern according to the network topology. Based on communication modeling, we abstract the dispatch problem into an optimization objective and obtain the approximate dispatch pattern under different topologies. On top of that, we design a topology-aware auxiliary loss, which can adaptively route the data to fit the underlying topology without sacrificing model accuracy. Experiments show that TA-MoE can substantially outperform its counterparts on various hardware and model configurations, with roughly 1.01x-1.61x, 1.01x-4.77x, and 1.25x-1.54x improvements over the popular DeepSpeed-MoE, FastMoE, and FasterMoE systems, respectively.

Equal contribution. Work done during an internship at Baidu Inc. Corresponding author.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

## 1 Introduction

The scale of model parameters in neural networks has increased from millions to trillions in recent years, which has promoted model accuracy in many domains, such as language processing [3, 4, 5] and computer vision [27, 23]. However, limited hardware resources, e.g., memory capacity and communication bandwidth, have constrained further scaling of the model size. To relieve this tension and improve model performance, Mixture of Expert (MoE) with a sparsely gated structure was recently reintroduced [16, 26, 25]. The core structure of MoE is a group of small "expert" networks and a gate network. Guided by the gate result, input data is dynamically routed to only a sub-group of experts for computation. Compared with dense training methods, the sparsely activated nature of MoE can significantly reduce the computation burden, extend the model size, and achieve higher accuracy [6, 7, 11, 13].

Since MoE plays a vital role in large-scale model training, efficient MoE parallel training has recently received much attention. As one of the most popular MoE training approaches (Figure 1), expert parallelism [11, 7] distributes experts to different devices, and each device is responsible for a different batch of training samples. Correspondingly, extra global communication is necessary for data exchanges among devices. Recent works aim to improve expert parallelism performance from two aspects. On the one hand, the dynamic pattern of MoE results in severe computation load-imbalance problems, in which a small number of experts may receive, process, and send the majority of the data.
Several approaches have been proposed to make full use of the available experts, such as adding an auxiliary loss [26], controlling expert capacity [11, 7], and optimizing the assignment scheme for a balanced load [12, 28, 22]. On the other hand, global communication is another main obstacle to efficient MoE training. Most of the recent works reduce the communication cost from a system perspective, such as overlapping computation and communication [9], accelerating customized communication operations [21, 18], and adaptive routing [17].

In addition to these continuing efforts to improve the performance of MoE, two major challenges remain. First, with the development of complicated distributed network environments, the existing even dispatch method may cause network contention on the slowest links, leading to poor communication performance, especially on heterogeneous networks. Although a few early works [9] have proposed methods to dispatch less data over slow links, these methods may make the expert load imbalanced and could influence the model accuracy. Efficient communication demands more delicate dispatch strategies, and how to improve training efficiency without sacrificing model accuracy is still worth studying. Second, most of the existing communication optimizations for MoE [21, 17] are studied on a specific hardware environment. How to develop methods that can adapt to a variety of hardware environments is also of great practical value.

To tackle these challenges, we design TA-MoE, a topology-aware large-scale MoE training method that can adaptively adjust the communication volume to fit the underlying network topology. By abstracting the dispatch problem into an optimization objective based on communication modeling, we obtain the approximate dispatch pattern under different topologies. On top of that, an auxiliary topology loss with pattern-related coefficients is proposed, which can dynamically adjust the communication volume without interfering with model convergence. TA-MoE can also be easily incorporated into widely used MoE systems, such as DeepSpeed-MoE [21] and FastMoE [8].

We conduct experiments on various typical network topologies and model configurations. Results show that TA-MoE can substantially outperform DeepSpeed-MoE and FastMoE, with roughly 1.01x-1.61x and 1.01x-4.77x speedups on different configurations, without sacrificing model accuracy. Compared with the recently proposed Hir gate of FasterMoE, our method achieves a 1.25x-1.54x speedup in time to convergence. Besides, a more detailed analysis of the communication and data dispatch patterns further demonstrates the effectiveness of the proposed data dispatch strategy. The code of TA-MoE is available at: https://github.com/Chen-Chang/TA-MoE

## 2 Related Work

Several frameworks have featured sophisticated designs to support efficient MoE training. GShard [11] and DeepSpeed-MoE [21] subtly composed several einsum operators into the computation of MoE but introduced redundant zero computation and extra memory consumption. FastMoE [8] customized the essential computation kernels to improve resource utilization effectively. To further enhance performance, most of these systems adopted an auxiliary loss [26] to achieve an even dispatch pattern and enforced the number of tokens processed by each expert to stay below a uniform capacity.

Based on these popular implementations, recent works aim to improve MoE training performance mainly from two aspects: model structure design and communication optimization. From the perspective of model design, BASE Layer [12] and the work on expert choice routing [28] assigned an equal number of tokens to each expert through delicate gate designs. Instead of learning the weight of the gate, Hash Layers [22] adopted an efficient hash function to guide the dispatch. The hybrid structure of PR-MoE [21] improved parameter efficiency by fixing one shared expert. BaGuaLu [15] re-distributed the data chunks evenly, damaging the model accuracy. However, almost all of these high-level algorithms are agnostic of the effect of the complicated underlying hardware on training performance. As for communication optimization, DeepSpeed-MoE [21] and HetuMoE [18] implemented a hierarchical all-to-all communication kernel to improve network utilization. Tutel [17] designed adaptive routing techniques coupled with a specific network architecture. Despite these delicate designs, the improvement space of system-level optimization is significantly constrained by the dispatch patterns of MoE. Recently, FasterMoE [9] made an initial attempt to take the dispatch pattern into consideration by setting a compulsory ratio of intra-node to inter-node dispatch chunk sizes, but it sacrificed some model accuracy. In this paper, we propose a topology-aware routing strategy that enables an efficient communication pattern to fit the underlying topology without sacrificing convergence performance.

## 3 Background

### 3.1 MoE Model Structure

Figure 1: The popular expert parallelism method of MoE.

A MoE layer consists of a gate network $G$ and a set of $N$ expert networks $E_0, \ldots, E_{N-1}$. For the gate module, the softmax activation function is widely used, reflecting each expert's normalized fitness for dealing with an incoming sample. Usually, only the experts with the top-$k$ fitness scores are selected to process the sample. The final output $y$ of the MoE layer is the aggregation of the computed results.

Expert parallelism has been one of the most popular methods in existing MoE training systems [21, 8]. As shown in Figure 1, the $N$ experts are evenly assigned to $P$ devices, with each device $i$ holding $E = N/P$ experts $E_{iE}, \ldots, E_{(i+1)E-1}$. Besides, the input tokens are also evenly partitioned over multiple devices into small batches of the same size $S$ in a traditional data-parallel way. For each process, the shape of the dispatched data is $[kS, d]$, where $d$ represents the hidden size. Each expert receives a combined batch of tokens (Global Exchange) and then carries out the computation. During the global communication, the number of samples sent to expert $E_e$ from process $i$ is $c_{ie}$, and the shape of the transferred samples is $[c_{ie}, d]$. Afterward, the expert sends the calculated results back with a similar global exchange pattern. However, the number of tokens processed by different experts may be highly imbalanced, in that a small group of experts may receive the majority of the data, like Expert 2 in Figure 1.
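
To make the gating and dispatch bookkeeping above concrete, here is a minimal PyTorch-style sketch of a top-$k$ softmax gate that produces the per-expert counts $c_{ie}$ on one process; it is an illustration under assumed tensor names and shapes, not the authors' implementation.

```python
import torch

def topk_gate_counts(tokens, gate_weight, k, num_experts):
    """Sketch of a top-k softmax gate for one process i.

    tokens:      [S, d] local mini-batch of S tokens with hidden size d
    gate_weight: [d, N] learnable gate parameters (N experts in total)
    Returns the top-k scores/indices per token and c_ie, the number of
    local tokens this process sends to each expert e in the global exchange.
    """
    gate_logits = tokens @ gate_weight                       # [S, N]
    gate_scores = torch.softmax(gate_logits, dim=-1)         # normalized fitness
    topk_scores, topk_experts = gate_scores.topk(k, dim=-1)  # [S, k]

    # c_ie: how many local tokens are routed to each expert e.
    c_ie = torch.bincount(topk_experts.reshape(-1), minlength=num_experts)
    return topk_scores, topk_experts, c_ie

# Toy usage: S = 8 tokens, d = 16, N = 4 experts, top-2 routing.
scores, experts, c_ie = topk_gate_counts(
    torch.randn(8, 16), torch.randn(16, 4), k=2, num_experts=4)
print(c_ie)  # e.g., tensor([5, 3, 4, 4]); always sums to k * S = 16
```

The dispatched tensor of shape $[kS, d]$ is then split according to $c_{ie}$ and exchanged globally, so a skewed $c_{ie}$ translates directly into skewed all-to-all traffic.
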

Therefore, a load-balance auxiliary loss term $l_{aux}$ [26] is added to the training loss as part of the overall loss function:

$$l_{aux} = \sum_{e=0}^{N-1} m_{ie} \cdot (c_{ie}/S), \qquad \text{where } m_{ie} = \sum_{x} G_e(x)/S \tag{1}$$

The auxiliary loss dynamically adjusts the value of $c_{ie}$ toward the target $kS/N$. To further ensure load balance, a uniform data processing capacity $C$ is set for each expert in many MoE training systems. DeepSpeed-MoE [21] decomposes the expert capacity $C$ evenly into a local capacity for each process and prunes the exchange size by the local capacity directly: $c_{ie} \le C_{ie} = C/P$. FastMoE [8] efficiently uses the capacity with two extra all-to-all communications for the exchange sizes: $\sum_{i=0}^{P-1} c_{ie} \le C$.

### 3.2 Network Topology

Figure 2: Typical network topologies on modern GPU clusters. (a) A homogeneous node connected with NVSwitch. (b) A typical ring topology connected with NVLinks [19]. (c) A 2-layer symmetric tree topology of [2,2]. (d) A 3-layer asymmetrical tree topology of [[2,2],[2]].

The network environments for distributed training on modern GPU clusters are very complicated. As shown in Figure 2, there are four typical kinds of network topologies: homogeneous, ring, symmetric tree, and asymmetrical tree. Homogeneous and ring structures are frequently used topologies for the intra-node environment. For a homogeneous structure, devices are always connected by the network with the same bandwidth, e.g., NVSwitch [20]. As for the ring topology, it is usually symmetrical. The bandwidths between adjacent devices may differ due to different numbers of connected links. The communication of non-adjacent devices has to hop through intermediate devices, and the slowest link may become the bottleneck. A hierarchical tree is a common topology abstraction for multi-node distributed environments. Compared with the intra-node environment, inter-node links suffer from limited and volatile bandwidth (4-25 GB/s) and potentially degrade the communication performance. For convenience, we denote a tree topology as a nested list where the elements within the same sub-list are connected by the same switch. For a symmetric tree structure, we use $L_i$ to represent the number of child nodes of each node in layer $i$. An asymmetrical tree structure, which can be very irregular, is the most common topology for distributed training.

### 3.3 Motivation

The existing load-balanced data distribution of Equation 1 is unable to fully exploit the complicated distributed environments. To demonstrate this, we set up an experiment on a [2, 2] symmetric tree topology cluster, where the devices are named 0, 1 (same node) and $\hat{0}$, $\hat{1}$ (same node), respectively. We dispatch 128MB of data (used here as a demonstration; it is the upper-bound transfer size of most typical MoE training tasks) with two dispatch patterns: (1) even dispatch and (2) uneven dispatch, in which a greater proportion of data is exchanged with a neighbor device. Table 1 shows the detailed dispatch proportions and the corresponding performance. Compared with even dispatch, uneven dispatch improves the overall communication performance by roughly 30%. This is mainly because the communication stress on inter-node links is relieved by transferring a smaller proportion of data. With the variety of distributed network topologies and their continuous development, the existing static even dispatch pattern is not effective enough. There is an urgent need for a routing strategy that can dynamically adapt to the underlying network environments.
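
For reference, the following is a minimal sketch of the standard load-balance auxiliary loss of Equation 1 on one process; it is PyTorch-style and illustrative rather than the authors' code. Note that it only balances the per-expert counts $c_{ie}$ and is agnostic to the network topology, which is exactly the gap addressed in Section 4.

```python
import torch

def load_balance_aux_loss(gate_scores, topk_experts, num_experts):
    """Sketch of the load-balance auxiliary loss of Equation 1 for one process i.

    gate_scores:  [S, N] softmax gate probabilities G(x) of the S local tokens
    topk_experts: [S, k] expert indices selected for each token
    """
    S = gate_scores.shape[0]
    # m_ie = sum_x G_e(x) / S: mean gate probability assigned to expert e.
    m_ie = gate_scores.sum(dim=0) / S
    # c_ie: tokens dispatched to expert e (a count, hence non-differentiable).
    c_ie = torch.bincount(topk_experts.reshape(-1), minlength=num_experts).float()
    # l_aux = sum_e m_ie * (c_ie / S); gradients flow through m_ie and push
    # every c_ie toward the balanced target k * S / N.
    return torch.sum(m_ie * (c_ie / S))
```
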

Table 1: The communication performance of the [[0, 1], [$\hat{0}$, $\hat{1}$]] network topology.

| Metric | Dispatch Pattern | 0→0 | 0→1 | 0→$\hat{0}$ | 0→$\hat{1}$ | All |
|---|---|---|---|---|---|---|
| Ratio of data | Even | 1/4 | 1/4 | 1/4 | 1/4 | 1 |
| Ratio of data | Uneven | 1/4 | 1/2 | 1/8 | 1/8 | 1 |
| Time (µs) | Even | 144 | 758 | 5609 | 5618 | 14019 |
| Time (µs) | Uneven | 144 | 1492 | 2835 | 2861 | 10765 |

## 4 Topology Aware Routing Strategy

In this section, we first abstract the data dispatch problem into an optimization objective based on a communication model. Through some analysis, we obtain the target dispatch pattern under different topologies, which can eliminate the communication bottleneck during MoE training. Guided by the target pattern, we design a topology-aware routing loss to adaptively adjust the dispatch volume.

### 4.1 Communication Model

We characterize the communication cost using the well-known α-β cost model, where α and β represent the fixed communication latency and the inverse bandwidth (i.e., the transfer cost per word), respectively. For convenience, $\alpha_{ij}$ and $\beta_{ij}$ denote the latency and inverse bandwidth between the $i$-th and $j$-th GPU. During the training of MoE, the amount of data transferred from GPU $i$ to expert $E_e$ on GPU $j$ is $c_{ie} \cdot d \cdot b$, where $d \cdot b$ is the transferred size of each token. To reduce the overheads of multiple send-receives between two GPUs, we merge the multiple small data chunks into a larger data chunk for delivery. The total amount of data delivered from GPU $i$ to GPU $j$ is $\sum_{e=E \cdot j}^{E \cdot (j+1)-1} c_{ie} \cdot d \cdot b$. A global data exchange consists of $P \times P$ peer-to-peer data deliveries, among which the slowest delivery, as a lower bound, constrains the final communication performance. Most of the global exchange implementations [21, 18, 24] are designed to approach this lower bound. Therefore, our ultimate objective is to minimize the slowest send-receive communication cost:

$$\min_{c} \max_{i,j} \Big( \alpha_{ij} + \beta_{ij} \sum_{e=E \cdot j}^{E \cdot (j+1)-1} c_{ie} \cdot d \cdot b \Big) \tag{2}$$

For efficient MoE training, two constraint conditions should be satisfied. First, for any process $i$, the sent data size, i.e., $kS$, should be equal to the sum of the data sizes received by all experts:

$$kS = \sum_{e=0}^{N-1} c_{ie}, \qquad \forall\, i \in \{0, \ldots, P-1\}. \tag{3}$$

Second, to make full use of all the experts and pursue better model accuracy, the data chunks dispatched to each expert should be balanced:

$$\frac{kSP}{N} = \sum_{i=0}^{P-1} c_{ie}, \qquad \forall\, e \in \{0, \ldots, N-1\}. \tag{4}$$

### 4.2 Model Optimization

To obtain the target dispatch pattern, we need to solve the optimization problem in Equation 2. Nevertheless, Equation 2 contains plentiful parameters of a specific network, which complicates the solving process. Meanwhile, in some irregular topologies, some devices may suffer from quite limited bandwidth when communicating with other devices. According to Equation 2, the experts assigned to these devices may receive quite small dispatch chunks from the other processes, which may leave them without sufficient data exchanges and lead to an expert-isolation phenomenon. To tackle these problems, we simplify the optimization problem to accelerate the solving process and smooth the values of $\alpha_{ij}$, $\beta_{ij}$ to obtain an approximate result that prevents expert isolation. Since each send-receive communication shares the same $\alpha$, $\beta$ in a homogeneous network, the target dispatch chunk size $\hat{c}_{ie}$ is equal to the load-balanced chunk size $kS/N$. In the following part, we focus on the analysis of the optimization problem under heterogeneous topologies.
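
As a concrete reference for Equation 2, the sketch below evaluates the slowest-link cost of a candidate dispatch matrix under the α-β model; the NumPy formulation and the latency/bandwidth numbers are assumptions for illustration, not values taken from the paper.

```python
import numpy as np

def slowest_link_cost(c, alpha, beta, E, d, b):
    """Cost of one global exchange as in Equation 2: the slowest of the
    P x P peer-to-peer deliveries dominates.

    c:     [P, N] dispatch matrix, c[i, e] tokens sent from process i to expert e
    alpha: [P, P] per-link latency (seconds)
    beta:  [P, P] per-link inverse bandwidth (seconds per byte)
    E:     experts per device (N = P * E), d: hidden size, b: bytes per element
    """
    P = c.shape[0]
    cost = 0.0
    for i in range(P):
        for j in range(P):
            # Merged chunk from GPU i to GPU j: tokens for all experts hosted on j.
            volume = c[i, E * j:E * (j + 1)].sum() * d * b
            cost = max(cost, alpha[i, j] + beta[i, j] * volume)
    return cost

# Toy usage on a [2, 2] tree: GPUs 0, 1 share a node; GPUs 2, 3 share another.
P, E, d, b = 4, 1, 1024, 2
same_node = np.arange(P)[:, None] // 2 == np.arange(P)[None, :] // 2
alpha = np.full((P, P), 1e-5)
beta = np.where(same_node, 1 / 100e9, 1 / 10e9)   # assumed 100 GB/s vs. 10 GB/s
even = np.full((P, P * E), 256)                   # even dispatch: 256 tokens per expert
uneven = np.array([[256, 512, 128, 128],          # uneven ratios of Table 1,
                   [512, 256, 128, 128],          # mirrored so that both
                   [128, 128, 256, 512],          # constraints (3) and (4)
                   [128, 128, 512, 256]])         # still hold
print(slowest_link_cost(even, alpha, beta, E, d, b),
      slowest_link_cost(uneven, alpha, beta, E, d, b))
```

Under these assumed numbers, the uneven pattern yields a lower worst-link cost than the even one, consistent with the trend observed in Table 1.
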

On an $n$-layer symmetric tree topology, for any device $i$, all the devices can be split into $n$ sub-groups $G_i = \{G_{it} \mid t < n\}$, where $G_{it}$ is the group of devices whose shortest path from device $i$ crosses $t$ switches. Multiple hops in cross-switch communication suffer from extra overheads, and the most limited bandwidth along the hops dominates the final bandwidth. Therefore, we can simplify the original $\alpha_{ij}$, $\beta_{ij}$ into $n$ values: αl = i