# Deep Transformers with Latent Depth

Xian Li¹, Asa Cooper Stickland², Yuqing Tang¹, and Xiang Kong¹
¹Facebook AI {xianl, yuqtang, xiangk}@fb.com
²University of Edinburgh {a.cooper.stickland}@ed.ac.uk

## Abstract

The Transformer model has achieved state-of-the-art performance in many sequence modeling tasks. However, how to leverage model capacity with large or variable depths is still an open challenge. We present a probabilistic framework that automatically learns which layer(s) to use by learning the posterior distribution over layer selection. As an extension of this framework, we propose a novel method to train one shared Transformer network for multilingual machine translation, with a different layer-selection posterior for each language pair. The proposed method alleviates the vanishing-gradient issue and enables stable training of deep Transformers (e.g. 100 layers). We evaluate on WMT English-German machine translation and masked language modeling tasks, where our method outperforms existing approaches for training deeper Transformers. Experiments on multilingual machine translation demonstrate that this approach can effectively leverage increased model capacity and bring universal improvements for both many-to-one and one-to-many translation with diverse language pairs.

## 1 Introduction

The Transformer model has achieved state-of-the-art performance on various natural language processing (NLP) tasks, originally in neural machine translation [30], and more recently in massive multilingual machine translation [3, 37], crosslingual pretraining [8, 17], and many other tasks. There has been growing interest in increasing the model capacity of Transformers, which has demonstrated improved performance on various sequence modeling and generation tasks [35, 24, 1]. Training Transformers with increased or variable depth is still an open problem. Depending on the position of the layer-norm sub-layer, backpropagating gradients through many layers may suffer from vanishing gradients [19, 31, 5]. In addition, performance does not always improve by simply stacking up layers [6, 31]. When used for multilingual or multi-task pretraining, such as multilingual machine translation or crosslingual language modeling, the simplicity of using one shared Transformer network for all languages (and tasks) is appealing. However, how to share model capacity among languages (and tasks) so as to facilitate positive transfer while mitigating negative transfer has not been well explored.

In this work, we present a novel approach to training deep Transformers, in which the layers to be used (and shared) and the effective depth are not static, but learnt based on the underlying task. Concretely, we model the decision to use each layer as a latent variable whose distribution is learnt jointly with the rest of the Transformer parameters. At training time we approximate the discrete choice with a Gumbel-Softmax [14] distribution. The soft weights sampled from this distribution also act as gradient normalization for each layer, which allows us to train very deep Transformers (up to 100 layers) without using regular layer-normalization layers. At inference time, the learnt discrete choice can be used to directly derive a compact model by pruning layers with low probability, but we
have the choice of leaving the learnt layer-selection probabilities as soft weights.

[Figure 1: We learn the posterior distribution $q_\phi$ to "select" or "skip" each layer in Transformers. In the multilingual setting, each language learns its own "view" of the latent layers in a shared Transformer. The figure contrasts layer-selection samples over self-attention and encoder-attention layers during training with inference using the learnt sub-networks, where one shared model is trained for languages 1 through N.]

By evaluating on WMT'16 English-German machine translation (MT) and masked language modeling (MLM) tasks (similar to the XLM-R model [8]), we show that we can successfully train deeper Transformers (a 64-layer encoder/decoder model for MT, and a 96-layer encoder for MLM) and outperform existing approaches in terms of quality and training stability. We show that this approach can be extended to learn task-specific sub-networks by learning different layer-selection probabilities for each language pair in multilingual machine translation. This result contributes to the growing interest in learning efficient architectures for multi-task and transfer learning in natural language understanding and generation [28, 12, 7].

The main contributions of this paper are as follows:

- We present a probabilistic framework to learn which layers to select in the Transformer architecture.
- Based on this framework, we propose a novel method to train one shared Transformer network for multilingual machine translation with different layer-selection probabilities for each language pair.
- The proposed method alleviates the vanishing-gradient issue and enables stable training of deep Transformers.
- We conduct experiments on several tasks to evaluate the proposed approach: WMT'16 English-German machine translation, masked language modeling, and multilingual many-to-one as well as one-to-many machine translation with diverse languages.

## 2 Background

In this section, we briefly describe the standard Transformer layer architecture [30]. For the hidden state $x_l$ of a single token at layer $l$, each Transformer layer is a function $F_l(x_l)$ that transforms its input $x_l$ by sequentially applying several sub-layers. Each sub-layer takes the form

$$x_{l+1} = x_l + \mathrm{SubLayer}_l(\mathrm{Norm}(x_l)), \tag{1}$$

where $\mathrm{SubLayer}_l(\cdot)$ is either a self-attention module, an encoder-attention module (for a Transformer decoder in a sequence-to-sequence model), or a feed-forward network (FFN) module, and $\mathrm{Norm}(\cdot)$ is a normalization layer, usually layer-norm [4]. This is the pre-norm setting which is now widely used [19], as opposed to post-norm, in which case $\mathrm{Norm}(\cdot)$ would be applied after the residual connection: $x_{l+1} = \mathrm{Norm}(x_l + \mathrm{SubLayer}_l(x_l))$.

### 2.1 Latent Layer Selection

For each Transformer layer $l$, we treat the selection of all sub-layers in the non-residual block $F_l(x)$ as a latent variable $z_l$ drawn from a parameterizable distribution $p(z)$:

$$x_{l+1} = x_l + z_l \cdot F_l(x_l), \qquad z_l \sim p(z; l) \tag{2}$$

The standard Transformer [30] is a special case with $z_l = 1$ for $l = 0, \ldots, L-1$, where $L$ is the depth of the network, i.e. the total number of layers. For a sequence generation task $p(y \mid x)$ parameterized by a Transformer network with the remaining standard parameters $\Theta$, we assume the following generative process:

$$y \sim p(y \mid x; \Theta, z), \qquad p(y \mid x) = \int p(y \mid x; \Theta, z)\, p(\Theta, z)\, d\Theta\, dz \tag{3}$$

**Parameterization and inference of $z$.** We model $z_l$ as a discrete latent variable from a Bernoulli distribution, $z_l \sim B(\pi; l)$, where $\pi \in [0, 1]$ indicates whether to select or skip the non-residual block $F_l(x)$ in layer $l$; samples for one layer are independent of those for other layers.
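As a concrete illustration of Eq. 1 and Eq. 2, below is a minimal PyTorch-style sketch of a pre-norm residual block whose non-residual part is gated by a per-layer selection weight. The class name, the single-argument sub-layer interface, and the scalar gate are illustrative assumptions, not the paper's released code.

```python
# A minimal sketch of Eq. 2: a pre-norm residual block (Eq. 1) whose
# non-residual part F_l is scaled by a per-layer selection weight z_l.
# z_l = 1 recovers the standard layer; z_l = 0 skips the layer entirely.
import torch
import torch.nn as nn


class LatentResidualBlock(nn.Module):
    def __init__(self, sublayer: nn.Module, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)  # pre-norm, as in Eq. 1
        # In practice the sub-layer is self-attention, encoder-attention, or an
        # FFN; a single-tensor interface is a simplification for this sketch.
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor, z_l: torch.Tensor) -> torch.Tensor:
        # x:   (batch, seq_len, d_model) hidden states
        # z_l: scalar weight sampled from q_phi(z) during training,
        #      or a hard 0/1 decision at inference time
        return x + z_l * self.sublayer(self.norm(x))
```

With $z_l = 1$ this reduces to the standard pre-norm layer of Eq. 1; at inference time, layers whose posterior favors "skip" ($z_l \approx 0$) can be pruned to obtain a shallower model.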
This modeling choice allows us to prune layers, which reduces inference cost and may regularize training. Marginalizing over $z$ becomes intractable as $l$ grows large. Therefore, we use variational inference as a more general optimization solution. Specifically, we instead maximize the evidence lower bound (ELBO) of Eq. 3:

$$\log p(y \mid x) \geq \mathbb{E}_{q_\phi(z)}\big[\log p_\Theta(y \mid x, z)\big] - D_{\mathrm{KL}}\big(q_\phi(z) \,\|\, p(z)\big) \tag{4}$$

We point out that although we could also treat the rest of the network parameters $\Theta$ as latent variables and model the joint distribution $p(\Theta, z)$, which could be optimized using Coupled Variational Bayes (CVB) and optimization embedding as demonstrated in [27] for neural architecture search, in practice we found a simpler optimization procedure (Algorithm 2) sufficient to learn both $\Theta$ and $z$ jointly from scratch.

We use the Gumbel-Softmax reparameterization [14] to sample from the approximate posterior $q_\phi(z)$, which makes the model end-to-end differentiable while learning (approximately) discrete policies without resorting to policy gradients. To allow both "soft weighting" and "hard selection" of layers, each of which has the appealing property of achieving model pruning while training with larger model capacity, we generate soft samples of $z$ during training and draw hard samples for pruning at inference time if $q_\phi(z)$ becomes (close to) discrete. We directly learn the logits parameter $\alpha_l$ for each layer $l$ (a code sketch of this sampling step is given at the end of this section):

$$z_l^{(i)}(\alpha_l) = \frac{\exp\big((\alpha_l^{(i)} + \epsilon^{(i)})/\tau\big)}{\sum_{i \in \{0,1\}} \exp\big((\alpha_l^{(i)} + \epsilon^{(i)})/\tau\big)}, \qquad \epsilon \sim G(0, 1) \tag{5}$$

where $G(0, 1)$ is the Gumbel distribution, and $\tau$ is a temperature hyperparameter which increases the discreteness of the samples as $\tau \to 0$. For $p(z)$ we can use the conjugate prior $\mathrm{Beta}(a, b)$, which allows us to express different preferences over $z$: for example, $a = b = 1$ gives a uniform prior, $a > b$ biases towards selecting layers, and $a < b$ favors skipping layers.

**Gradient scaling.** Next we analyze the impact of latent layers on gradient backpropagation during training in the pre-norm setting. In Eq. 6, we can see that given the forward-pass loss $L$, the gradient accumulated from higher layers $m_l$
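To make the sampling in Eq. 5 concrete, here is a minimal sketch of drawing a per-layer selection weight from its logits with the Gumbel-Softmax, assuming PyTorch; the function name, the two-logit {select, skip} layout, and the straight-through option are illustrative choices, not the paper's released implementation.

```python
# A minimal sketch of Eq. 5: sample a layer-selection weight z_l from
# per-layer logits alpha_l = (logit_select, logit_skip) via Gumbel-Softmax.
import torch


def sample_layer_weight(alpha_l: torch.Tensor, tau: float = 1.0,
                        hard: bool = False) -> torch.Tensor:
    # Gumbel(0, 1) noise: -log(-log(U)) with U ~ Uniform(0, 1)
    u = torch.rand_like(alpha_l).clamp(1e-10, 1.0 - 1e-10)
    gumbel = -torch.log(-torch.log(u))
    # Temperature-controlled softmax over the two choices (Eq. 5);
    # smaller tau yields more nearly discrete samples.
    probs = torch.softmax((alpha_l + gumbel) / tau, dim=-1)
    z_soft = probs[0]  # weight on the "select" option
    if hard:
        # Straight-through: hard 0/1 value in the forward pass,
        # gradient flows through the soft sample in the backward pass.
        z_hard = (z_soft > 0.5).float()
        return z_hard + z_soft - z_soft.detach()
    return z_soft
```

During training one would use soft samples (`hard=False`); once $q_\phi(z)$ is close to discrete, hard samples can be used to prune skipped layers at inference time. PyTorch's built-in `torch.nn.functional.gumbel_softmax` implements the same reparameterization.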